Network Architecture - thebaulab/onramp GitHub Wiki
Network Architecture
This page documents the network infrastructure for the baulab.us cluster.
Understanding the Network
Why Can't I Just SSH Directly to the GPU Machines?
You might try ssh karakuri from your laptop and have it fail, typically with a "Could not resolve hostname" error or a timeout. That's because the GPU workstations are on a private network that isn't directly reachable from the internet.
Here's why this is actually a good thing:
- Security: 25 machines directly on the internet = 25 potential attack targets
- IP addresses: Khoury doesn't have 25 public IPs to spare
- Management: It's easier to control access through one gateway
The Solution: Jump Host
Khoury provides a "jump host" - a machine that IS on the public internet and can relay your SSH connection to internal machines:
┌─────────────────┐      ┌──────────────────────────────┐      ┌─────────────────┐
│  Your Laptop    │      │  login.khoury.northeastern   │      │    karakuri     │
│  (internet)     │─────►│  (jump host)                 │─────►│   (internal)    │
│                 │ SSH  │  Has public IP               │ SSH  │   10.200.x.x    │
└─────────────────┘      └──────────────────────────────┘      └─────────────────┘
This is why your ~/.ssh/config has:
Host karakuri
ProxyJump login.khoury.northeastern.edu
The ProxyJump directive tells SSH: "First connect to login.khoury, then from there connect to karakuri."
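For a one-off connection without editing ~/.ssh/config, OpenSSH's -J option has the same effect as ProxyJump. The sketch below only builds and prints the command line. The username `yourname` is a placeholder, and it assumes the same username is valid on both the jump host and the target, which may not hold for your account.

```shell
# Build the one-off equivalent of the ProxyJump config entry.
# Assumption: the same username works on the jump host and on the target.
JUMP_HOST="login.khoury.northeastern.edu"

jump_command() {
    # $1 = username, $2 = internal hostname (e.g. karakuri)
    printf 'ssh -J %s@%s %s@%s\n' "$1" "$JUMP_HOST" "$1" "$2"
}

# Print the command for review; copy and run it yourself.
jump_command yourname karakuri
```

The config-file form is still preferable for daily use, because tools that call SSH underneath (scp, rsync, git) pick it up automatically.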
Network Diagram
                         INTERNET
                             │
           ┌─────────────────┴─────────────────┐
           │   login.khoury.northeastern.edu   │
           │            (Jump Host)            │
           │   - SSH access only               │
           │   - Managed by Khoury IT          │
           └─────────────────┬─────────────────┘
                             │ SSH
═════════════════════════════╪═════════════════════════════
          Khoury Internal Network (10.200.0.0/16)
═════════════════════════════╪═════════════════════════════
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│   baunames    │    │   baushare    │    │   bauserver   │
│ 10.200.205.143│    │ 10.200.205.57 │    │ 10.200.205.142│
├───────────────┤    ├───────────────┤    ├───────────────┤
│ • LDAP Server │    │ • NFS Server  │    │ • Apache Web  │
│ • Ansible Hub │    │ • 84TB total  │    │ • baulab.us   │
│ • phpLDAPadmin│    │ • /share/*    │    │ • shell.baulab│
└───────┬───────┘    └───────┬───────┘    └───────────────┘
        │                    │
        │ LDAP Auth          │ NFS
        ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│             GPU Workstations (~25 machines)             │
│      tokyo, osaka, karakuri, nagoya, hakone, etc.       │
└─────────────────────────────────────────────────────────┘
What Each Server Does
| Server | Role | Why It Matters |
|---|---|---|
| baunames | LDAP + Ansible | If this goes down, nobody can log in (unless cached) |
| baushare | NFS storage | If this goes down, home directories are inaccessible |
| bauserver | Web server | Hosts baulab.us website, less critical for daily work |
Infrastructure Servers
Current Servers
| Server | IP | Ubuntu | Role | Notes |
|---|---|---|---|---|
| baunames | 10.200.205.143 | 20.04 | LDAP, Ansible | Primary auth server |
| baunames24 | 10.200.205.100 | 24.04 | LDAP (new) | Migration target |
| bauserver | 10.200.205.142 | 20.04 | Web server | baulab.us |
| bauserver24 | 10.200.205.111 | 24.04 | Web (new) | Migration target |
| baushare | 10.200.205.57 | Synology | NAS | 84TB, 98% full! |
| baunas24 | 10.200.204.169 | Synology | NAS | 98TB, 56% used |
Why Two of Each Server?
You'll notice pairs like baunames/baunames24 and bauserver/bauserver24. This is for migration:
- The "24" servers run Ubuntu 24.04 (newer)
- We're gradually migrating from 20.04 to 24.04
- During migration, both run simultaneously
GPU Workstations
These are the machines you'll actually use for running experiments.
Local Machines (at Khoury 177)
| Hostname | IP | GPU | Primary User | Notes |
|---|---|---|---|---|
| tokyo | 10.200.205.97 | A6000 | Eric | |
| karasuno | 10.200.205.139 | A6000 | Koyena | |
| umibozu | 10.200.205.169 | A6000 | Arnab | |
| hawaii | 10.200.205.185 | A6000 | Nikhil | |
| karakuri | 10.200.205.17 | A6000 | Shared | Good for quick jobs |
| kyoto | 10.200.205.136 | 2x A6000 | Masters/Shared | Two GPUs! |
| saitama | 10.200.205.49 | 2x A6000 | Shared | Two GPUs! |
| hokkaido | 10.200.205.103 | A6000 | Shared | |
| andromeda | 10.200.205.162 | A6000 | Alex | |
| ei | 129.10.112.50 | A6000 | David A | Different subnet |
| kobe | 10.200.205.179 | A6000 | Rohit | |
| macondo | 10.200.205.25 | A6000 | Sheridan | |
| naoshima | 10.200.204.124 | A6000 | Hendrik | |
| bippu | 10.200.205.70 | A6000 | Visiting | |
| osaka | 10.200.205.9 | A6000 | Michael | |
| hamada | 10.200.205.141 | A6000 | NDIF team | |
| kumamoto | 10.200.205.184 | A6000 | Adam | |
| fukuyama | 10.200.205.56 | A6000 | NDIF team | |
| sendai | 10.200.205.47 | A6000 | Shared | |
| cancun | 10.200.206.177 | A6000 | Can | |
| kameoka | 10.200.206.178 | A6000 | Shared | |
| kawasaki | 10.200.206.66 | A6000 | Andy | |
| sapporo | 10.200.205.29 | A6000 | - | |
"Primary User" Meaning
Machines with a primary user are informally reserved for that person's long-running jobs. Etiquette:
- Check with them before using
- Don't kill their processes
- Shared machines are fair game
Remote GPU Servers (MGHPCC)
These machines are at the Massachusetts Green High Performance Computing Center in Holyoke, MA - about 100 miles from Boston.
| Hostname | IP | GPU | Notes |
|---|---|---|---|
| nagoya | 10.201.22.179 | 8x A100 | Very powerful! 80GB each |
| hakone | 10.201.16.108 | 8x A100 | Very powerful! 80GB each |
Important for these machines:
- Higher network latency to NFS storage
- Use local /disk for working files - see Using-Nagoya
- 8 GPUs means you can run distributed training
Storage Architecture
NAS Servers
| Server | Capacity | Used | What's Stored |
|---|---|---|---|
| baushare | 84TB | 98% | /share/u, /share/projects, /share/backup |
| baunas24 | 98TB | 56% | /share/datasets |
WARNING: baushare is almost full! Use baunas24 (/share/datasets) for large files.
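Before writing large files, it can help to check how full the target share is. The sketch below reads the usage percentage from df for whatever filesystem backs a path. The function names and the default 90% threshold are illustrative choices, not lab policy.

```shell
# Print the used-space percentage (digits only) of the filesystem holding a path.
usage_pct() {
    # -P forces the portable one-line-per-filesystem format; column 5 is the capacity.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem behind a path is at or above a threshold.
# $1 = path, $2 = threshold in percent (default 90, a hypothetical value).
warn_if_full() {
    local pct
    pct=$(usage_pct "$1")
    case "$pct" in
        ''|*[!0-9]*) echo "could not read usage for $1" >&2; return 1 ;;
    esac
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is ${pct}% full"
    else
        echo "OK: $1 is ${pct}% full"
    fi
}

# Example on a cluster machine: warn_if_full /share/u
```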
Mount Points
Every machine in the cluster mounts these directories via NFS (actually via autofs - see NFS-Storage-Administration):
| Path | Server | Purpose |
|---|---|---|
| /share/u/ | baushare | User home directories |
| /share/projects/ | baushare | Shared project folders |
| /share/backup/ | baushare | System backups |
| /share/datasets/ | baunas24 | Large datasets |
Local Storage
Each GPU machine also has local disk at /disk/:
- Much faster than NFS for I/O-heavy work
- Not shared - files are only on that machine
- Use for: conda environments, working copies of datasets, scratch files
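One way to use local disk for a working copy is to stage a dataset from NFS once and reuse it on later runs. This is a minimal sketch: the function name and the example paths (/share/datasets/mydata, /disk/$USER/mydata) are hypothetical, and it copies into a temporary ".partial" directory first so an interrupted copy is not mistaken for a complete one.

```shell
# Copy a dataset from NFS to local disk once, then reuse the local copy.
# $1 = source directory (on NFS), $2 = destination directory (on local disk)
stage_local() {
    local src=$1 dest=$2 tmp="$2.partial"
    if [ -d "$dest" ]; then
        echo "already staged: $dest"
        return 0
    fi
    # Start from a clean temporary directory, then rename it when the copy succeeds.
    rm -rf "$tmp"
    mkdir -p "$tmp" &&
        cp -a "$src"/. "$tmp"/ &&
        mv "$tmp" "$dest" &&
        echo "staged: $dest"
}

# Example (hypothetical paths):
#   stage_local /share/datasets/mydata "/disk/$USER/mydata"
```

Because local disk is not shared, the staged copy exists only on the machine where the function ran.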
How Authentication Works
When you SSH into any cluster machine, here's the auth flow:
┌────────────────────────────────────────────────────────────┐
│ 1. You run: ssh karakuri                                   │
└────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│ 2. karakuri asks: "Who is this user?"                      │
│                                                            │
│    SSSD (on karakuri) queries LDAP server (baunames)       │
│    Returns: uid, gid, home directory                       │
└────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│ 3. karakuri asks: "Is the password/key correct?"           │
│                                                            │
│    For SSH keys: checks ~/.ssh/authorized_keys (on NFS)    │
│    For passwords: SSSD verifies against LDAP               │
└────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│ 4. If authenticated:                                       │
│    - Create session with your uid/gid                      │
│    - cd to home directory /share/u/yourname                │
│    - (autofs mounts NFS if needed)                         │
│    - If first login, pam_mkhomedir creates home directory  │
└────────────────────────────────────────────────────────────┘
Key insight: Your user info comes from LDAP, but your actual files come from NFS. Both need to be working for a normal login.
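The two dependencies can be checked separately from any cluster machine. The sketch below is an assumption-laden illustration: it uses getent for the identity lookup (which goes through the system's name service, assumed here to be backed by SSSD and LDAP) and a directory listing for the home directory. The function name is made up for this example.

```shell
# Report whether the two login prerequisites look healthy for a user.
# $1 = username, $2 = home base directory (defaults to /share/u)
login_prereqs() {
    local user=$1 home_base=${2:-/share/u}

    # Identity lookup. Note: this can succeed from the SSSD cache even if LDAP is down,
    # and it also matches purely local accounts.
    if getent passwd "$user" >/dev/null 2>&1; then
        echo "identity: ok"
    else
        echo "identity: missing (LDAP/SSSD problem, or no such user)"
    fi

    # Accessing the directory gives autofs a chance to mount it on demand.
    if ls "$home_base/$user" >/dev/null 2>&1; then
        echo "home: ok"
    else
        echo "home: missing (NFS/autofs problem, or home not created yet)"
    fi
}

# Example on a cluster machine: login_prereqs yourname
```

If identity is ok but home is missing, the problem is more likely on the NFS side; the reverse points at LDAP or SSSD.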
Web Services
baulab.us URLs
| URL | What's There |
|---|---|
| https://baulab.us/ | Lab main website |
| https://baulab.us/u/USERNAME/ | Personal web directories |
| https://baulab.us/p/PROJECT/ | Project web directories |
| https://shell.baulab.us/ | Web-based SSH terminal |
How User Web Directories Work
Files in /share/u/yourname/public_html/ are served at https://baulab.us/u/yourname/.
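As an illustration of that path-to-URL mapping, the sketch below copies a file into public_html and prints the URL it should appear at. The function names are hypothetical. It also assumes the web server needs the file to be world-readable and that public_html itself is already traversable by the server; neither is confirmed on this page.

```shell
# Map a path under public_html to the URL where it would be served.
# $1 = username, $2 = path relative to public_html
public_url() {
    printf 'https://baulab.us/u/%s/%s\n' "$1" "$2"
}

# Copy a file into a user's public_html and print its URL.
# $1 = file to publish, $2 = username, $3 = home base (defaults to /share/u)
publish() {
    local file=$1 user=$2 base=${3:-/share/u}
    local web_dir="$base/$user/public_html"
    local name
    name=$(basename "$file")
    # chmod a+r reflects the assumption that the web server reads files as another user.
    mkdir -p "$web_dir" &&
        cp "$file" "$web_dir"/ &&
        chmod a+r "$web_dir/$name" &&
        public_url "$user" "$name"
}

# Example: publish results.html yourname
```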
This is useful for:
- Hosting paper demos
- Sharing results with collaborators
- Visualizations
Comparison: baulab.us vs baukit.org
David runs a second cluster at home. The same user accounts exist on both:
| Aspect | baulab.us (Lab) | baukit.org (Home) |
|---|---|---|
| Location | Khoury 177 | David's house |
| LDAP Server | baunames | names |
| Domain | dc=baulab,dc=us | dc=thevisible,dc=net |
| Jump Host | login.khoury | baukit.org |
| phpLDAPadmin | Port 8877 | Port 8876 |
| Purpose | Main GPU cluster | Development, backup |
When adding users, add them to both clusters so they can access both.
Common Network Tasks
Finding a Machine's IP
# On the machine itself
ip addr | grep "inet 10"
# From anywhere
host karakuri.khoury.northeastern.edu
Checking if a Machine is Up
ping karakuri
Testing SSH Connectivity
# Verbose mode shows connection steps
ssh -v karakuri
Checking Network Services
# See what ports are listening
sudo ss -tlnp
# Check if LDAP is reachable
ldapsearch -x -H ldap://baunames -b "dc=baulab,dc=us" "(uid=yourname)"
# Check if NFS is reachable
showmount -e baushare
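If ldapsearch or showmount is not installed on the machine you are on, a plain TCP connection test is a rough substitute. The sketch below assumes LDAP listens on its default port 389 and NFS on 2049; the function names are made up here, and a reachable port only shows that something is listening, not that the service is healthy.

```shell
# Return 0 if a TCP port accepts a connection within 3 seconds.
# $1 = host, $2 = port. /dev/tcp is a bash feature, so bash is invoked explicitly.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Check the two services that logins depend on.
# Assumption: default ports (389 for LDAP, 2049 for NFS).
check_services() {
    local target host port
    for target in baunames:389 baushare:2049; do
        host=${target%%:*}
        port=${target##*:}
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# Example on a cluster machine: check_services
```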
Troubleshooting
Can't SSH to a Machine
- Check if it's up: ping machinename
- Check your SSH config: Is ProxyJump set correctly?
- Try verbose mode: ssh -v machinename to see where it fails
- Check the jump host: Can you SSH to login.khoury.northeastern.edu?
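The jump-host and target checks can be run in order so that the first failure tells you which hop is broken. This is a sketch under assumptions: it relies on key-based login (BatchMode makes ssh fail instead of prompting for a password), and the function name is hypothetical.

```shell
# Test the jump host first, then the target through it; stop at the first failure.
# $1 = target machine, $2 = jump host (defaults to the Khoury login host)
diagnose_ssh() {
    local target=$1 jump=${2:-login.khoury.northeastern.edu}

    # Step 1: can we log in to the jump host at all?
    if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$jump" true; then
        echo "FAIL: cannot reach jump host $jump"
        return 1
    fi

    # Step 2: can we reach the target through the jump host?
    if ! ssh -o BatchMode=yes -o ConnectTimeout=5 -J "$jump" "$target" true; then
        echo "FAIL: jump host works, but $target is unreachable or rejects the login"
        return 1
    fi

    echo "OK: $target reachable through $jump"
}

# Example from your laptop: diagnose_ssh karakuri
```

When a step fails, rerunning the same ssh command with -v shows where in the connection it stops.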
Authentication Failures
- LDAP down?: ldapsearch -x -H ldap://baunames -b "dc=baulab,dc=us" "(uid=yourname)"
- SSSD issue?: On the target machine: sudo systemctl status sssd
- Cached credentials: If LDAP was recently fixed, clear cache: sudo sss_cache -E
NFS Issues
- NAS down?: ping baushare
- Autofs issue?: sudo systemctl restart autofs
- See NFS-Storage-Administration for more
Related Pages
- LDAP-Server-Administration - LDAP server details
- NFS-Storage-Administration - Storage management
- Ansible-Playbook-Reference - Configuration management
- Server-Migration-Guide - Hardware migration
- Accessing-Clusters - User guide to connecting