Network Architecture - thebaulab/onramp GitHub Wiki

Network Architecture

This page documents the network infrastructure for the baulab.us cluster.

Understanding the Network

Why Can't I Just SSH Directly to the GPU Machines?

You might try ssh karakuri from your laptop and see it fail with an error such as "connection refused", a timeout, or an unresolved hostname. That's because the GPU workstations are on a private network that isn't directly reachable from the internet.

Here's why this is actually a good thing:

  • Security: 25 machines directly on the internet = 25 potential attack targets
  • IP addresses: Khoury doesn't have 25 public IPs to spare
  • Management: It's easier to control access through one gateway

The Solution: Jump Host

Khoury provides a "jump host" - a machine that IS on the public internet and can relay your SSH connection to internal machines:

┌─────────────────┐     ┌──────────────────────────────┐     ┌─────────────────┐
│   Your Laptop   │     │   login.khoury.northeastern  │     │   karakuri      │
│   (internet)    │────►│   (jump host)                │────►│   (internal)    │
│                 │ SSH │   Has public IP              │ SSH │   10.200.x.x    │
└─────────────────┘     └──────────────────────────────┘     └─────────────────┘

This is why your ~/.ssh/config has:

Host karakuri
    ProxyJump login.khoury.northeastern.edu

The ProxyJump directive tells SSH: "First connect to login.khoury, then from there connect to karakuri."
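
As a rough sketch of what a fuller entry might look like, the snippet below prints an ssh_config stanza for one host. Only the Host and ProxyJump lines come from this page; the User line, the placeholder usernames, and the idea that your Khoury login name may differ from your cluster login name are assumptions for illustration.

```shell
# Print an ssh_config stanza that routes one host through the Khoury jump host.
# Arguments: $1 = target host
#            $2 = cluster username (placeholder)
#            $3 = username on the jump host (placeholder; may be the same as $2)
make_stanza() {
    printf 'Host %s\n' "$1"
    printf '    User %s\n' "$2"
    printf '    ProxyJump %s@login.khoury.northeastern.edu\n' "$3"
}

# Preview the stanza; add it to ~/.ssh/config by hand once it looks right.
make_stanza karakuri yourname yourkhouryname

# One-off equivalent without editing the config (OpenSSH 7.3 or newer):
#   ssh -J yourkhouryname@login.khoury.northeastern.edu yourname@karakuri
```

The -J flag is the command-line form of ProxyJump, which is handy for testing before you commit anything to your config file.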

Network Diagram

                           INTERNET
                              │
              ┌───────────────▼───────────────────┐
              │   login.khoury.northeastern.edu   │
              │         (Jump Host)               │
              │   - SSH access only               │
              │   - Managed by Khoury IT          │
              └──────────────┬────────────────────┘
                             │ SSH
    ═════════════════════════╪═════════════════════════════════════
              Khoury Internal Network (10.200.0.0/16)
    ═════════════════════════╪═════════════════════════════════════
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│   baunames    │    │   baushare    │    │   bauserver   │
│ 10.200.205.143│    │ 10.200.205.57 │    │ 10.200.205.142│
├───────────────┤    ├───────────────┤    ├───────────────┤
│ • LDAP Server │    │ • NFS Server  │    │ • Apache Web  │
│ • Ansible Hub │    │ • 84TB total  │    │ • baulab.us   │
│ • phpLDAPadmin│    │ • /share/*    │    │ • shell.baulab│
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │
        │ LDAP Auth          │ NFS
        ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│              GPU Workstations (~25 machines)            │
│  tokyo, osaka, karakuri, nagoya, hakone, etc.           │
└─────────────────────────────────────────────────────────┘

What Each Server Does

| Server | Role | Why It Matters |
| --- | --- | --- |
| baunames | LDAP + Ansible | If this goes down, nobody can log in (unless cached) |
| baushare | NFS storage | If this goes down, home directories are inaccessible |
| bauserver | Web server | Hosts baulab.us website, less critical for daily work |

Infrastructure Servers

Current Servers

| Server | IP | OS | Role | Notes |
| --- | --- | --- | --- | --- |
| baunames | 10.200.205.143 | Ubuntu 20.04 | LDAP, Ansible | Primary auth server |
| baunames24 | 10.200.205.100 | Ubuntu 24.04 | LDAP (new) | Migration target |
| bauserver | 10.200.205.142 | Ubuntu 20.04 | Web server | baulab.us |
| bauserver24 | 10.200.205.111 | Ubuntu 24.04 | Web (new) | Migration target |
| baushare | 10.200.205.57 | Synology | NAS | 84TB, 98% full! |
| baunas24 | 10.200.204.169 | Synology | NAS | 98TB, 56% used |

Why Two of Each Server?

You'll notice pairs like baunames/baunames24 and bauserver/bauserver24. This is for migration:

  • The "24" servers run Ubuntu 24.04 (newer)
  • We're gradually migrating from 20.04 to 24.04
  • During migration, both run simultaneously

GPU Workstations

These are the machines you'll actually use for running experiments.

Local Machines (at Khoury 177)

| Hostname | IP | GPU | Primary User | Notes |
| --- | --- | --- | --- | --- |
| tokyo | 10.200.205.97 | A6000 | Eric | |
| karasuno | 10.200.205.139 | A6000 | Koyena | |
| umibozu | 10.200.205.169 | A6000 | Arnab | |
| hawaii | 10.200.205.185 | A6000 | Nikhil | |
| karakuri | 10.200.205.17 | A6000 | Shared | Good for quick jobs |
| kyoto | 10.200.205.136 | 2x A6000 | Masters/Shared | Two GPUs! |
| saitama | 10.200.205.49 | 2x A6000 | Shared | Two GPUs! |
| hokkaido | 10.200.205.103 | A6000 | Shared | |
| andromeda | 10.200.205.162 | A6000 | Alex | |
| ei | 129.10.112.50 | A6000 | David A | Different subnet |
| kobe | 10.200.205.179 | A6000 | Rohit | |
| macondo | 10.200.205.25 | A6000 | Sheridan | |
| naoshima | 10.200.204.124 | A6000 | Hendrik | |
| bippu | 10.200.205.70 | A6000 | Visiting | |
| osaka | 10.200.205.9 | A6000 | Michael | |
| hamada | 10.200.205.141 | A6000 | NDIF team | |
| kumamoto | 10.200.205.184 | A6000 | Adam | |
| fukuyama | 10.200.205.56 | A6000 | NDIF team | |
| sendai | 10.200.205.47 | A6000 | Shared | |
| cancun | 10.200.206.177 | A6000 | Can | |
| kameoka | 10.200.206.178 | A6000 | Shared | |
| kawasaki | 10.200.206.66 | A6000 | Andy | |
| sapporo | 10.200.205.29 | A6000 | - | |

"Primary User" Meaning

Machines with a primary user are informally reserved for that person's long-running jobs. Etiquette:

  • Check with them before using
  • Don't kill their processes
  • Shared machines are fair game
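
Before starting a job on a machine, it can help to see who is already running something on its GPU. The sketch below is one way to do that, assuming nvidia-smi is installed on the workstation (it normally ships with the NVIDIA driver, but this page does not say so). The function name gpu_users is made up for this example.

```shell
# Show which users own the processes currently running on this machine's GPU(s).
gpu_users() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; is this a GPU machine?" >&2
        return 1
    fi
    # nvidia-smi prints one PID per line for every compute process on any GPU.
    nvidia-smi --query-compute-apps=pid --format=csv,noheader |
    while read -r pid; do
        # Map each PID to its owner; ignore blank lines and exited processes.
        if [ -n "$pid" ]; then
            ps -o user= -p "$pid" || true
        fi
    done | sort | uniq -c
}

# Usage (on the workstation itself):
#   gpu_users
```

Each output line is a process count followed by a username. No output means no compute processes are running, although the primary user may still have plans for the machine, so asking first remains the polite move.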

Remote GPU Servers (MGHPCC)

These machines are at the Massachusetts Green High Performance Computing Center in Holyoke, MA - about 100 miles from Boston.

| Hostname | IP | GPU | Notes |
| --- | --- | --- | --- |
| nagoya | 10.201.22.179 | 8x A100 | Very powerful! 80GB each |
| hakone | 10.201.16.108 | 8x A100 | Very powerful! 80GB each |

Important for these machines:

  • Higher network latency to NFS storage
  • Use local /disk for working files - see Using-Nagoya
  • 8 GPUs means you can run distributed training

Storage Architecture

NAS Servers

| Server | Capacity | Used | What's Stored |
| --- | --- | --- | --- |
| baushare | 84TB | 98% | /share/u, /share/projects, /share/backup |
| baunas24 | 98TB | 56% | /share/datasets |

WARNING: baushare is almost full! Use baunas24 (/share/datasets) for large files.
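
To see how full a share is before you write a lot of data to it, something like the following could work. It is a small sketch built on df; the function names and the 90% default threshold are arbitrary choices for this example, not lab policy.

```shell
# Print the usage percentage (without the % sign) of the filesystem holding a path.
usage_pct() {
    # -P forces POSIX one-line-per-filesystem output, so field 5 is the capacity.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning when usage is at or above a threshold (default 90).
warn_if_full() {
    pct=$(usage_pct "$1")
    case "$pct" in
        '' | *[!0-9]*) echo "cannot read usage for $1" >&2; return 1 ;;
    esac
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is ${pct}% full"
    else
        echo "$1 is ${pct}% full"
    fi
}

# Usage on a cluster machine (paths from the table above):
#   warn_if_full /share/u
#   warn_if_full /share/datasets
```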

Mount Points

Every machine in the cluster mounts these directories over NFS. The mounts are handled by autofs, which mounts a path when it is first needed (see NFS-Storage-Administration):

| Path | Server | Purpose |
| --- | --- | --- |
| /share/u/ | baushare | User home directories |
| /share/projects/ | baushare | Shared project folders |
| /share/backup/ | baushare | System backups |
| /share/datasets/ | baunas24 | Large datasets |
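
A quick way to confirm that all four paths are reachable from the machine you are on is to list each one with a time limit. This is only a sketch: the function names and the 5-second limit are assumptions, and it relies on the coreutils timeout command being available.

```shell
# Report whether a directory can be listed within a time limit (default 5 s).
# Listing the path is what prompts autofs to mount it; the timeout keeps a
# hung NFS server from blocking your shell.
check_mount() {
    if timeout "${2:-5}" ls "$1" >/dev/null 2>&1; then
        echo "ok       $1"
    else
        echo "MISSING  $1"
    fi
}

# Check every path from the table above.
check_all() {
    for path in /share/u /share/projects /share/backup /share/datasets; do
        check_mount "$path"
    done
}

# Usage:
#   check_all
```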

Local Storage

Each GPU machine also has local disk at /disk/:

  • Much faster than NFS for I/O-heavy work
  • Not shared - files are only on that machine
  • Use for: conda environments, working copies of datasets, scratch files
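
For example, to work on a dataset from local disk you could copy it over once and point your job at the copy. The helper below is a minimal sketch: stage_local is a made-up name, both paths in the usage comment are hypothetical, and it falls back to cp when rsync is not installed.

```shell
# Copy a directory from NFS to local disk so I/O-heavy jobs read locally.
# Usage: stage_local SRC_DIR DEST_PARENT
stage_local() {
    src=${1%/}                          # drop a trailing slash, if any
    dest="$2/$(basename "$src")"
    mkdir -p "$dest" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -a preserves permissions and timestamps; re-runs copy only changes.
        rsync -a "$src/" "$dest/" || return 1
    else
        # Fallback: plain recursive copy of the directory contents.
        cp -a "$src/." "$dest/" || return 1
    fi
    echo "$dest"                        # print where the local copy lives
}

# Usage (hypothetical paths):
#   stage_local /share/datasets/mydata /disk/yourname
```

Remember that the copy exists only on that one machine, so results you want to keep should go back to /share when the job finishes.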

How Authentication Works

When you SSH into any cluster machine, here's the auth flow:

┌────────────────────────────────────────────────────────────────────────┐
│ 1. You run: ssh karakuri                                               │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│ 2. karakuri asks: "Who is this user?"                                  │
│                                                                        │
│    SSSD (on karakuri) queries LDAP server (baunames)                   │
│    Returns: uid, gid, home directory                                   │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│ 3. karakuri asks: "Is the password/key correct?"                       │
│                                                                        │
│    For SSH keys: checks ~/.ssh/authorized_keys (on NFS)                │
│    For passwords: SSSD verifies against LDAP                           │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│ 4. If authenticated:                                                   │
│    - Create session with your uid/gid                                  │
│    - cd to home directory /share/u/yourname                            │
│    - (autofs mounts NFS if needed)                                     │
│    - If first login, pam_mkhomedir creates home directory              │
└────────────────────────────────────────────────────────────────────────┘

Key insight: Your user info comes from LDAP, but your actual files come from NFS. Both need to be working for a normal login.
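
You can poke at the two halves separately from any cluster machine. The sketch below assumes the machine resolves users through NSS (which is how SSSD normally plugs in), so getent shows what step 2 returns; the function names and the yourname placeholder are made up for this example.

```shell
# Step 2: resolve a username and print the uid, gid, and home directory.
show_identity() {
    entry=$(getent passwd "$1") || { echo "no such user: $1" >&2; return 1; }
    printf '%s\n' "$entry" |
        awk -F: '{ printf "uid=%s gid=%s home=%s\n", $3, $4, $6 }'
}

# Step 3 (SSH keys): succeed if authorized_keys under a home directory is
# readable. On the cluster this reads from NFS, so it exercises that half too.
has_keys() {
    [ -r "$1/.ssh/authorized_keys" ]
}

# Usage on a cluster machine:
#   show_identity yourname
#   has_keys /share/u/yourname && echo "authorized_keys is readable"
```

If show_identity works but has_keys fails or hangs, the LDAP side is probably fine and NFS is the place to look.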

Web Services

baulab.us URLs

| URL | What's There |
| --- | --- |
| https://baulab.us/ | Lab main website |
| https://baulab.us/u/USERNAME/ | Personal web directories |
| https://baulab.us/p/PROJECT/ | Project web directories |
| https://shell.baulab.us/ | Web-based SSH terminal |

How User Web Directories Work

Files in /share/u/yourname/public_html/ are served at https://baulab.us/u/yourname/.

This is useful for:

  • Hosting paper demos
  • Sharing results with collaborators
  • Visualizations
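
Publishing a file could look roughly like this. It is a sketch under a few assumptions: your home directory is /share/u/yourname, your login name matches the name in the URL, and the web server needs world-readable files (the 755 and 644 modes are a guess, since this page does not state the required permissions). The function name publish is made up.

```shell
# Copy a file into ~/public_html and print the URL it should appear at.
publish() {
    web_dir="$HOME/public_html"
    name=$(basename "$1")
    mkdir -p "$web_dir" || return 1
    cp "$1" "$web_dir/" || return 1
    # Assumed permissions: directory traversable, file readable by everyone.
    chmod 755 "$web_dir"
    chmod 644 "$web_dir/$name"
    echo "https://baulab.us/u/$(id -un)/$name"
}

# Usage (hypothetical file):
#   publish results/demo.html
```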

Comparison: baulab.us vs baukit.org

David runs a second cluster at home. The same user accounts exist on both:

| Aspect | baulab.us (Lab) | baukit.org (Home) |
| --- | --- | --- |
| Location | Khoury 177 | David's house |
| LDAP Server | baunames | names |
| Domain | dc=baulab,dc=us | dc=thevisible,dc=net |
| Jump Host | login.khoury | baukit.org |
| phpLDAPadmin | Port 8877 | Port 8876 |
| Purpose | Main GPU cluster | Development, backup |

When adding users, add them to both clusters so they can access both.

Common Network Tasks

Finding a Machine's IP

# On the machine itself
ip addr | grep "inet 10"

# From anywhere
host karakuri.khoury.northeastern.edu

Checking if a Machine is Up

# Send three probes, then stop (run from a machine inside the Khoury network)
ping -c 3 karakuri

Testing SSH Connectivity

# Verbose mode shows connection steps
ssh -v karakuri

Checking Network Services

# See what ports are listening
sudo ss -tlnp

# Check if LDAP is reachable
ldapsearch -x -H ldap://baunames -b "dc=baulab,dc=us" "(uid=yourname)"

# Check if NFS is reachable
showmount -e baushare

Troubleshooting

Can't SSH to a Machine

  1. Check if it's up: ping machinename
  2. Check your SSH config: Is ProxyJump set correctly?
  3. Try verbose mode: ssh -v machinename to see where it fails
  4. Check the jump host: Can you SSH to login.khoury.northeastern.edu?
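
Steps 2 and 4 can be scripted so they never hang on a password prompt. The helpers below are a sketch with made-up function names; they rely on two standard OpenSSH features, ssh -G (print the configuration ssh would use, without connecting) and BatchMode (fail instead of prompting).

```shell
# Step 2: print the ProxyJump value ssh would actually use for a host.
# Prints nothing if no ProxyJump applies to that host.
effective_jump() {
    ssh -G "$1" | awk '$1 == "proxyjump" { print $2 }'
}

# Non-interactive login probe: succeeds only if the host is reachable and
# key-based authentication works, and gives up after 5 seconds per connection.
ssh_check() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
        echo "$1: ok"
    else
        echo "$1: FAILED"
    fi
}

# Suggested order: jump host first, then the machine behind it.
#   ssh_check login.khoury.northeastern.edu
#   effective_jump karakuri
#   ssh_check karakuri
```

If the jump host check fails, nothing behind it will work either, so fix that before looking at the target machine. Note that a FAILED result can also just mean you log in with a password rather than a key.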

Authentication Failures

  1. LDAP down?: ldapsearch -x -H ldap://baunames -b "dc=baulab,dc=us" "(uid=yourname)"
  2. SSSD issue?: On the target machine: sudo systemctl status sssd
  3. Cached credentials: If LDAP was recently fixed, clear cache: sudo sss_cache -E
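
The three checks can be chained so the first failure tells you where to look. This is a sketch meant to run on the target machine; auth_diagnose is a made-up name, and the hint printed for the last case is a suggestion based on the list above, not a guaranteed diagnosis.

```shell
# Run the checks above in order and stop at the first one that fails.
# Exit status: 0 = all fine, 1 = LDAP, 2 = SSSD, 3 = local lookup.
auth_diagnose() {
    user="$1"
    # An anonymous search succeeds whenever the server answers, even with no match.
    if ! ldapsearch -x -H ldap://baunames -b "dc=baulab,dc=us" "(uid=$user)" >/dev/null 2>&1; then
        echo "LDAP query failed: baunames may be down or unreachable"
        return 1
    fi
    if ! systemctl is-active --quiet sssd; then
        echo "sssd is not running: see 'sudo systemctl status sssd'"
        return 2
    fi
    if ! getent passwd "$user" >/dev/null; then
        echo "LDAP answers but the local lookup fails: try 'sudo sss_cache -E'"
        return 3
    fi
    echo "LDAP, SSSD and the local lookup all look fine for $user"
}

# Usage:
#   auth_diagnose yourname
```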

NFS Issues

  1. NAS down?: ping baushare
  2. Autofs issue?: sudo systemctl restart autofs
  3. See NFS-Storage-Administration for more.

Related Pages