Accessing Clusters

Accounts

If you're new to Khoury (e.g., a visitor), get an account here - https://my.khoury.northeastern.edu/account/apply - that account is separate from your Northeastern email account and has its own password. And then (sorry) there is a third/fourth password for the Bau lab computers: contact Arnab Sen Sharma for accounts on baulab/baukit. I recommend using ssh public keys to minimize password-typing. More details below.

baulab cluster at Khoury (baulab.us)

The workstations at the Bau lab at Khoury are organized into a small cluster, physically located at various desks and closets at 177 Huntington. Your Bau lab account will let you log into any of the machines. Each has an A6000 GPU (or two).

To access these machines from outside Khoury, you will first need to ssh through login.khoury.northeastern.edu. Note that your username on login will be the one given to you by Khoury, whereas your username on the workstations will be the one given to you by David, which is probably different.

You can use the following in your .ssh/config (e.g., on your laptop) to set this up:

Host karasuno karakuri hawaii tokyo umibozu kyoto saitama bippu osaka hamada kumamoto fukuyama sendai andromeda hokkaido cancun kameoka
    ProxyJump login.khoury.northeastern.edu
    User [your username at baulab]

Host nagoya
    ProxyJump login.khoury.northeastern.edu
    HostName nagoya.research.khoury.northeastern.edu
    User [your username at baulab]

Host hakone
    ProxyJump login.khoury.northeastern.edu
    HostName hakone.research.khoury.northeastern.edu
    User [your username at baulab]

Host login.khoury.northeastern.edu
    User [your username at Khoury]

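With this config in place, ssh hops through the Khoury login host automatically. For example (the target host and file name below are just illustrations):

# from your laptop; ssh proxies through login.khoury.northeastern.edu for you
ssh karasuno

# scp and rsync honor the same ProxyJump setting
scp results.tar.gz karasuno:~/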
Note that each workstation is primarily used by one of the students, who may ask you to keep it clear while they are actively using it. However, if a machine is idle, feel free to use it.

  • nagoya.research.khoury.northeastern.edu (8 X A100s) - Shared
  • hakone.research.khoury.northeastern.edu (8 X A100s) - Shared
  • karasuno - Koyena
  • hawaii - Nikhil
  • tokyo - Eric
  • umibozu - Arnab
  • kyoto - Masters students / Shared
  • karakuri (dual gpu) - Shared
  • saitama (dual gpu) - Shared
  • hokkaido - Aaron
  • ei - David A
  • andromeda - Alex
  • kobe - Rohit
  • macondo - Sheridan
  • naoshima - demo machine for Hendrik Strobelt
  • bippu - Imke's visiting machine (hosting PatchExplorer)
  • kumamoto - Adam
  • osaka - Michael
  • fukuyama - ndif team
  • hamada - ndif team
  • sendai - shared
  • cancun - Can
  • kameoka - shared

This cluster also contains a webserver (internally called bauserver) that serves https://baulab.us/. In a pinch (e.g., in case of problems with the Khoury login), you can log into the cluster directly over https using https://shell.baulab.us/.

baukit.org

baukit.org is David's personal GPU cluster, which includes a few machines at David's house. There are a few A6000 GPUs available there, and they are useful for students and collaborators who do not have access to university resources.

The computers are accessible via a jump host as follows.

If you are on a Mac or Linux machine (where OpenSSH is available), edit your ~/.ssh/config file to include the following lines:

Host uno quadro duo rtx
  User [your username on baukit.org]
  ProxyJump baukit
  LocalForward localhost:8888 localhost:8888

Host baukit
  HostName baukit.org
  User [your username on baukit.org]
  LocalForward localhost:8888 localhost:8888

Then you should be able to ssh uno (etc.) to get to one of the machines in the baukit.org cluster. You may need to enter your password twice. (The config above also forwards port 8888 to your local machine; change it to your favorite port for Jupyter.)
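Once you're connected, a typical Jupyter workflow over that forwarded port looks something like this (the jupyter command is just an example; use whatever notebook server you prefer):

# on the remote machine, e.g. after ssh uno
jupyter lab --no-browser --port=8888

# then open http://localhost:8888 in the browser on your laptop;
# the LocalForward lines in the config carry the traffic over ssh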

To avoid having to type your password, you should set up an ssh key. Do the following:

On your mac:

> ssh-keygen -t rsa

(Hit enter to accept the defaults.) This creates two files: ~/.ssh/id_rsa, a secret key that you should never share or copy anywhere else, and ~/.ssh/id_rsa.pub, your public key, which you will copy to wherever you need it.

Then on your mac:

> ssh-copy-id [user]@uno

That command will ask for your password once or twice, maybe for the last time ever, in order to copy your id_rsa.pub public key to baukit.org. Now you should be able to ssh uno without entering your password.
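If ssh-copy-id is not available on your system, the same effect can be achieved by hand (a sketch; [user] and uno stand in for your own username and target host):

# append your public key to the remote authorized_keys file manually
cat ~/.ssh/id_rsa.pub | ssh [user]@uno 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'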

The machines on baukit.org are ordinary Linux multiuser servers, which means that you can ssh to any machine you like, even if somebody else is already using it. Running who will tell you who else is logged into a machine, and you should use nvidia-smi and htop to see which GPU and CPU resources are in use. Pick a machine that is unused.
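For example, a quick loop like this surveys GPU usage across the machines from your laptop (host names taken from the ssh config above):

for h in uno quadro duo rtx; do
  echo "== $h =="
  # one line per GPU: utilization, memory used, memory total
  ssh "$h" nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
done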

If you need some library installed on a machine, ask David to install it or ask for sudo membership.

Discovery HPC cluster

As a student at Northeastern you can get free access to the Northeastern Discovery cluster; just fill out the form here. There is also a group called "baulab" with some specific permissions that you can ask to join.

Once your account is approved you should be able to ssh [username]@login.discovery.neu.edu. I shorten this to ssh discovery by adding the following to my .ssh/config:

Host discovery
    User [username]
    HostName login.discovery.neu.edu

Discovery is run as a supercomputer (HPC = high performance computing) cluster, which means that you need to use SLURM to queue up batch jobs, or to reserve machines to run interactively.

Read about how to use SLURM on Discovery here.

A typical sbatch file looks like this. At the end, instead of just running nvidia-smi, you would load some modules, activate a conda environment, and then run your Python program (see the sketch after the script).

#!/bin/bash
#SBATCH --job-name=gpu_a100_check     # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=[username]@northeastern.edu
#SBATCH --partition=gpu               # Use the public GPU partition
#SBATCH --nodes=1                     # Run on a single node
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --gres=gpu:a100:1             # Run on a single a100
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=gpu_a100_check.log   # Standard output and error log

# The question: A100's come in two different memory sizes.  Which size do we have?

hostname
date
nvidia-smi

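For a real job, the tail of the script would look something like this sketch (module names, environment name, and script are placeholders; run module avail on Discovery to see what is actually installed):

# load toolchain modules (example names; check `module avail`)
module load anaconda3
module load cuda

# activate your conda environment and run your program
source activate myenv
python train.py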
To queue up the job, you would save this file as something like gpu_a100_check.sbatch and then run sbatch gpu_a100_check.sbatch.

Then you can check on your queued jobs with squeue.
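A couple of commands that come in handy here (standard SLURM; the job id is a placeholder):

squeue -u $USER      # list only your own jobs
scancel <jobid>      # cancel a queued or running job if needed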

Instead of queuing a batch job, you can also run an interactive session like this:

srun --partition=gpu --nodes=1 --ntasks=1 --gres=gpu:k40m:1 --mem=10G --pty /bin/bash

If you run this command, you will get a shell session on a k40m GPU machine, and you can use it interactively just like any other ssh session.