# Accessing Clusters
## Accounts
If you're new to Khoury (e.g., a visitor), get a Khoury account here: https://my.khoury.northeastern.edu/account/apply. That account is separate from your Northeastern email account and has its own password. And then (sorry) there is a third/fourth password for Bau lab computers; contact Arnab Sen Sharma for accounts on baulab/baukit. I recommend using ssh public keys to minimize password-typing. More details below.
## baulab cluster at Khoury (baulab.us)
The workstations at the Bau lab at Khoury are organized into a small cluster, physically located at various desks and closets at 177 Huntington. Your Bau lab account will allow you to log into any of the machines. Each has an A6000 GPU (or two).
To access these machines from outside Khoury, you will need to first ssh through `login.khoury.northeastern.edu`. Note that your username on `login` will be the one given to you by Khoury, whereas the username on the workstations will be the one given to you by David, which is probably different.

You can use the following in your `.ssh/config` (e.g., on your laptop) to set it up:
```
Host karasuno karakuri hawaii tokyo umibozu kyoto saitama bippu osaka hamada kumamoto fukuyama sendai andromeda hokkaido cancun kameoka
    ProxyJump login.khoury.northeastern.edu
    User [your username at baulab]

Host nagoya
    ProxyJump login.khoury.northeastern.edu
    HostName nagoya.research.khoury.northeastern.edu
    User [your username at baulab]

Host hakone
    ProxyJump login.khoury.northeastern.edu
    User [your username at baulab]

Host login.khoury.northeastern.edu
    User [your username at Khoury]
```
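With that config in place, `ssh` should hop through the Khoury login host for you automatically. For example (the host and file names here are just illustrations):

```bash
# Connect to a workstation; ssh transparently proxies through login.khoury.northeastern.edu.
ssh tokyo

# scp and rsync use the same config, so copying files works the same way.
scp mydata.tar.gz tokyo:~/
```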
Note that each workstation is primarily used by one of the students, who may ask you to keep it clear when they are actively using it. However, if a machine is idle, feel free to use it; a quick way to check is sketched after the list below.
- nagoya.research.khoury.northeastern.edu (8 X A100s) - Shared
- hakone.research.khoury.northeastern.edu (8 X A100s) - Shared
- karasuno - Koyena
- hawaii - Nikhil
- tokyo - Eric
- umibozu - Arnab
- kyoto - Masters students / Shared
- karakuri (dual gpu) - Shared
- saitama (dual gpu) - Shared
- hokkaido - Aaron
- ei - David A
- andromeda - Alex
- kobe - Rohit
- macondo - Sheridan
- naoshima - demo machine for Hendrik Strobelt
- bippu - Imke's visiting machine (hosting PatchExplorer)
- kumamoto - Adam
- osaka - Michael
- fukuyama - ndif team
- hamada - ndif team
- sendai - shared
- cancun - Can
- kameoka - shared
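Before borrowing an idle workstation, check whether anyone is actually using it. A minimal check, assuming the `.ssh/config` entries above, might look like:

```bash
# See who is logged in and whether the GPU is busy on, e.g., kyoto.
ssh kyoto who
ssh kyoto nvidia-smi
ssh -t kyoto htop    # interactive CPU/memory view; press q to quit
```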
This cluster also contains a webserver (internally called `bauserver`) that serves https://baulab.us/. In a pinch (e.g., in case of problems with the Khoury login), you can log into the cluster directly over https at https://shell.baulab.us/.
## baukit.org

`baukit.org` is David's personal GPU cluster, which includes a few machines at David's house. There are a few A6000 GPUs available there, and they are useful for students and collaborators who do not have access to university resources.
The computers are accessible via a jump host as follows. If you are on a Mac or Linux (where OpenSSH is available), edit your `~/.ssh/config` file to include the following lines:
```
Host uno quadro duo rtx
    User [your username on baukit.org]
    ProxyJump baukit
    LocalForward localhost:8888 localhost:8888

Host baukit
    HostName baukit.org
    User [your username on baukit.org]
    LocalForward localhost:8888 localhost:8888
```
Then you should be able to `ssh uno` (etc.) to reach one of the machines in the `baukit.org` cluster. You may need to enter your password twice. (The config above also forwards port 8888 to your local machine; change it to your favorite port for Jupyter.)
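For example, assuming Jupyter is installed in your Python environment on the remote machine, you can start a notebook server there and reach it through the forwarded port:

```bash
# On uno (inside your ssh session): start Jupyter without opening a browser there.
jupyter lab --no-browser --port=8888

# Then, on your laptop, open the URL Jupyter prints (it looks like
# http://localhost:8888/lab?token=...); the forwarded port carries it back to you.
```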
To avoid having to type your password, you should set up an ssh key. Do the following.

On your Mac:

```bash
ssh-keygen -t rsa
```

(Hit enter to accept the defaults.) This creates two files: `~/.ssh/id_rsa`, which is a secret key that you should never share or copy anywhere else, and `~/.ssh/id_rsa.pub`, which is your public key, which you will copy to anywhere you need it.
Then on your Mac:

```bash
ssh-copy-id [user]@uno
```

That command will ask for your password once or twice, maybe for the last time ever, in order to copy your `id_rsa.pub` public key to `baukit.org`. Now you should be able to `ssh uno` without entering your password.
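If the first hop (the jump host) still asks for a password, you can copy the same key there too; `ssh-copy-id` understands the aliases from your `~/.ssh/config`:

```bash
# Install the same public key on the baukit jump host so both hops are passwordless.
ssh-copy-id [user]@baukit
```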
The machines on `baukit.org` are ordinary Linux multiuser servers, which means that you can ssh to any machine you like, even if somebody else is already using it. `who` will tell you who else is logged into a machine, and you should use `nvidia-smi` and `htop` to see which GPU and CPU resources are being used. Pick a machine that is unused.
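A quick way to survey all four machines at once (just a sketch, using the host aliases from the config above):

```bash
# For each baukit machine, show who is logged in and how busy the GPU is.
for h in uno duo quadro rtx; do
    echo "== $h =="
    ssh "$h" 'who; nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv'
done
```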
If you need some library on the machine, ask David to install it or ask for `sudo` membership.
## Discovery HPC cluster
As a student at Northeastern, you can get free access to the Northeastern Discovery cluster; you just fill out the form here. There's a group called "baulab" for some specific permissions that you can ask to be added to.
Once your account is approved, you should be able to `ssh [username]@login.discovery.neu.edu`. I shorten this to `ssh discovery` by adding the following to my `.ssh/config`:
```
Host discovery
    User [username]
    HostName login.discovery.neu.edu
```
Discovery is run as a supercomputer (HPC = high performance computing) cluster, which means that you need to use SLURM to queue up batch jobs, or to reserve machines to run interactively.
Read about how to use SLURM on Discovery here.
A typical sbatch file looks like this, except that at the end, instead of just running `nvidia-smi`, you would load some modules, activate a conda environment, and then run your Python program (a sketch of such an ending appears after the example).
```bash
#!/bin/bash
#SBATCH --job-name=gpu_a100_check        # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=[username]@northeastern.edu
#SBATCH --partition=gpu                  # Use the public GPU partition
#SBATCH --nodes=1                        # Run on a single node
#SBATCH --ntasks=1                       # Run a single task
#SBATCH --gres=gpu:a100:1                # Run on a single A100
#SBATCH --mem=1gb                        # Job memory request
#SBATCH --time=00:05:00                  # Time limit hrs:min:sec
#SBATCH --output=gpu_a100_check.log      # Standard output and error log

# The question: A100s come in two different memory sizes. Which size do we have?
hostname
date
nvidia-smi
```
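For a real job, the end of the script might look roughly like the sketch below. The module name, environment name, and paths are placeholders; run `module avail` on Discovery and substitute whatever your project actually uses.

```bash
# Placeholder names below -- adjust to your own modules, environment, and code.
module load anaconda3              # or whichever Python/conda module Discovery provides
source activate myenv              # or `conda activate myenv`, depending on your conda setup

cd /work/[username]/myproject      # hypothetical project directory
python train.py --epochs 10        # your actual program and arguments
```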
To queue up the job, save this file as something like `gpu_a100_check.sbatch` and then run `sbatch gpu_a100_check.sbatch`.
Then you can check on your queued jobs with `squeue`.
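A couple of common invocations (standard SLURM commands, not specific to Discovery):

```bash
squeue -u $USER     # list only your own jobs
scancel <jobid>     # cancel a queued or running job by its job ID
```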
Instead of queuing a batch job, you can also run an interactive session like this:
```bash
srun --partition=gpu --nodes=1 --ntasks=1 --gres=gpu:k40m:1 --mem=10G --pty /bin/bash
```
If you run this command, you will get a shell session on a machine with a K40m GPU, and you can use it interactively just like any other ssh session.