Accessing Clusters

Accounts

If you're new to Khoury (e.g., a visitor), get an account here - https://my.khoury.northeastern.edu/account/apply - that account is separate from your Northeastern email account and has its own password. And then (sorry) there is a third/fourth password for the Bau lab computers: contact Arnab Sen Sharma for accounts on baulab/baukit. I recommend using ssh public keys to minimize password-typing. More details below.

baulab cluster at Khoury (baulab.us)

The workstations at the Bau lab at Khoury are organized into a small cluster, physically located at various desks and closets at 177 Huntington. Your Bau lab account will let you log into any of the machines. Each has an A6000 GPU (or two).

To access these machines from outside Khoury, you will first need to ssh through login.khoury.northeastern.edu. Note that your username on login will be the one given to you by Khoury, whereas your username on the workstations will be the one given to you by David, which is probably different.

You can use the following in your .ssh/config (e.g., on your laptop) to set this up:

Host karasuno karakuri hawaii tokyo umibozu kyoto saitama bippu osaka hamada kumamoto fukuyama sendai andromeda hokkaido cancun kameoka
    ProxyJump login.khoury.northeastern.edu
    User [your username at baulab]

Host nagoya
    ProxyJump login.khoury.northeastern.edu
    HostName nagoya.research.khoury.northeastern.edu
    User [your username at baulab]

Host hakone
    ProxyJump login.khoury.northeastern.edu
    HostName hakone.research.khoury.northeastern.edu
    User [your username at baulab]

Host login.khoury.northeastern.edu
    User [your username at Khoury]

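With this config in place, ssh hops through the Khoury login host automatically. For example (the target host and file name below are just illustrations):

# from your laptop; ssh proxies through login.khoury.northeastern.edu for you
ssh karasuno

# scp and rsync honor the same ProxyJump setting
scp results.tar.gz karasuno:~/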
Note that each workstation is primarily used by one of the students, who may ask you to keep it clear while they are actively using it. However, if a machine is idle, feel free to use it.

  • nagoya.research.khoury.northeastern.edu (8 X A100s) - Shared
  • hakone.research.khoury.northeastern.edu (8 X A100s) - Shared
  • karasuno - Koyena
  • hawaii - Nikhil
  • tokyo - Eric
  • umibozu - Arnab
  • kyoto - Masters students / Shared
  • karakuri (dual gpu) - Shared
  • saitama (dual gpu) - Shared
  • hokkaido - Aaron
  • ei - David A
  • andromeda - Alex
  • kobe - Rohit
  • macondo - Sheridan
  • naoshima - demo machine for Hendrik Strobelt
  • bippu - Imke's visiting machine (hosting PatchExplorer)
  • kumamoto - Adam
  • osaka - Michael
  • fukuyama - ndif team
  • hamada - ndif team
  • sendai - shared
  • cancun - Can
  • kameoka - shared

This cluster also contains a webserver (internally called bauserver) that serves https://baulab.us/. In a pinch (e.g., in case of problems with the Khoury login), you can log into the cluster directly over https using https://shell.baulab.us/.

baukit.org

baukit.org is David's personal GPU cluster, which includes a few machines at David's house. There are a few A6000 GPUs available there, and they are useful for students and collaborators who do not have access to university resources.

The computers are accessible via a jump host as follows.

If you are on a Mac or Linux machine (where OpenSSH is available), edit your ~/.ssh/config file to include the following lines:

Host uno quadro duo rtx
  User [your username on baukit.org]
  ProxyJump baukit
  LocalForward localhost:8888 localhost:8888

Host baukit
  HostName baukit.org
  User [your username on baukit.org]
  LocalForward localhost:8888 localhost:8888

Then you should be able to ssh uno (etc.) to get to one of the machines in the baukit.org cluster. You may need to enter your password twice. (The config above also forwards port 8888 to your local machine; change it to your favorite port for Jupyter.)
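Once you're connected, a typical Jupyter workflow over that forwarded port looks something like this (the jupyter command is just an example; use whatever notebook server you prefer):

# on the remote machine, e.g. after ssh uno
jupyter lab --no-browser --port=8888

# then open http://localhost:8888 in the browser on your laptop;
# the LocalForward lines in the config carry the traffic over ssh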

To avoid having to type your password, you should set up an ssh key. Do the following:

On your mac:

> ssh-keygen -t rsa

(Hit enter to accept the defaults.) This creates two files: ~/.ssh/id_rsa, a secret key that you should never share or copy anywhere else, and ~/.ssh/id_rsa.pub, your public key, which you will copy to wherever you need it.

Then on your mac:

> ssh-copy-id [user]@uno

That command will ask for your password once or twice, maybe for the last time ever, in order to copy your id_rsa.pub public key to baukit.org. Now you should be able to ssh uno without entering your password.
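If ssh-copy-id is not available on your system, the same effect can be achieved by hand (a sketch; [user] and uno stand in for your own username and target host):

# append your public key to the remote authorized_keys file manually
cat ~/.ssh/id_rsa.pub | ssh [user]@uno 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'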

The machines on baukit.org are ordinary Linux multiuser servers, which means that you can ssh to any machine you like, even if somebody else is already using it. Running who will tell you who else is logged into a machine, and you should use nvidia-smi and htop to see which GPU and CPU resources are in use. Pick a machine that is unused.
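For example, a quick loop like this surveys GPU usage across the machines from your laptop (host names taken from the ssh config above):

for h in uno quadro duo rtx; do
  echo "== $h =="
  # one line per GPU: utilization, memory used, memory total
  ssh "$h" nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
done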

If you need some library installed on a machine, ask David to install it or ask for sudo membership.

Discovery HPC cluster

As a student at Northeastern you can get free access to the Northeastern Discovery cluster; just fill out the form here. There is also a group called "baulab" with some specific permissions that you can ask to join.

Once your account is approved you should be able to ssh [username]@login.discovery.neu.edu. I shorten this to ssh discovery by adding the following to my .ssh/config:

Host discovery
    User [username]
    HostName login.discovery.neu.edu

Discovery is run as a supercomputer (HPC = high performance computing) cluster, which means that you need to use SLURM to queue up batch jobs, or to reserve machines to run interactively.

Read about how to use SLURM on Discovery here.

A typical sbatch file looks like this. At the end, instead of just running nvidia-smi, you would load some modules, activate a conda environment, and then run your Python program (see the sketch after the script).

#!/bin/bash
#SBATCH --job-name=gpu_a100_check     # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=[username]@northeastern.edu
#SBATCH --partition=gpu               # Use the public GPU partition
#SBATCH --nodes=1                     # Run on a single node
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --gres=gpu:a100:1             # Run on a single a100
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=gpu_a100_check.log   # Standard output and error log

# The question: A100's come in two different memory sizes.  Which size do we have?

hostname
date
nvidia-smi

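For a real job, the tail of the script would look something like this sketch (module names, environment name, and script are placeholders; run module avail on Discovery to see what is actually installed):

# load toolchain modules (example names; check `module avail`)
module load anaconda3
module load cuda

# activate your conda environment and run your program
source activate myenv
python train.py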
To queue up the job, you would save this file as something like gpu_a100_check.sbatch and then run sbatch gpu_a100_check.sbatch.

Then you can check on your queued jobs with squeue.
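A couple of commands that come in handy here (standard SLURM; the job id is a placeholder):

squeue -u $USER      # list only your own jobs
scancel <jobid>      # cancel a queued or running job if needed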

Instead of queuing a batch job, you can also run an interactive session like this:

srun --partition=gpu --nodes=1 --ntasks=1 --gres=gpu:k40m:1 --mem=10G --pty /bin/bash

If you run this command, you will get a shell session on a k40m GPU machine, and you can use it interactively just like any other ssh session.