Using Savio - theunissenlab/lab-documentation GitHub Wiki

Savio is one of three Berkeley Research Computing (BRC) high performance computing (HPC) clusters provided by the university.

Tutorial intro from BRC here

Note: the Theunissen lab has two projects, "fc_birdpow" (free allowance, depleted until May/June 2018) and "ac_birdpow" (paid allowance, about 100,000 Service Units left as of 2/2018). These two project names, also known in the BRC docs as "account names", are interchangeable everywhere they appear on this page.

Getting a user account

(and associated "once only" steps)

See Getting an account and logging into BRC clusters

  1. Fill out this account request form. Choose "Savio" as the cluster you'd like to access. This step requires manual approval and will take 1-2 days.

  2. You should get an email requesting you to fill out this access agreement form

  3. Once your account is active, fill out this form to link your CalNet (or Facebook/Google/LinkedIn) account to your BRC HPC account.

  4. After a few minutes, you should get an email with instructions to complete the linking.

  5. Download and open the Google Authenticator app on your phone (this will be used to create one-time passwords for logging into Savio)

  6. Once your account is linked, go to the "Non-LBL Token Management page" on another device (not the one running Google Authenticator), add a token, and scan the barcode from the Authenticator app. Remember the PIN you chose when you made the token.

Connecting to Savio

  1. Open the Google Authenticator app on your phone

  2. The login nodes are hpc.brc.berkeley.edu: ssh <username>@hpc.brc.berkeley.edu. At the password prompt, enter the PIN you chose when you made the token, followed immediately by the 6-digit one-time password displayed in the app. For example, if your PIN is 1234 and the code in the app is 888 888, you should enter "1234888888".

Transferring data to Savio

The data transfer node is dtn.brc.berkeley.edu. From a lab machine, to transfer to your home directory:

scp <local file> <username>@dtn.brc.berkeley.edu:/global/home/users/<username>

To transfer to your scratch directory:

scp <local file> <username>@dtn.brc.berkeley.edu:/global/scratch/<username>

(same login procedure as described above)

Note: scp and rsync will fail if your .bashrc writes any output. To fix this, wrap any lines that produce output in if shopt -q login_shell; then [...]; fi.
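The guard above can be sketched as follows (the welcome message and PATH line are hypothetical examples of what a .bashrc might contain):

```shell
# Hypothetical ~/.bashrc fragment. scp/rsync run commands through a
# non-login, non-interactive shell, and any output printed there
# corrupts the transfer; the login_shell guard keeps such lines quiet.
if shopt -q login_shell; then
    echo "Welcome to Savio, $USER"   # only printed on real logins
fi
export PATH="$HOME/bin:$PATH"        # silent lines need no guard
```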

Accessing the lab network from savio

It will probably be useful to access the lab network and tdrive from Savio in order to copy data to the HPC cluster. Here's how to do that.

  1. While on a Savio node, create an RSA key in /global/home/users/<username>/.ssh with ssh-keygen -t rsa -f id_rsa_savio

  2. Edit /global/home/users/<username>/.ssh/config on Savio to tell it to use the correct key and point it to the host (zebra):

Host TheunissenLab
  User USERNAME_ON_FET_CLUSTER
  IdentityFile ~/.ssh/id_rsa_savio
  HostName 169.229.219.171
  3. Copy the public key (id_rsa_savio.pub) onto the FET cluster into the /auto/fhome/USERNAME_ON_FET_CLUSTER/.ssh/ directory
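The three steps above might look like this in practice (run on a Savio login node; the Host alias and paths match the config shown, and ssh-copy-id is a standard OpenSSH helper that appends your public key to the remote authorized_keys):

```shell
# Step 1: generate the key pair (id_rsa_savio / id_rsa_savio.pub)
ssh-keygen -t rsa -f ~/.ssh/id_rsa_savio

# Step 2: append the config entry for the lab host
cat >> ~/.ssh/config <<'EOF'
Host TheunissenLab
  User USERNAME_ON_FET_CLUSTER
  IdentityFile ~/.ssh/id_rsa_savio
  HostName 169.229.219.171
EOF

# Step 3: install the public key on the FET cluster
ssh-copy-id -i ~/.ssh/id_rsa_savio.pub TheunissenLab

# Afterwards you can connect (or scp/rsync) with just:
ssh TheunissenLab
```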

Using Savio

See Running your jobs and Accessing and installing software

The SLURM Scheduler is used to manage the workload on Savio.

Some useful commands

  • Start interactive session: srun --pty -A fc_birdpow -p savio2 -t 00:00:30 bash -i
  • Submit job: sbatch myjob.sh (See example job scripts at Running your jobs)
  • View the job queue: squeue or squeue --user <username>
  • Cancel job #123: scancel 123
  • Cancel all your jobs: scancel -u <username>
  • View your permissions: sacctmgr -p show associations user=<username>
  • Check project usage allowance: check_usage.sh -a fc_birdpow. If the project is out of allowance it will say "0 jobs, 0 hours; allowance exceeded"; to see the actual usage, specify the most recent June 1st as the start date with -s: check_usage.sh -s 2017-06-01 -a fc_birdpow. Service Units defined here
  • Check your own usage: check_usage.sh -u <username>
  • View memory usage of a job: sacct -o MaxRSS -j JOBID (if your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD)

Notes on parallelization

  • Request multiple tasks for your job with the --ntasks=<n> parameter in your job script (or sbatch command). Then any command executed as srun <command> in your job script will be executed n times in an environment with the environment variable SLURM_PROCID set to 0, 1, ..., n-1.
  • You can specify the distribution of tasks across nodes using --nodes=<n> and --ntasks-per-node=<m> in your job script (or sbatch command). If you're doing this kind of thing, there are tons more options that you should read about elsewhere.
  • If you are running code that can take advantage of multiple CPUs (matlab's parfor?), request multiple CPUs with the --cpus-per-task=<n> parameter in your job script (or sbatch command).
  • A cheap way to parallelize independent jobs (parameter sweeps, Monte Carlo experiments, etc.) is to use the --array=1-<n> option in your job script. This is essentially equivalent to running sbatch myjob.sh n times. Your code can figure out its instance's array index (and thus what work to perform) by reading the SLURM_ARRAY_TASK_ID environment variable (e.g. in Python: int(os.environ['SLURM_ARRAY_TASK_ID']))
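The job-array pattern in the last bullet can be sketched in Python (the parameter list and print statement are hypothetical; the fallback default just lets the sketch run outside a job):

```python
import os

# Hypothetical parameter sweep: each array task picks one learning rate.
# SLURM sets SLURM_ARRAY_TASK_ID for each instance; with --array=1-3
# the values are 1, 2, 3 (1-based), hence the "- 1" when indexing.
learning_rates = [0.001, 0.01, 0.1]

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
lr = learning_rates[task_id - 1]
print("Task %d using learning rate %g" % (task_id, lr))
```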

Random notes

  • To use pip: module load python/2.7 (or 3.5 or 3.6)
  • Consider putting in your job script #SBATCH --mail-type=ALL and #SBATCH [email protected] for email alerts
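Putting the notes above together, a minimal job script might look like this (the job name, time limit, and script name are placeholders to adapt; account and partition names are the ones used elsewhere on this page):

```shell
#!/bin/bash
#SBATCH --job-name=example          # hypothetical job name
#SBATCH --account=fc_birdpow        # or ac_birdpow
#SBATCH --partition=savio2
#SBATCH --time=00:10:00
#SBATCH --mail-type=ALL             # email on begin/end/fail
#SBATCH [email protected]

module load python/2.7              # makes pip/python available
python myscript.py                  # hypothetical script
```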

Using Theano on the Savio GPU cluster

Installation/setup

Theano 0.9.0 is installed by default on Savio nodes. I was not able to get any other version of Theano to work with the GPUs.

module load python/2.7         # installs pip; other versions also available
module load cuda/7.5           # other versions also available, but I could not get them to work
pip install scikit-cuda --user

FWIW, I installed everything except Theano into a clean Conda env

In ~/.theanorc:

[nvcc]
flags=-D_FORCE_INLINES

[cuda]
root=/global/software/sl-7.x86_64/modules/langs/cuda/7.5/
 
[global]
# device = gpu   # I don't recommend this unless you only use 1 GPU (see below)
floatX = float32

Job configuration

You definitely need at least the following lines in your job script:

  • #SBATCH --partition=savio2_gpu
  • #SBATCH --gres=gpu:<n> where <n> is an integer number of GPUs to request

Note: Savio requires you to request 2 CPUs per GPU, or you will get the error Unable to allocate resources: Invalid generic resource (gres) specification. So set --ntasks and --cpus-per-task appropriately.

To parallelize as efficiently as possible (for my purposes), I used the following configuration (slurm job script and python script):

#SBATCH --partition=savio2_gpu
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --array=1-3

Each node in the savio2_gpu partition has 4 GPUs, so I request a single node (--nodes=1), all of its GPUs (--gres=gpu:4), and 4 tasks on that node (--ntasks-per-node=4). Then I copy this configuration 3 times (--array=1-3) for a total of 12 tasks using 12 GPUs and 24 CPUs (--cpus-per-task=2).

Each of my 4 tasks on each node selects which GPU to use by setting the THEANO_FLAGS environment variable based on the value of the SLURM_PROCID environment variable. In bash, this would look something like export THEANO_FLAGS=$THEANO_FLAGS,device=gpu${SLURM_PROCID}. This must be done before theano is imported, or before creating a subprocess that imports theano.
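In the job script itself, that per-task GPU selection might look like the following sketch (train.py is a hypothetical script; the key point is that THEANO_FLAGS is set per task, before Theano is imported):

```shell
# srun launches one copy per task (4 here, given --ntasks-per-node=4);
# each copy sees its own SLURM_PROCID (0..3) and binds Theano to the
# matching GPU before any Python code imports theano.
srun bash -c '
  export THEANO_FLAGS="$THEANO_FLAGS,device=gpu${SLURM_PROCID}"
  python train.py    # hypothetical script; imports theano only now
'
```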

Random notes

  • This page has a simple script to verify that you are using the GPU (and not the CPU). It runs fast (a few seconds), so it doesn't hurt to run it at the beginning of every job.
  • To see GPU information: nvidia-smi (only from within an interactive session where you have requested GPUs, obviously)
  • If you use the Python subprocess approach, I found that setting shell=True in the call to subprocess.call() fixes many issues.