Using Savio
Savio is one of three Berkeley Research Computing (BRC) high performance computing (HPC) clusters provided by the university.
Note that the Theunissen lab has two projects, "fc_birdpow" (free allowance, depleted until May/June 2018) and "ac_birdpow" (paid allowance, about 100,000 Service Units left as of 2/2018). These two project names, also known in the BRC docs as "account names", should be interchangeable everywhere they are used on this page.
Getting an account (and associated "once only" steps)

See Getting an account and logging into BRC clusters.
- Fill out this account request form. Choose "Savio" as the cluster you'd like to access. This step requires manual approval and will take 1-2 days.
- You should get an email requesting you to fill out this access agreement form.
- Once your account is active, fill out this form to link your CalNet (or Facebook/Google/LinkedIn) account to your BRC HPC account.
- After a few minutes, you should get an email with instructions to complete the linking.
- Download and open the Google Authenticator app on your phone (this will be used to create one-time passwords for logging into Savio).
- Once your account is linked, go to the "Non-LBL Token Management page" on another device (not the one running Google Authenticator), add a token, and scan the barcode from the Authenticator app. Remember the PIN you chose when you made the token.
- Open the Google Authenticator app on your phone.
- The login nodes are hpc.brc.berkeley.edu: `ssh <username>@hpc.brc.berkeley.edu`. At the password prompt, enter the PIN you chose when you made the token, followed immediately by the 6-digit one-time password displayed in the app. For example, if your PIN is 1234 and the code in the app is 888 888, you should enter "1234888888".
The data transfer node is dtn.brc.berkeley.edu (same login procedure as described above). From a lab machine, to transfer a file to your home directory:

```
scp <local file> <username>@dtn.brc.berkeley.edu:/global/home/users/<username>
```

To transfer to your scratch directory:

```
scp <local file> <username>@dtn.brc.berkeley.edu:/global/scratch/<username>
```
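`rsync` works through the data transfer node in the same way; a minimal sketch, with assumed local and remote paths:

```
rsync -avz /path/to/local/data/ <username>@dtn.brc.berkeley.edu:/global/scratch/<username>/data/
```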
Note that `scp` and `rsync` will fail if your `.bashrc` writes any output. To fix this, put lines that write output under `if shopt -q login_shell; then [...]; fi`.
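For example, a guarded `.bashrc` might look like this (the `echo` line is just a hypothetical example of output-producing content):

```
# ~/.bashrc on Savio
if shopt -q login_shell; then
    # lines that write output go here, e.g. (hypothetical):
    echo "Welcome to Savio"
fi
```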
It will probably be useful to access the lab network and tdrive from savio in order to copy data to the HPC cluster. Here's how to do that.
- While on a savio node, create an RSA key in `/global/home/users/<username>/.ssh` with `ssh-keygen -t rsa -f id_rsa_savio`.
- Edit `/global/home/users/<username>/.ssh/config` on savio to tell it to use the correct key and point it to the host (zebra):

```
Host TheunissenLab
    User USERNAME_ON_FET_CLUSTER
    IdentityFile ~/.ssh/id_rsa_savio
    HostName 169.229.219.171
```

- Copy the public key onto the FET cluster into the `/auto/fhome/USERNAME_ON_FET_CLUSTER/.ssh/` directory (see the sketch below this list).
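A minimal sketch of that last step, run from Savio (assuming password login to the lab machine still works at this point; appending the key to `authorized_keys` is also typically needed for key-based login):

```
# copy the public key into the .ssh directory on the FET cluster
scp ~/.ssh/id_rsa_savio.pub USERNAME_ON_FET_CLUSTER@169.229.219.171:/auto/fhome/USERNAME_ON_FET_CLUSTER/.ssh/
# usually the key also needs to be appended to authorized_keys
ssh USERNAME_ON_FET_CLUSTER@169.229.219.171 'cat ~/.ssh/id_rsa_savio.pub >> ~/.ssh/authorized_keys'
```

After this, `ssh TheunissenLab` from Savio should connect using the key.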
See Running your jobs and Accessing and installing software
The SLURM Scheduler is used to manage the workload on Savio.
- Start an interactive session: `srun --pty -A fc_birdpow -p savio2 -t 00:00:30 bash -i`
- Submit a job: `sbatch myjob.sh` (see example job scripts at Running your jobs, and the minimal sketch below this list)
- View the job queue: `squeue` or `squeue --user <username>`
- Cancel job #123: `scancel 123`
- Cancel all your jobs: `scancel -u <username>`
- View your permissions: `sacctmgr -p show associations user=<username>`
- Check the project's usage allowance: `check_usage.sh -a fc_birdpow` (If the project is out of allowance, it will say "0 jobs, 0 hours; allowance exceeded"; to see the actual usage, specify the most recent June 1st as the start date with `-s`: `check_usage.sh -s 2017-06-01 -a fc_birdpow`.) Service Units are defined here.
- Check your own usage: `check_usage.sh -u <username>`
- View the memory usage of a job: `sacct -o MaxRSS -j JOBID` (if your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with `-S YYYY-MM-DD`)
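For reference, a minimal sketch of what a `myjob.sh` might look like (the account and partition come from the examples above; the job name, time limit, and final command are placeholders, not the lab's actual script):

```
#!/bin/bash
#SBATCH --job-name=myjob          # name shown in squeue (placeholder)
#SBATCH --account=fc_birdpow      # or ac_birdpow (see note at the top of this page)
#SBATCH --partition=savio2
#SBATCH --time=00:30:00           # walltime limit (hh:mm:ss), placeholder

# the actual work; this command is a placeholder
python my_analysis.py
```

Submit it with `sbatch myjob.sh`.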
- Request multiple tasks for your job with the `--ntasks=<n>` parameter in your job script (or `sbatch` command). Then any command executed as `srun <command>` in your job script will be executed `n` times, each in an environment with the environment variable `SLURM_PROCID` set to 0, 1, ..., n-1.
- You can specify the distribution of tasks across nodes using `--nodes=<n>` and `--ntasks-per-node=<m>` in your job script (or `sbatch` command). If you're doing this kind of thing, there are tons more options that you should read about elsewhere.
- If you are running code that can take advantage of multiple CPUs (MATLAB's `parfor`?), request multiple CPUs with the `--cpus-per-task=<n>` parameter in your job script (or `sbatch` command).
- A cheap way to parallelize independent jobs (parameter sweeps, Monte Carlo experiments, etc.) is to use the `--array=1-<n>` option in your job script. This is essentially equivalent to running `sbatch myjob.sh` `n` times. Your code can figure out its instance's array index (and thus what work to perform) by reading the `SLURM_ARRAY_TASK_ID` environment variable (e.g. in Python: `import os; os.environ['SLURM_ARRAY_TASK_ID']`). See the sketch below this list for an example combining `--ntasks` and `--array`.
- To use pip: `module load python/2.7` (or 3.5 or 3.6)
- Consider putting `#SBATCH --mail-type=ALL` and `#SBATCH --mail-user=<your email>` in your job script for email alerts
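As a concrete illustration of the `--ntasks` and `--array` items above, here is a hedged sketch of a job script combining both (the worker script name and time limit are assumptions):

```
#!/bin/bash
#SBATCH --account=fc_birdpow
#SBATCH --partition=savio2
#SBATCH --time=01:00:00
#SBATCH --ntasks=4        # srun below launches the command 4 times
#SBATCH --array=1-10      # 10 independent copies of this job

# Each srun-launched task sees SLURM_PROCID (0-3) and SLURM_ARRAY_TASK_ID (1-10);
# my_worker.py (hypothetical) reads these to decide which piece of work to do.
srun python my_worker.py
```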
Theano 0.9.0 is installed by default on Savio nodes. I was not able to get any other version of Theano to work with the GPUs.
```
module load python/2.7   # installs pip; other versions also available
module load cuda/7.5     # other versions also available, but I could not get them to work
pip install scikit-cuda --user
```

FWIW, I installed everything except Theano into a clean Conda env.
In `~/.theanorc`:

```
[nvcc]
flags=-D_FORCE_INLINES

[cuda]
root=/global/software/sl-7.x86_64/modules/langs/cuda/7.5/

[global]
# device = gpu   # I don't recommend this unless you only use 1 GPU (see below)
floatX = float32
```
You definitely need at least the following lines in your job script:

- `#SBATCH --partition=savio2_gpu`
- `#SBATCH --gres=gpu:<n>`, where `<n>` is an integer number of GPUs to request
Note: Savio requires you to use 2 CPUs per GPU requested, or you will get `error: Unable to allocate resources: Invalid generic resource (gres) specification`. So set `--ntasks` and `--cpus-per-task` appropriately.
To parallelize as efficiently as possible (for my purposes), I used the following configuration (SLURM job script and Python script):

```
#SBATCH --partition=savio2_gpu
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --array=1-3
```
Each node in the savio2_gpu partition has 4 GPUs, so I request a single node (`--nodes=1`), all of its GPUs (`--gres=gpu:4`), and 4 tasks on that node (`--ntasks-per-node=4`). Then I copy this configuration 3 times (`--array=1-3`) for a total of 12 tasks using 12 GPUs and 24 CPUs (`--cpus-per-task=2`).
Each of my 4 tasks on each node selects which GPU to use by setting the `THEANO_FLAGS` environment variable based on the value of the `SLURM_PROCID` environment variable. In bash, this would look something like `export THEANO_FLAGS=$THEANO_FLAGS,device=gpu${SLURM_PROCID}`. This must be done before Theano is imported, or before creating a subprocess that imports Theano.
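Inside the job script, that per-task selection could look something like this (a sketch; `train_model.py` is an assumed worker script that imports Theano):

```
# launch one worker per task; each exports its own device before importing Theano
srun bash -c 'export THEANO_FLAGS="$THEANO_FLAGS,device=gpu${SLURM_PROCID}"; python train_model.py'
```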
- This page has a simple script to verify that you are using the GPU (and not the CPU). It runs fast (a few seconds), so it doesn't hurt to run it at the beginning of every job.
- To see GPU information: `nvidia-smi` (only from within an interactive session where you have requested GPUs, obviously).
- If you use the Python subprocess approach, I found that setting `shell=True` in the call to `subprocess.call()` fixes many issues.