# clima
`clima.gps.caltech.edu` is a GPU node with 8× NVIDIA A100 GPUs.
## Getting access
Email [email protected] to request access.
## Setting up
Unlike central, clima has only a handful of modules available. The recommended approach is to install the software you need in your home directory.
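To see what is provided (the exact list changes over time):

```
$ module avail
```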
### SSH config
Add the following to your local `~/.ssh/config` file:

```
Host clima
    HostName clima.gps.caltech.edu
    User [username]
```
To access clima from outside the Caltech network, either use the Caltech VPN, or add the following to proxy through `ssh.caltech.edu`:

```
Match final host !ssh.caltech.edu,*.caltech.edu !exec "nc -z -G 1 login.hpc.caltech.edu 22"
    ProxyJump ssh.caltech.edu
```
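The `exec` test uses `nc` to check whether `login.hpc.caltech.edu` is reachable on port 22; when it is not (i.e., you are outside the network), the connection is proxied through `ssh.caltech.edu`. Note that `-G` (connection timeout) is a BSD/macOS `nc` flag; on Linux you may need `-w 1` instead.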
## About the machine
### Storage
- `/home/[username]` (capped at 1TB): mounted from `sampo`, and is backed up
- `/net/sampo/data1` (200TB): mounted from `sampo`. Not backed up, but somewhat protected by a redundant RAID partition
- `/scratch` (70TB): fast SSD, not backed up and no RAID redundancy
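To check current capacity and usage of these filesystems:

```
$ df -h /home/$USER /net/sampo/data1 /scratch
```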
### CPU usage
Use `top` for a live view of current CPU usage.
### GPUs
clima has 8× NVIDIA 80GB A100 GPUs, connected via NVLink.
- `nvidia-smi` gives a summary of all the GPUs
- `nvidia-smi topo -m` shows the connections between GPUs and CPUs
- `nvtop` gives a live-refreshing view of current GPU usage
### Software
clima has a single-node installation of Slurm.
We have set up a common environment. You can load it with

```
module load common
```

which currently loads

```
openmpi/4.1.5-cuda julia/1.9.3 cuda/julia-pref
```
This also sets the appropriate Julia preferences, so you should not need to call, e.g., `MPIPreferences.use_system_binary()` yourself.
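To sanity-check that the preferences took effect (this assumes MPI.jl and CUDA.jl are installed in your active Julia environment), the following should report the system OpenMPI and CUDA installations rather than the bundled JLL binaries:

```
$ julia -e 'using MPI; MPI.versioninfo()'
$ julia -e 'using CUDA; CUDA.versioninfo()'
```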
## Usage etiquette
Please avoid using clima for long-running CPU-only jobs; the Resnick HPC cluster is better suited for those.
While the GPUs can be used directly, we recommend always scheduling jobs through Slurm: this prevents multiple jobs from being allocated to the same GPU, which can cause significant performance degradation.
For example:

```
$ srun --gpus=2 --pty bash -l   # request an interactive session with 2 GPUs
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-1768fcec-d945-7435-1f8e-85d30cdf310e)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-6420b6b9-bb34-a58d-8090-61887fd97931)
```
See also the Caltech-HPC notes on Slurm commands and interactive jobs: https://www.hpc.caltech.edu/documentation/slurm-commands
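For non-interactive work, submit a batch script instead. A minimal sketch (the job name, time limit, and script path below are placeholders, not project conventions):

```
#!/bin/bash
#SBATCH --job-name=my_gpu_job      # placeholder name
#SBATCH --gpus=1                   # Slurm restricts the job to the allocated GPU
#SBATCH --time=02:00:00            # placeholder time limit

module load common                 # OpenMPI + Julia + CUDA, with preferences set
srun julia --project my_script.jl  # my_script.jl is a placeholder
```

Submit it with `sbatch <script>.sh`; `squeue` shows its place in the queue.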
## Weekend scheduled runs on clima
(updated Jan 16, 2025)
### ClimaAtmos longruns
- Friday 10pm PST to Saturday 6pm PST (est.)
- runs use 18 × 1 GPU (up to 12h each)

### ClimaCoupler benchmarks
- Saturday 9pm PST to Sunday 12am PST (est.)
- runs use 4 × 4 GPUs (10-15 mins each)

### ClimaCoupler longruns
- Sunday 12am PST to Monday 12am PST (est.)
- runs use 2 × 1 GPU on clima (22h each)

### ClimaCoupler AMIP
- Sunday 12am PST to Wednesday 12am PST (est.)
- run uses 1 GPU (3 days)