Exx Slurm Setup - norlab-ulaval/Norlab_wiki GitHub Wiki
This guide gives a quick overview of how to install Slurm and Docker on compute servers.
To manage jobs on the Exx server, we use the Slurm Workload Manager. It lets us schedule jobs and manage resources such as GPUs.
Here's a short introduction to Slurm by Bull.
To install Slurm, we follow the instructions from slothparadise. Note that you may need to change this line to use a free user/group ID:
MUNGEUSER=991
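A minimal sketch of how one might check whether an ID is free before using it for the munge user. The helper names `id_is_free` and `first_free_id` are hypothetical, and counting down from 991 is only an assumption about where unused system IDs are likely to be found on the machine.

```shell
#!/bin/bash
# Succeed only if no user and no group already uses the numeric ID.
id_is_free() {
    ! getent passwd "$1" >/dev/null && ! getent group "$1" >/dev/null
}

# Hypothetical helper: starting at the given ID, count down to the first unused one.
first_free_id() {
    local id="$1"
    while [ "$id" -gt 0 ]; do
        if id_is_free "$id"; then
            echo "$id"
            return 0
        fi
        id=$((id - 1))
    done
    return 1
}

MUNGEUSER=$(first_free_id 991)
echo "MUNGEUSER=$MUNGEUSER"
```

If 991 is unused, the script prints it unchanged; otherwise it prints the next lower ID that has neither a user nor a group attached.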
Use the Slurm Configuration Tool to generate the configuration files. Here are the parameters we used:
Put the generated configurations into /etc/slurm.
Generate a cgroup configuration file from the example:
cp cgroup.conf.example cgroup.conf
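The copy step above can be wrapped so that an existing cgroup.conf is never overwritten. This is only a sketch: the function name `init_cgroup_conf` is hypothetical, and it assumes the example file sits in the same directory as the final configuration (/etc/slurm here).

```shell
#!/bin/bash
# Copy the example cgroup file into place only if no cgroup.conf exists yet,
# then print the effective settings (comments and blank lines skipped).
init_cgroup_conf() {
    local dir="${1:-/etc/slurm}"
    if [ ! -f "$dir/cgroup.conf" ]; then
        cp "$dir/cgroup.conf.example" "$dir/cgroup.conf"
    fi
    grep -Ev '^[[:space:]]*(#|$)' "$dir/cgroup.conf"
}

# Only act when the example file is actually present (run as root).
if [ -f /etc/slurm/cgroup.conf.example ]; then
    init_cgroup_conf /etc/slurm
fi
```

Printing the active lines makes it easy to compare them with the configuration stored for Exx.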
Configurations for Exx are stored here: willGuimont/exx_slurm_config
https://gist.github.com/DaisukeMiyamoto/d1dac9483ff0971d5d9f34000311d312
https://slurm.schedmd.com/accounting.html#mysql-configuration
https://slurm.schedmd.com/accounting.html#database-configuration
sacctmgr add user brian Account=norlab
https://www.mail-archive.com/[email protected]/msg04744.html
Ldd nvml
Install Docker Engine and follow the post-installation steps for Linux so that non-sudo users can use Docker.
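To confirm that the post-installation steps took effect, one can check group membership. This is a rough sketch: `user_in_group` is a hypothetical helper, and it only reads the group database, so a user who was just added may still need to log out and back in before the change applies to their session.

```shell
#!/bin/bash
# Return success if the given user belongs to the given group.
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

me=$(id -un)
if user_in_group "$me" docker; then
    echo "$me is in the docker group"
else
    echo "$me is not in the docker group; try: sudo usermod -aG docker $me"
fi
```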
Start docker on boot:
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum clean expire-cache
sudo yum install nvidia-container-toolkit -y
Verify that you can see GPUs from docker containers:
docker run --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi
Add these lines to the crontab with crontab -e:
0 9 * * 3 docker network prune -f
30 9 * * 3 docker container prune -f
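Both entries run on Wednesdays, at 09:00 and 09:30. As an alternative to editing the crontab by hand, the entries can be appended from a script. The sketch below uses a hypothetical helper, `add_cron_line`, which skips lines that are already present; the `-f` flag is included because the prune commands otherwise ask for confirmation, which a cron job cannot answer.

```shell
#!/bin/bash
# Read crontab text on stdin and write it back out, appending the given
# line only if an identical line is not already there.
add_cron_line() {
    local line="$1" current
    current=$(cat)
    [ -n "$current" ] && printf '%s\n' "$current"
    printf '%s\n' "$current" | grep -qxF -- "$line" || printf '%s\n' "$line"
}

# Example on sample input (no crontab is modified here):
printf '0 9 * * 3 docker network prune -f\n' \
    | add_cron_line '30 9 * * 3 docker container prune -f'
```

To apply it for real, one could pipe the current crontab through the helper and back, for example `crontab -l 2>/dev/null | add_cron_line '...' | crontab -`, as the user who should own the jobs.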
useradd -c 'Full name' -m -G docker <username>
Add the following line to the new user's .bashrc:
PATH=$PATH:/usr/local/bin
Add the user to the Slurm account manager:
sacctmgr add user <username> Account=norlab
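The account creation and accounting steps can be combined into one helper. This is only a sketch: `onboard_user`, `DRY_RUN`, and the sample user `jdoe` are hypothetical names, and the .bashrc change is left as a manual step. By default the helper only prints the commands; `sacctmgr -i` commits the change without asking for confirmation.

```shell
#!/bin/bash
# Hypothetical helper: print (or run) the onboarding commands for one user.
# DRY_RUN defaults to 1, so nothing is changed unless DRY_RUN=0 is set.
onboard_user() {
    local username="$1" fullname="$2" account="${3:-norlab}"
    local run="echo"
    [ "${DRY_RUN:-1}" = "0" ] && run=""
    # Create the Unix account with a home directory and docker group membership.
    $run useradd -c "$fullname" -m -G docker "$username"
    # Register the user under the Slurm account.
    $run sacctmgr -i add user "$username" Account="$account"
}

onboard_user jdoe "Jane Doe"
```

Running it with `DRY_RUN=0` as root would execute the two commands instead of printing them.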
Add an example job, example_job.sh:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=4-00:00
#SBATCH --job-name=ExampleJob
#SBATCH --output=%x-%j.out
docker run --rm bash -c "echo 'working from slurm'"
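A short sketch of how the example job might be submitted and tracked. With the `--output=%x-%j.out` pattern, Slurm writes the log to a file named after the job name and job ID. The helper `job_log_name` is hypothetical and simply mirrors that pattern; the submission itself only runs where `sbatch` is available.

```shell
#!/bin/bash
# Expand the --output pattern %x-%j.out for a given job name and job ID.
job_log_name() {
    printf '%s-%s.out\n' "$1" "$2"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints the job ID, optionally followed by ";cluster".
    job_id=$(sbatch --parsable example_job.sh)
    job_id=${job_id%%;*}
    # Show the job's state in the queue.
    squeue -j "$job_id"
    echo "Output will be written to $(job_log_name ExampleJob "$job_id")"
else
    echo "sbatch not found; run this on a machine with Slurm installed" >&2
fi
```

Once the job finishes, the log file should contain the line echoed from inside the container.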