Slurm - Gig77/wiki GitHub Wiki

Installation

First, make sure that every user who submits jobs to SLURM has the same user ID (UID) and group ID (GID) on all nodes!
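A minimal sketch of how this check could be automated. The function name check_ids and the input format ("<node> <user> <uid> <gid>", one line per node and user) are assumptions made for illustration, not part of SLURM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<node> <user> <uid> <gid>" lines on stdin and
# reports every user whose UID:GID pair differs from the first one seen.
check_ids() {
    awk '
        {
            user = $2
            ids = $3 ":" $4
            if (!(user in seen)) {
                seen[user] = ids          # first node defines the reference
            } else if (seen[user] != ids) {
                printf "MISMATCH %s: %s has %s, expected %s\n", user, $1, ids, seen[user]
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }        # non-zero exit if any mismatch was found
    '
}
```

The input could be collected over ssh, for example (node1, node2 and alice are placeholders):
for n in node1 node2; do echo "$n alice $(ssh "$n" id -u alice) $(ssh "$n" id -g alice)"; done | check_ids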

sudo apt-get install slurm-llnl

Open /usr/share/doc/slurm-llnl/slurm-llnl-configurator.html in a web browser, fill it out, press the submit button, and save the resulting configuration to /etc/slurm-llnl/slurm.conf.

Copy /etc/slurm-llnl/slurm.conf to each node.

Alternatively, put the conf file on a shared file system accessible by all nodes. At CCRI, this file was put at /data_synology/slurm/slurm.conf. The daemon startup script at /etc/init.d/slurm-llnl was modified to point to the new location:

CONFDIR=/data_synology/slurm 

Also create a symbolic link at the original location, because the Slurm daemon looks for its configuration there at startup:

sudo ln -s /data_synology/slurm/slurm.conf /etc/slurm-llnl/slurm.conf
sudo chown slurm /etc/slurm-llnl/slurm.conf 

Create a munge key

sudo /usr/sbin/create-munge-key

Copy /etc/munge/munge.key to each node
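Since every node must end up with an identical key, it may help to compare checksums after copying. The following is only a sketch: the function name keys_match and the input format ("<node> <checksum>" per line) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<node> <checksum>" lines on stdin and succeeds
# only if all nodes report the same checksum (and at least one line was read).
keys_match() {
    awk '
        { sums[$2]++ }                    # count each distinct checksum
        END {
            n = 0
            for (s in sums) n++
            exit (n == 1 ? 0 : 1)
        }
    '
}
```

One possible way to feed it, with node1 and node2 as placeholder hostnames:
for n in node1 node2; do echo "$n $(ssh "$n" sudo sha256sum /etc/munge/munge.key | cut -d' ' -f1)"; done | keys_match && echo "keys identical"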

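Make munge happy (create its runtime directory and fix ownership and permissions)

sudo mkdir -p /var/run/munge
sudo chown munge /var/run/munge/
sudo chmod g-w /var/log
sudo chown munge /etc/munge/munge.key 
sudo chmod 700 /etc/munge/munge.key

Start munge before Slurm, since the Slurm daemons rely on it for authentication

sudo /etc/init.d/munge start

Start the Slurm service

sudo /etc/init.d/slurm-llnl start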

Test it

srun --ntasks=12 --partition=global --label /bin/hostname
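With --label, each output line is prefixed with the task number, so the test output can be summarised per host. A small sketch, assuming the prefix has the form "<taskid>: " (possibly padded); the function name count_tasks_per_host is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads labelled srun output ("<taskid>: <hostname>")
# on stdin and prints "<hostname> <number of tasks>" for each host.
count_tasks_per_host() {
    sed 's/^ *[0-9]*: //' |      # strip the task label added by --label
        sort |
        uniq -c |                # count identical hostnames
        awk '{ print $2, $1 }'   # reorder to "<hostname> <count>"
}
```

Usage, with the same options as above:
srun --ntasks=12 --partition=global --label /bin/hostname | count_tasks_per_host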

Administration

See the physical configuration of a node (CPUs, RAM, Sockets, etc.)

sudo slurmd -C

Update user/group credentials in SLURM after changing them in Linux

sudo scontrol reconfig

If the Slurm controller daemon fails to start without an error message, run it in the foreground with verbose logging to see what's wrong

sudo slurmctld -Dvvvv

Get status of all nodes

sinfo
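To pick out only the nodes that need attention, the sinfo output can be filtered. This is a sketch under the assumption that sinfo is called with -N -h -o "%N %T" (one line per node, no header, node name and state); the function name problem_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<nodename> <state>" lines on stdin and prints
# the names of nodes whose state starts with down, drain or fail.
problem_nodes() {
    awk '$2 ~ /^(down|drain|fail)/ { print $1 }' | sort -u
}
```

Usage:
sinfo -N -h -o "%N %T" | problem_nodes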

See status of all jobs

squeue -al
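For a quick overview on a busy cluster, the job list can be condensed to one line per user and job state. A sketch, assuming squeue is called with -h -o "%u %T" (no header, user name and job state); the function name jobs_per_user is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<user> <state>" lines on stdin and prints
# "<user> <state> <count>", sorted by user and state.
jobs_per_user() {
    sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

Usage:
squeue -a -h -o "%u %T" | jobs_per_user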

Cancel all jobs of a user (running and pending)

scancel -u <user> 

Take node offline (scontrol may refuse to set a node down unless a reason is given)

sudo scontrol update nodename=<nodename> state=down reason="maintenance"

Bring node back online

sudo scontrol update nodename=<nodename> state=idle