Slurm
First, make sure that every user who submits jobs to SLURM has the same user ID and group ID on all nodes.
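One way to check this is sketched below, assuming passwordless SSH access to every node. The function names `collect_ids` and `ids_consistent`, and the user and host names in the example, are hypothetical and not part of SLURM.

```shell
#!/bin/sh
# collect_ids USER HOST...: print "host uid gid" for USER on each HOST.
# Assumption: passwordless SSH to every host is set up.
collect_ids() {
    user=$1
    shift
    for host in "$@"; do
        printf '%s %s %s\n' "$host" \
            "$(ssh -n "$host" id -u "$user")" \
            "$(ssh -n "$host" id -g "$user")"
    done
}

# ids_consistent: read "host uid gid" lines on stdin and succeed only if
# all lines carry exactly one distinct uid:gid pair.
ids_consistent() {
    awk '{ print $2 ":" $3 }' | sort -u | awk 'END { exit (NR != 1) }'
}

# Example with a hypothetical user and host names:
# collect_ids alice node01 node02 | ids_consistent || echo "uid/gid mismatch"
```

Splitting collection from comparison keeps the comparison usable on IDs gathered some other way, for example from a saved list.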
sudo apt-get install slurm-llnl
Open /usr/share/doc/slurm-llnl/slurm-llnl-configurator.html in a web browser, fill it out, press the submit button, and save the resulting configuration to /etc/slurm-llnl/slurm.conf.
Copy /etc/slurm-llnl/slurm.conf to each node.
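The copy step can be scripted. The sketch below assumes that root can log in to the nodes over SSH (adjust the remote user otherwise). `push_file`, the `DRY_RUN` switch, and the host names are hypothetical.

```shell
#!/bin/sh
# push_file FILE HOST...: copy FILE to the same path on every HOST.
# Assumption: root SSH login is allowed; DRY_RUN=1 only prints the commands.
push_file() {
    file=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp $file root@$host:$file"
        else
            scp "$file" "root@$host:$file" || return 1
        fi
    done
}

# Example with hypothetical host names:
# push_file /etc/slurm-llnl/slurm.conf node01 node02
```

Running it once with `DRY_RUN=1` shows what would be copied where before anything is overwritten.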
Alternatively, put the configuration file on a shared file system accessible to all nodes. At CCRI, the file was placed at /data_synology/slurm/slurm.conf and the daemon startup script /etc/init.d/slurm-llnl was modified to point to the new location:
CONFDIR=/data_synology/slurm
Also create a symbolic link at the original location to satisfy the slurm daemon at startup:
sudo ln -s /data_synology/slurm/slurm.conf /etc/slurm-llnl/slurm.conf
sudo chown slurm /etc/slurm-llnl/slurm.conf
Create a munge key
sudo /usr/sbin/create-munge-key
Copy /etc/munge/munge.key to each node
Start the slurm service
sudo /etc/init.d/slurm-llnl start
Make munge happy by fixing the directories, ownership, and permissions it checks at startup
sudo mkdir -p /var/run/munge
sudo chown munge /var/run/munge/
sudo chmod g-w /var/log
sudo chown munge /etc/munge/munge.key
sudo chmod 700 /etc/munge/munge.key
Start munge
sudo /etc/init.d/munge start
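Once munge is running, a quick check that credentials can be created and decoded may save debugging later. The sketch below wraps the `munge -n` and `unmunge` commands that ship with munge. The function names and the host name are hypothetical, and the remote check assumes SSH access to the node.

```shell
#!/bin/sh
# check_munge_local: encode a credential and decode it on this host.
check_munge_local() {
    munge -n | unmunge > /dev/null
}

# check_munge_remote HOST: decode a locally created credential on HOST.
# This fails if HOST has a different munge.key or a badly skewed clock.
check_munge_remote() {
    munge -n | ssh "$1" unmunge > /dev/null
}

# Example with a hypothetical host name:
# check_munge_local && check_munge_remote node01 && echo "munge OK"
```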
Test it
srun --ntasks=12 --partition=global --label /bin/hostname
See the physical configuration of a node (CPUs, RAM, Sockets, etc.)
sudo slurmd -C
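The output of this command can serve as a starting point for the node definition in slurm.conf. The sketch below assumes that `slurmd -C` prints one line beginning with `NodeName=` followed by other lines such as the uptime. The helper name `node_definition` is hypothetical.

```shell
#!/bin/sh
# node_definition: keep only the NodeName=... line from `slurmd -C` output,
# dropping anything else (assumed here to be lines such as UpTime=...).
node_definition() {
    grep '^NodeName='
}

# Example:
# sudo slurmd -C | node_definition
```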
Update user/group credentials in SLURM after changing them in Linux
sudo scontrol reconfig
If the slurm daemon fails to start without an error message, run the controller in the foreground with verbose logging to see what's wrong
sudo slurmctld -Dvvvv
Get the status of all nodes
sinfo
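For scripting, the node-oriented output of `sinfo` can be filtered to list only nodes that are unavailable. This is a minimal sketch: the helper name `unavailable_nodes` is hypothetical, and it assumes that the state names of interest begin with `down` or `drain`.

```shell
#!/bin/sh
# unavailable_nodes: read "node state" lines on stdin and print each node
# whose state starts with "down" or "drain" (e.g. down, down*, drained).
# sort -u removes repeats, since a node can be listed once per partition.
unavailable_nodes() {
    awk '$2 ~ /^(down|drain)/ { print $1 }' | sort -u
}

# Example: -N lists one line per node, -h drops the header,
# %N is the node name and %T the node state.
# sinfo -N -h -o '%N %T' | unavailable_nodes
```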
See status of all jobs
squeue -al
Cancel all jobs of a user
scancel -u <user>
Take a node offline (newer Slurm versions may also require a reason=<text> argument)
sudo scontrol update nodename=<nodename> state=down
Bring a node back online
sudo scontrol update nodename=<nodename> state=idle