Multi Node Slurm Cluster Setup for LLM Tuning - KrArunT/InfobellIT-Gen-AI GitHub Wiki
Slurm Setup
Installing Slurm on the Login Node
Create Users for Slurm and Munge
export MUNGEUSER=1001
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1002
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
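The munge and slurm accounts need the same UID and GID on every node in the cluster. A minimal sketch for confirming that on each node, assuming the IDs 1001 and 1002 used above; `check_uid` is a hypothetical helper name, not part of Slurm or MUNGE:

```shell
# check_uid USER EXPECTED_UID
# Hypothetical helper: prints whether USER exists with the expected UID.
check_uid() {
    cu_actual=$(id -u "$1" 2>/dev/null) || { echo "$1: missing"; return 1; }
    if [ "$cu_actual" = "$2" ]; then
        echo "$1: ok ($cu_actual)"
    else
        echo "$1: uid $cu_actual, expected $2"
        return 1
    fi
}

# IDs taken from the commands above; run this on every node.
check_uid munge 1001 || true
check_uid slurm 1002 || true
```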
Install Munge
sudo apt-get install -y munge
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
# /run/munge holds the munged socket, so other users must be able to reach it
sudo chmod 0755 /run/munge/
sudo systemctl enable munge
sudo systemctl start munge
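Once munged is running, a quick local check is to encode a credential and decode it again. The sketch below wraps the standard `munge -n | unmunge` round trip; `munge_roundtrip` is a hypothetical wrapper name, and the check only covers the local node, not authentication between nodes:

```shell
# Hypothetical wrapper: encode an empty credential and decode it again.
munge_roundtrip() {
    if ! command -v munge >/dev/null 2>&1; then
        echo "munge: not installed"
        return 2
    fi
    # -n encodes without reading a payload; unmunge exits non-zero on failure.
    munge -n | unmunge
}

munge_roundtrip || echo "MUNGE check failed; inspect 'systemctl status munge'"
```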
Install Slurm and Associated Components on the Controller Node
sudo apt-get install -y mariadb-server
sudo apt-get install -y slurmdbd
sudo apt-get install -y slurm-wlm
Configure MariaDB
Run the following commands as the root user:
su
mysql
create database slurm_acct_db;
-- 'password' is a placeholder; it must match StoragePass in /etc/slurm/slurmdbd.conf
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'password' with grant option;
exit
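The same database setup can be scripted instead of typed into the mysql prompt. This is a sketch under stated assumptions: `slurm_db_sql` is a hypothetical helper, the password is passed as an argument and must not contain single quotes or backslashes, and `sudo mysql` is assumed to log in as the database root user on your system.

```shell
# Hypothetical helper: print the SQL that creates the accounting database
# and grants the slurm user access. $1 is the database password.
slurm_db_sql() {
    case $1 in
        "" | *"'"* | *\\*)
            echo "password must be non-empty, without quotes or backslashes" >&2
            return 1 ;;
    esac
    printf "CREATE DATABASE IF NOT EXISTS slurm_acct_db;\n"
    printf "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '%s' WITH GRANT OPTION;\n" "$1"
}

# Review the statements first, then pipe them to the server, for example:
#   slurm_db_sql 'password' | sudo mysql
slurm_db_sql 'password'
```

Whatever password is used here has to be the one set as StoragePass in slurmdbd.conf.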
Create Slurm Configuration Files
sudo mkdir -p /etc/slurm
sudo nano /etc/slurm/slurmdbd.conf
Add the following lines to /etc/slurm/slurmdbd.conf:
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurm/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=password
StorageUser=slurm
# Setting database purge parameters
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=2months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=12months
PurgeUsageAfter=12months
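Because slurmdbd.conf stores the database password in plain text, it should be readable only by the slurm user, and newer Slurm releases may refuse to start slurmdbd when group or others can access it. A minimal sketch of a check, where `is_private` is a hypothetical helper and `stat -c` assumes GNU coreutils:

```shell
# Hypothetical helper: succeed only if FILE is inaccessible to group and others.
is_private() {
    ip_mode=$(stat -c %a "$1") || return 2
    case $ip_mode in
        0 | *00) return 0 ;;   # e.g. 600 or 400
        *)       return 1 ;;   # e.g. 644 or 640
    esac
}

conf=/etc/slurm/slurmdbd.conf
if [ -f "$conf" ] && ! is_private "$conf"; then
    echo "$conf is accessible by group or others; suggested fix:"
    echo "  sudo chown slurm:slurm $conf && sudo chmod 600 $conf"
fi
```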
Generate Cluster Configurations
Visit [Slurm Configurator](https://slurm.schedmd.com/configurator.easy.html), generate the configuration file, and paste it into /etc/slurm/slurm.conf.
Example /etc/slurm/slurm.conf:
ClusterName=cluster
SlurmctldHost=linux0
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=linux[1-32] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
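Several later steps reuse paths that are set in slurm.conf, such as StateSaveLocation and the log files. A small sketch for reading them back from the file rather than retyping them; `conf_get` is a hypothetical helper that handles plain Key=Value lines only (no Include directives or line continuations):

```shell
# conf_get KEY FILE
# Hypothetical helper: print the last value assigned to KEY in a Slurm-style
# Key=Value file. Keys are matched case-insensitively; comment lines are skipped.
conf_get() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }
        tolower($1) == tolower(key) { sub(/^[^=]*=/, ""); value = $0 }
        END { if (value != "") print value }
    ' "$2"
}

conf=/etc/slurm/slurm.conf
if [ -f "$conf" ]; then
    echo "state dir:  $(conf_get StateSaveLocation "$conf")"
    echo "ctld log:   $(conf_get SlurmctldLogFile "$conf")"
    echo "controller: $(conf_get SlurmctldHost "$conf")"
fi
```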
Firewall Configuration
sudo ufw allow 6817   # slurmctld
sudo ufw allow 6818   # slurmd
sudo ufw allow 6819   # slurmdbd
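The rules above open the three ports to any source address. If the nodes share a private network, the ports can be limited to that network instead. The sketch below only prints the ufw commands for review; the subnet 10.0.0.0/24 is a placeholder assumption and `slurm_ufw_rules` is a hypothetical helper:

```shell
# Placeholder subnet: replace with the network your cluster nodes share.
CLUSTER_SUBNET=${CLUSTER_SUBNET:-10.0.0.0/24}

# Print (rather than run) one ufw rule per Slurm port so it can be reviewed.
slurm_ufw_rules() {
    for port in 6817 6818 6819; do
        echo "sudo ufw allow from $1 to any port $port proto tcp"
    done
}

slurm_ufw_rules "$CLUSTER_SUBNET"
# To apply after review:  slurm_ufw_rules "$CLUSTER_SUBNET" | sh
```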
Configure the Controller Node
Run the following as root:
mkdir -p /var/spool/slurmctld
chown slurm:slurm /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
mkdir -p /var/log/slurm
touch /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log
touch /var/log/slurm/slurm_jobcomp.log
chown -R slurm:slurm /var/log/slurm/
chmod 755 /var/log/slurm
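Some of these directories may already exist after the packages are installed, so it helps if the step can be rerun safely. A minimal sketch using `install -d`, which creates a directory with the given mode and succeeds if it is already there; `ensure_dir` is a hypothetical helper, demonstrated here in a scratch directory:

```shell
# ensure_dir DIR MODE [OWNER]
# Hypothetical helper: create DIR (and parents) with MODE if it is missing.
# Ownership is only changed when running as root and OWNER exists.
ensure_dir() {
    install -d -m "$2" "$1" || return 1
    if [ -n "${3:-}" ] && [ "$(id -u)" -eq 0 ] && id "$3" >/dev/null 2>&1; then
        chown "$3:$3" "$1"
    fi
}

# Example with the paths used above (run as root):
#   ensure_dir /var/spool/slurmctld 755 slurm
#   ensure_dir /var/log/slurm 755 slurm
demo=$(mktemp -d)
ensure_dir "$demo/spool/slurmctld" 755
ensure_dir "$demo/spool/slurmctld" 755   # rerunning succeeds instead of failing
ls -ld "$demo/spool/slurmctld"
rm -rf "$demo"
```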
Update PID file locations in systemd service files:
nano /usr/lib/systemd/system/slurmctld.service
# Change:
PIDFile=/run/slurmctld.pid
# To:
PIDFile=/run/slurm/slurmctld.pid
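Make the corresponding PIDFile change (to /run/slurm/slurmdbd.pid and /run/slurm/slurmd.pid) in the other two service files:
nano /usr/lib/systemd/system/slurmdbd.service
nano /usr/lib/systemd/system/slurmd.service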
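Files under /usr/lib/systemd/system belong to the packages and can be replaced on upgrade. systemd also reads drop-in files from /etc/systemd/system/UNIT.service.d/, which can override a single setting such as PIDFile. The sketch below writes such drop-ins into a scratch directory as a dry run; `write_pid_override` and its optional path prefix are hypothetical, not part of systemd or Slurm:

```shell
# write_pid_override UNIT PIDFILE [ROOT]
# Hypothetical helper: write a systemd drop-in that overrides PIDFile for UNIT.
# ROOT is an optional path prefix for trying this out without touching /etc.
write_pid_override() {
    wpo_dir="${3:-}/etc/systemd/system/$1.service.d"
    mkdir -p "$wpo_dir" || return 1
    printf '[Service]\nPIDFile=%s\n' "$2" > "$wpo_dir/override.conf"
    echo "$wpo_dir/override.conf"
}

# Dry run into a scratch directory. As root, omit the third argument and
# then run 'systemctl daemon-reload'.
scratch=$(mktemp -d)
for unit in slurmctld slurmdbd slurmd; do
    write_pid_override "$unit" "/run/slurm/$unit.pid" "$scratch"
done
cat "$scratch/etc/systemd/system/slurmctld.service.d/override.conf"
rm -rf "$scratch"
```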
Configure cgroup
echo CgroupMountpoint=/sys/fs/cgroup >> /etc/slurm/cgroup.conf
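# Print this node's detected hardware (CPUs, sockets, memory) for use in the NodeName line of slurm.conf
slurmd -C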
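Appending with `>>` adds a duplicate line each time the step is repeated. A minimal sketch of an append that only writes the line when it is missing; `append_once` is a hypothetical helper, shown against a temporary file:

```shell
# append_once LINE FILE
# Hypothetical helper: append LINE to FILE only if that exact line is absent.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Example with the setting used above (run as root):
#   append_once 'CgroupMountpoint=/sys/fs/cgroup' /etc/slurm/cgroup.conf
demo=$(mktemp)
append_once 'CgroupMountpoint=/sys/fs/cgroup' "$demo"
append_once 'CgroupMountpoint=/sys/fs/cgroup' "$demo"
cat "$demo"    # the line appears once
rm -f "$demo"
```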
Start Slurm Services
systemctl daemon-reload
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl enable slurmctld
systemctl start slurmctld
Verify Services
systemctl status slurmdbd
systemctl status slurmctld
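If a service is not active, check its log (/var/log/slurm/slurmdbd.log or /var/log/slurm/slurmctld.log) or the output of journalctl -u for that service to find the error. If nothing stands out, reboot the system and check again.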
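The status checks can also be scripted so that a failing service points straight at its journal. A minimal sketch, assuming systemd; `check_units` is a hypothetical helper built on `systemctl is-active`:

```shell
# check_units UNIT...
# Hypothetical helper: print one line per unit and return non-zero if any
# unit is not active.
check_units() {
    chk_rc=0
    for chk_unit in "$@"; do
        if systemctl is-active --quiet "$chk_unit" 2>/dev/null; then
            echo "$chk_unit: active"
        else
            echo "$chk_unit: NOT active (see: journalctl -u $chk_unit -n 50)"
            chk_rc=1
        fi
    done
    return $chk_rc
}

check_units slurmdbd slurmctld || true
```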