Multi Node Slurm Cluster Setup for LLM Tuning - KrArunT/InfobellIT-Gen-AI GitHub Wiki

Slurm Setup

Installing Slurm on the Login Node

Create Users for Slurm and Munge

export MUNGEUSER=1001
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge

export SLURMUSER=1002
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm

Install Munge

sudo apt-get install -y munge
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/
sudo chmod 0711 /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo systemctl enable munge
sudo systemctl start munge
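To confirm that MUNGE is working, the standard self-check is to generate a credential and decode it locally. On a multi-node cluster, the same /etc/munge/munge.key must later be copied to every compute node so credentials validate across hosts.

```shell
# Generate a test credential and decode it; STATUS should report "Success"
munge -n | unmunge
```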

Install Slurm and Associated Components on the Controller Node

sudo apt-get install mariadb-server
sudo apt-get install slurmdbd
sudo apt-get install slurm-wlm

Configure MariaDB

Run the following commands as the root user. Create the accounting database first, then grant the slurm user access to it; the password you choose here must match the StoragePass value in /etc/slurm/slurmdbd.conf:

su
mysql
create database slurm_acct_db;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'your_password' with grant option;
exit
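As a quick sanity check (assuming your_password is the placeholder used in the GRANT statement above), verify that the slurm database user can see the new database:

```shell
# Should list slurm_acct_db among the visible databases
mysql -u slurm -p'your_password' -e "show databases;"
```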

Create Slurm Configuration Files

sudo mkdir -p /etc/slurm
sudo nano /etc/slurm/slurmdbd.conf

Add the following lines to /etc/slurm/slurmdbd.conf:

AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurm/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=your_password
StorageUser=slurm

# Setting database purge parameters
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=2months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=12months
PurgeUsageAfter=12months
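Because slurmdbd.conf contains the database password, recent Slurm versions refuse to start slurmdbd unless the file is owned by the SlurmUser and not readable by other users:

```shell
# Restrict slurmdbd.conf to the slurm user; slurmdbd will not start otherwise
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf
```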

Generate Cluster Configurations

Visit [Slurm Configurator](https://slurm.schedmd.com/configurator.easy.html), generate the configuration file, and paste it into /etc/slurm/slurm.conf.

Example /etc/slurm/slurm.conf (adjust SlurmctldHost and the NodeName line to match your own hostnames and hardware; slurmd -C prints the correct node values):

ClusterName=cluster
SlurmctldHost=linux0
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=linux[1-32] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Firewall Configuration

sudo ufw allow 6817   # slurmctld
sudo ufw allow 6818   # slurmd
sudo ufw allow 6819   # slurmdbd

Configure the Controller Node

Run the following as root:

mkdir /var/spool/slurmctld
chown slurm:slurm /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
mkdir /var/log/slurm
touch /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log
touch /var/log/slurm/slurm_jobcomp.log
chown -R slurm:slurm /var/log/slurm/
chmod 755 /var/log/slurm
mkdir /run/slurm
chown slurm:slurm /run/slurm
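Note that /run is a tmpfs cleared on every boot, so a PID directory under /run/slurm will vanish after a reboot. A systemd-tmpfiles rule is a common way to recreate it automatically (the drop-in filename /etc/tmpfiles.d/slurm.conf is a suggested choice):

```shell
# Recreate /run/slurm, owned by the slurm user, on every boot
echo 'd /run/slurm 0755 slurm slurm -' | sudo tee /etc/tmpfiles.d/slurm.conf
# Apply the rule immediately without rebooting
sudo systemd-tmpfiles --create /etc/tmpfiles.d/slurm.conf
```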

Update the PID file locations in the systemd service files so they match the paths used by Slurm:

nano /usr/lib/systemd/system/slurmctld.service
# Change:
PIDFile=/run/slurmctld.pid
# To:
PIDFile=/run/slurm/slurmctld.pid

Make the same change in the other two unit files, pointing their PIDFile entries at /run/slurm/slurmdbd.pid and /run/slurm/slurmd.pid respectively:

nano /usr/lib/systemd/system/slurmdbd.service
nano /usr/lib/systemd/system/slurmd.service

Configure cgroup

echo CgroupMountpoint=/sys/fs/cgroup >> /etc/slurm/cgroup.conf

Run slurmd -C to print this node's actual hardware configuration (CPUs, sockets, cores, memory) in slurm.conf format; use its output to replace the placeholder NodeName line in /etc/slurm/slurm.conf:

slurmd -C

Start SLURM Services

systemctl daemon-reload
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl enable slurmctld
systemctl start slurmctld

Verify Services

systemctl status slurmdbd
systemctl status slurmctld

If a service is not active, inspect its logs with journalctl -u slurmdbd (or journalctl -u slurmctld) and check the files under /var/log/slurm/; fix any reported configuration error and restart the service. Reboot only as a last resort.
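Once both daemons are active, the scheduler itself can be sanity-checked from the controller with the standard Slurm client commands (srun will only succeed once at least one compute node's slurmd is up and registered):

```shell
sinfo                          # list partitions and node states
scontrol show config | head    # configuration slurmctld actually loaded
sacctmgr show cluster          # confirms slurmctld registered with slurmdbd
srun -N1 hostname              # run a trivial job on one compute node
```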