
# Slurm Troubleshooting

## Error: Invalid credential

```
slurmctld: error: Munge decode failed: Invalid credential
slurmctld: error: authentication: Invalid credential
slurmctld: error: slurm_receive_msg: Protocol authentication error
```

Solution: the munge key must be identical on the master and on all compute nodes. For some reason it sometimes differs (or is simply missing on the compute nodes). The error is solved by copying the key file (by default `/etc/munge/munge.key`) from the master to the compute nodes.
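A minimal sketch of the fix, assuming the default key location and a compute node named `node01` (both are examples):

```sh
# On the master: copy the munge key to the compute node.
scp /etc/munge/munge.key root@node01:/etc/munge/munge.key

# On the compute node: fix ownership and permissions (munge refuses
# a key readable by other users), then restart the daemons.
ssh root@node01 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
ssh root@node01 'service munge restart && service slurmd restart'
```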

## Error: MPICH does not work with several slurmd daemons on one physical node

MPICH does not work when a single physical node runs several slurmd daemons. It requires a "standard" cluster with one slurmd per node (virtualized or otherwise).

## Error: Compute element not appearing in Slurm

slurmctld shows:

```
slurmctld: error: Munge decode failed: Expired credential
slurmctld: ENCODED: Fri Sep 11 09:53:10 2015
slurmctld: DECODED: Wed Sep 16 04:40:32 2015
```

The timestamps show that the date is wrong. This sometimes happens after hibernating the machines. The problem is that ntpd refuses to step the clock when the offset exceeds the panic threshold (1000 seconds), so that limit has to be bypassed for the first adjustment.

Solution:

```sh
service ntpd stop    # stop ntpd
ntpd -gq             # -g allows an arbitrarily large first adjustment; -q exits after setting the clock
service ntpd start   # start ntpd again
```
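To verify the fix, a quick sanity check:

```sh
date      # the system time should now be correct
ntpq -p   # the listed peers should show a small offset
```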

## Several jobs executed on the same core

SOFTWARE STACK: Slurm 16.05, MVAPICH2 2.2b

PROBLEM: When Slurm is configured to share resources (sockets/cores), the MPI library assigns the same cores to different processes.

This happens when submitting jobs with `sbatch` whose batch scripts contain an `mpirun` or `mpiexec` command, as in the sketch below.
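For reference, a minimal job script of the kind that triggers the collision (the task count, partition name, and program are made-up examples):

```sh
#!/bin/bash
#SBATCH --ntasks=4            # example task count
#SBATCH --partition=shared    # hypothetical partition configured to share nodes

# mpirun launches the MPI ranks itself; without Slurm awareness,
# MVAPICH2 pins ranks starting from core 0 in every job, so two
# such jobs on the same node end up on the same cores.
mpirun -np 4 ./my_mpi_program
```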

CAUSE: MVAPICH2 is not aware of Slurm, so it does not know that several jobs are running on the same machine.

DIAGNOSIS: On the compute node, run `top`. To see on which core each process is running, press `f` to enter the fields menu, enable the `P` (last used CPU) column, and exit with `q`. Then look at the process names to detect collisions. A non-interactive alternative is sketched below.
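A non-interactive way to run the same check, assuming the MPI binary is called `my_mpi_program` (a placeholder):

```sh
# psr is the processor each process last ran on; the same psr value
# for processes belonging to different jobs indicates a collision.
ps -C my_mpi_program -o pid,psr,comm
```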

SOLUTION: Use cgroups in Slurm so that each task is constrained to the cores it was allocated.

Enable the cgroup task plugin in `slurm.conf`:

```
TaskPlugin=task/cgroup
```

Configure cgroups in `cgroup.conf` (in Slurm's etc folder):

```
CgroupAutomount=yes
ConstrainCores=yes
```
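After changing both files, the Slurm daemons must be restarted to pick up the new configuration (a sketch, using the same SysV-style service commands as elsewhere on this page):

```sh
service slurmctld restart   # on the master
service slurmd restart      # on every compute node
```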

Now Slurm decides on which particular core each task runs. The assignment can be inspected with `scontrol -dd show job <job_id>`.

Now, configure MVAPICH2 to listen to Slurm. To do so, compile MVAPICH2 with Slurm support by passing `--with-slurm=<path>` to `configure`.
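A build sketch, assuming Slurm is installed under `/usr` (the path is an example):

```sh
./configure --with-slurm=/usr   # point the build at the Slurm installation prefix
make
make install
```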

Then disable MVAPICH2's internal CPU affinity in `/etc/mvapich2.conf`:

```
MV2_ENABLE_AFFINITY=0
```
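The same parameter can also be set per job as an environment variable, which is handy for testing before touching the system-wide file:

```sh
export MV2_ENABLE_AFFINITY=0    # disable MVAPICH2's own pinning for this job only
mpirun -np 4 ./my_mpi_program   # placeholder program from the example above
```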
