Great Lakes Troubleshooting - raeker/ARC-Wiki-Test GitHub Wiki
- Check if user exists
- id <user>
- Show all Slurm Accounts for a specified user
*** If <user> is not given, 'my_accounts' will display your accounts ***
- my_accounts <user>
- Lookup Slurm root and sub accounts
- sacctmgr list account <slurm_account_root> *** Root ***
- sacctmgr list account <slurm_account> *** Sub ***
- Lookup users and limits on a Slurm sub account
- sacctmgr list assoc cluster=greatlakes account=<slurm_account>
*** Possible scripts written in the future ***
- Queued or running jobs:
*** Data from this command only lasts ~3 minutes after the job completes ***
- scontrol show job <jobID>
- Recently completed jobs:
*** Add <header>%<number> formats where needed for the '--format' headers ***
- sacct -j <jobID> --format=Cluster,Account,Partition,User,JobName,Submit,Start,End,ReqNodes,ReqCPUS,ReqMem,ReqGRES,AveCPU,AveVMSize,AveRSS,Priority,QOS,State,ExitCode
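Per the note above, `<header>%<number>` widths keep wide fields from truncating. A minimal sketch; the widths shown and the job ID 12345 are illustrative placeholders, not required values:

```shell
# Hypothetical column widths; adjust per field. 12345 is a placeholder job ID.
fmt="Cluster%12,Account%20,Partition%12,User%10,JobName%30,Submit%20,State%12,ExitCode%8"
# On a login node with Slurm available you would run:
#   sacct -j 12345 --format="$fmt"
echo "$fmt"
```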
- Any jobs that cannot be diagnosed or looked into with Slurm commands should be troubleshot via Kibana (https://kibana.arc-ts.umich.edu/).
- On left tab bar, select "Discover"
- On the drop-down below "Add a filter+", choose either: 1) "logstash-hpc-greatlakes-*" or 2) "slurm"
- Enter the JobID in the search bar
- If an entry is found, you can select the small arrow to the left of the entry for more detailed, human-readable data
- Job History
- sacct -o reqtres%-40,state,exitcode,jobname,partition,account,submit -S <start_date> -E <end_date> -A <slurm_account> -u <user>
- Add time with 'scontrol' *** Helpdesk is unable to run 'scontrol' with sudo; if someone asks for this, the answer is almost always no ***
- sudo /opt/slurm/bin/scontrol update jobid=<jobID> TimeLimit=<new_time_limit>
- Current easiest way (ssh and remove):
- sudo su - <user>
- scancel <jobID>
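The ssh-and-cancel steps above can be wrapped in a small helper; whether local sudo policy permits `sudo -u` in place of a full `sudo su -` login shell is an assumption here, so the sketch only prints the command (dry run):

```shell
# Dry-run sketch: print the command that would cancel a job on a user's behalf.
# 'sudo -u' replacing 'sudo su -' is an assumption about local sudo policy.
cancel_as_user() {
  echo "sudo -u $1 scancel $2"
}
cancel_as_user someuser 987654   # someuser / 987654 are placeholders
```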
- Home
- /usr/arcts/systems/scripts/homequota
- Scratch
- /usr/local/bin/scratch-quota
*** Most fields might be empty, but they can always vary, so be aware of that ***
- QOS Limits
*** Few limits are applied on the QOS ***
- sacctmgr show qos
- sacctmgr show qos <qos_name>
Important QOS to know, and limits, are:
- largemem → MaxTRESPA (Maximum Trackable Resources Per Account) - 36 cores, 1500GB
- training → Priority and MaxTRESPU (Maximum Trackable Resources Per User) - 1000, 4 cores, 20GB
- gpu → MinTRES (Minimum Trackable Resource) - 1 GPU. *** User must request at least 1 GPU when submitting to the 'gpu' partition ***
- debug → MaxWall (Maximum Walltime) and MaxTRESPU (Maximum Trackable Resources Per User) - 04:00:00, 8 cores, 40GB
- standard-oc → MaxTRESPA (Maximum Trackable Resources Per Account) - 108 cores, 900GB
- class → Priority - 1000 *** This applies to each Slurm Account ending in 'class_root' ***
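When scanning several QOS at once, pipe-delimited output from `sacctmgr -P` is easier to filter than the default table. A sketch, assuming an illustrative column layout (real output has many more columns):

```shell
# Parse pipe-delimited QOS output into "name: per-user limit" lines.
# The here-doc stands in for: sacctmgr show qos -P format=Name,Priority,MaxTRESPU
awk -F'|' 'NR>1 {print $1": "$3}' <<'EOF'
Name|Priority|MaxTRESPU
debug|0|cpu=8,mem=40G
training|1000|cpu=4,mem=20G
EOF
```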
- "Cluster" Limits
*** The "cluster" limits are applied on each '_root' account ***
- sacctmgr list assoc cluster=greatlakes account=<slurm_account_root>
- sacctmgr list assoc cluster=greatlakes account=<slurm_account_root> format=Cluster%15,Account%20,GrpTRES%30 *** Where the "cluster" limits are ***
Current "Cluster" limits are:
- 500 cores, 5 GPUs, 3500GB per Root Account
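The GrpTRES field comes back as one comma-separated string; splitting it makes the limits easier to read. The string below mirrors the 500-core / 5-GPU / 3500GB root-account limit above:

```shell
# Split a GrpTRES string into one resource per line.
grptres="cpu=500,gres/gpu=5,mem=3500G"
echo "$grptres" | tr ',' '\n'
```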
- Slurm Account Limits
- sacctmgr list assoc cluster=greatlakes account=<slurm_account> | head -3 *** Limits are the first line below the headers ***
Important Slurm Accounts to know limits for:
- engin_root - 90 cores, 2 GPUs, 450GB
- lsa1 - 120 cores, no GPUs, 600GB
- lsa2 - 10 GPUs
- training - 72 cores, 10 GPUs, 360GB
- User Limits under Slurm Account
- sacctmgr list assoc cluster=greatlakes account=<slurm_account> | head -4 *** Limits are the second line below the headers, but will include a username ***
Important Slurm Accounts to know User Limits for:
- engin1 - 36 cores, 2 GPUs, 180GB
- lsa1 - 24 cores, 1 GPU, 120GB, 1440-minute GPU walltime (24 hours)
- training - MaxWall - 1 hour
1) View the queue for a specified partition
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -r <partition>
2) View the queue for a specified account
- squeue -A <slurm_account>
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -A <slurm_account>
3) View the queue for a specified user
- squeue -u <user>
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -u <user>
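To see which pending jobs will likely start first, the output can be sorted by the priority column. A sketch using pipe-delimited (`-P`, parsable) output; the sample rows are fabricated:

```shell
# Sort pending jobs by priority, highest first. The here-doc stands in for
# pipe-delimited output like: sacct ... -o user,priority,state -P
sort -t'|' -k2,2nr <<'EOF'
alice|1200|PENDING
bob|4300|PENDING
carol|900|PENDING
EOF
```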
- Check home directory quota: home-quota
- Check scratch directory quota: scratch-quota <accountname>_root
- Check Turbo quota: /nfs/turbo/arcts-ops/Shared/Turbo/turbo2-bin/turbo-getsnaps -n <volume name>
Node state codes are shortened as required for the field size. These node states may be followed by a special character to identify state flags associated with the node. The following node suffixes and states are used:
| Suffix / State | Definition |
|---|---|
| * | The node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes). |
| ~ | The node is presently in a power saving mode (typically running at reduced frequency). |
| # | The node is presently being powered up or configured. |
| % | The node is presently being powered down. |
| $ | The node is currently in a reservation with a flag value of "maintenance". |
| @ | The node is pending reboot. |
| ALLOCATED | The node has been allocated to one or more jobs. |
| ALLOCATED+ | The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING. |
| COMPLETING | All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job's processes have terminated and the Slurm epilog program (if any) has terminated. |
| DOWN | The node is unavailable for use. Slurm can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state. If a node resumes normal operation, Slurm can automatically return it to service. |
| DRAINED | The node is unavailable for use per system administrator request. |
| DRAINING | The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes. Nodes enter this state per system administrator request. |
| FAIL | The node is expected to fail soon and is unavailable for use per system administrator request. |
| FAILING | The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request. |
| FUTURE | The node is currently not fully configured, but expected to be available at some point in the indefinite future for use. |
| IDLE | The node is not allocated to any jobs and is available for use. |
| MAINT | The node is currently in a reservation with a flag value of "maintenance". |
| REBOOT | The node is currently scheduled to be rebooted. |
| MIXED | The node has some of its CPUs ALLOCATED while others are IDLE. |
| PERFCTRS (NPC) | Network Performance Counters associated with this node are in use, rendering this node as not usable for any other jobs. |
| POWER_DOWN | The node is currently powered down and not capable of running any jobs. |
| POWERING_DOWN | The node is currently powering down and not capable of running any jobs. |
| POWER_UP | The node is currently in the process of being powered up. |
| RESERVED | The node is in an advanced reservation and not generally available. |
| UNKNOWN | The Slurm controller has just started and the node's state has not yet been determined. |
*** Possible scripts written in the future ***
- Sinfo
*** Possible partition_str: 'standard*', 'standard-oc', 'largemem', 'gpu', 'viz', 'debug' ***
- sinfo or sinfo --long
- sinfo -O partition:13,nodehost:13,nodes:9,statelong:13,socketcorethread:10,cpus:8,cpusstate:16,memory:11,allocmem:11,gres:15,reason | grep -F "<partition_str>" *** This will give better detail but will not print headers ***
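A quick way to summarize cluster health is to tally nodes per state. A sketch, assuming `sinfo -h -N -o "%T"` (one state per node, no header) as the input; the here-doc is illustrative:

```shell
# Count nodes in each state. Pipe real data in with: sinfo -h -N -o "%T"
sort <<'EOF' | uniq -c | sort -rn
allocated
allocated
idle
mixed
idle
EOF
```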
- Log onto a compute node
*** Must be from gl-build.arc-ts.umich.edu ***
- ssh <node>
- Find GPUs in use
- nvidia-smi
- nvidia-smi | grep '^| [ ]\+[0-9]'
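nvidia-smi also has a query interface that is easier to filter than its default table. A sketch listing only busy GPUs; the here-doc mimics `--query-gpu` output (index, utilization) and is fabricated:

```shell
# Show only GPUs with nonzero utilization. On a GPU node, pipe in:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
awk -F', ' '$2+0 > 0 {print "GPU "$1" busy: "$2}' <<'EOF'
0, 97 %
1, 0 %
2, 45 %
EOF
```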
These steps can be useful to diagnose a downed or otherwise non-functional compute node.
- SSH into the node in question.
- Run ps aux | grep slurmd and look at the output to see if the slurm daemon is running.
- If slurmd is not running, find someone with root access and have them start it.
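The check above can be wrapped in a small helper; `pgrep -x` matches the exact process name, which avoids the classic problem of `grep slurmd` matching its own grep process:

```shell
# Report whether a daemon is running, by exact process name.
check_daemon() {
  if pgrep -x "$1" >/dev/null 2>&1; then
    echo "$1 is running"
  else
    echo "$1 is NOT running"
  fi
}
check_daemon slurmd
```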
- SSH into the node in question.
- Run vim /etc/slurm-llnl/slurm.conf
- Look for lines starting with NodeName= and find the node in question.
- An individual compute node could be configured wrong. I once had a node stay in DRAINING state until I fixed its RealMemory value; I had it in bits when it was supposed to be in megabytes.
- Find someone with root access and have them fix the config file if you are certain it is wrong.
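To inspect one node's definition without opening the whole file, grep for its NodeName line. The config lines below are fabricated and gl3001 is a hypothetical hostname; note RealMemory is expected in megabytes, exactly the unit bug described above:

```shell
# Pull one node's definition from slurm.conf. On the node itself:
#   grep '^NodeName=<node>' /etc/slurm-llnl/slurm.conf
grep '^NodeName=gl3001' <<'EOF'
NodeName=gl3001 CPUs=36 RealMemory=184320 State=UNKNOWN
NodeName=gl3002 CPUs=36 RealMemory=184320 State=UNKNOWN
EOF
```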
- SSH into the node in question.
- Run journalctl -xe
- Type /slurmd and press the Enter key, then use n and N to browse through the lines containing slurmd.
- Any number of things could be in this log. It's up to you to interpret those messages and document your findings here if they prove useful.
These steps can be useful to diagnose a downed or otherwise non-functional compute node.
- Run scontrol show node <hostname>
- The output will contain lines like State=DOWN and Reason=Node unexpectedly rebooted [root@2019-06-21T15:12:40]
- State and Reason are useful for diagnosing a non-functional node. If you don't understand them, you can at least paste those lines into Slack and someone else will likely understand.
- If you somehow have root access, you can bring a DOWN node back up by running scontrol update nodename=<hostname> state=resume
- The state of that node will change to State=IDLE and Reason will be cleared.
If a user requests to be removed from our "mailing lists" (this happens quite often when maintenance updates are sent through Footprints), they can only be removed from the mailing list if we also remove their account.
First, get CONFIRMATION from the user that we can deactivate their Flux User Account.
This is because each user is required to be on these mailing lists so we can send maintenance updates and other news. So if they want to be taken off the mailing lists, they can't have a User Account.
If the user does not have a valid standing with the University, we will go ahead and remove their account without their consent.
Regardless of whether there is data in the user's home directory, archive it.
You must be on gl-build to do this:
[gl-build]$ sudo /bin/tar czf /nfs/locker/oldhomedirs/userstodelete/username.tar.gz /home/username
Switch to nyxb to deactivate the User Account by running the following command (the flag used looks like a lowercase "L", but it's a capital "i"):
[nyxb]$ sudo /opt/mam/bin/mam-modify-user -I username
Now delete the user by running the userdel command; the -r option says to delete the home directory (and mail files).
[nyxb]$ sudo /usr/sbin/userdel -r username
Run updateidmgr.sh and passsync.pl
[nyxb]$ sudo /opt/moab/scripts/updateidmgr.sh
[nyxb]$ sudo /usr/arcts/systems/scripts/passsync.pl
That should remove the user from Flux.
To reactivate a user, create the user as if they were brand new. All of the components needed for creation are needed for reactivation.
Check to see whether there is an archived home directory. If there is, make sure that you are in the root directory, because /home is part of the path for the home directory being restored.
[nyxb]$ cd /
[nyxb]$ sudo tar xzvf /nfs/locker/flux-support/userstodelete/username.tar.gz
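The reason for cd-ing to / first is that tar stores the home directory under the relative path home/<user> (GNU tar strips the leading '/'), so extraction lands back in /home only when run from the root. A self-contained demo; all files here are fabricated in a temp directory:

```shell
# Demonstrate that the archive members use the relative path home/<user>,
# which is why a restore must be run from the root directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/home/testuser"
echo "data" > "$tmp/home/testuser/file.txt"
tar -C "$tmp" -czf "$tmp/testuser.tar.gz" home/testuser
tar tzf "$tmp/testuser.tar.gz"   # lists members with relative paths
rm -rf "$tmp"                    # clean up the demo files
```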