Great Lakes Troubleshooting - raeker/ARC-Wiki-Test GitHub Wiki
- Check if user exists
- id <user>
- Show all Slurm Accounts for a specified user
*** If <user> is not given, 'my_accounts' will display your accounts ***
- my_accounts <user>
- Lookup Slurm root and sub accounts
- sacctmgr list account <slurm_account_root> *** Root ***
- sacctmgr list account <slurm_account> *** Sub ***
- Lookup users and limits on a Slurm sub account
- sacctmgr list assoc cluster=greatlakes account=<slurm_account>
*** Possible scripts written in the future ***
- Queued or running jobs:
*** Data from this command only lasts ~3 minutes after the job completes ***
- scontrol show job <jobID>
- Recently completed jobs:
*** Add <header>%<number> formats where needed for the '--format' headers ***
- sacct -j <jobID> --format=Cluster,Account,Partition,User,JobName,Submit,Start,End,ReqNodes,ReqCPUS,ReqMem,ReqGRES,AveCPU,AveVMSize,AveRSS,Priority,QOS,State,ExitCode
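Per the note above, `<header>%<number>` widths keep wide fields from truncating. A minimal sketch; the widths shown and the job ID 12345 are illustrative placeholders, not required values:

```shell
# Hypothetical column widths; adjust per field. 12345 is a placeholder job ID.
fmt="Cluster%12,Account%20,Partition%12,User%10,JobName%30,Submit%20,State%12,ExitCode%8"
# On a login node with Slurm available you would run:
#   sacct -j 12345 --format="$fmt"
echo "$fmt"
```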
- Any jobs that cannot be diagnosed or looked into with Slurm commands should be troubleshot via Kibana (https://kibana.arc-ts.umich.edu/).
- On left tab bar, select "Discover"
- On the drop-down below "Add a filter+", choose either: 1) "logstash-hpc-greatlakes-*" or 2) "slurm"
- Enter the JobID in the search bar
- If an entry is found, you can select the small arrow to the left of the entry for more detailed, human-readable data
- Job History
- sacct -o reqtres%-40,state,exitcode,jobname,partition,account,submit -S <start_date> -E <end_date> -A <slurm_account> -u <user>
- Add time with 'scontrol' *** Helpdesk is unable to run 'scontrol' with sudo; if someone asks for this, the answer is almost always no ***
- sudo /opt/slurm/bin/scontrol update jobid=<jobID> TimeLimit=<new_time_limit>
- Current easiest way (ssh and remove):
- sudo su - <user>
- scancel <jobID>
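The ssh-and-cancel steps above can be wrapped in a small helper; whether local sudo policy permits `sudo -u` in place of a full `sudo su -` login shell is an assumption here, so the sketch only prints the command (dry run):

```shell
# Dry-run sketch: print the command that would cancel a job on a user's behalf.
# 'sudo -u' replacing 'sudo su -' is an assumption about local sudo policy.
cancel_as_user() {
  echo "sudo -u $1 scancel $2"
}
cancel_as_user someuser 987654   # someuser / 987654 are placeholders
```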
- Home
- /usr/arcts/systems/scripts/homequota
- Scratch
- /usr/local/bin/scratch-quota
*** Most fields might be empty, but they can always vary, so be aware of that ***
- QOS Limits
*** Few limits are applied on the QOS ***
- sacctmgr show qos
- sacctmgr show qos <qos_name>
Important QOS to know, and limits, are:
- largemem → MaxTRESPA (Maximum Trackable Resources Per Account) - 36 cores, 1500GB
- training → Priority and MaxTRESPU (Maximum Trackable Resources Per User) - 1000, 4 cores, 20GB
- gpu → MinTRES (Minimum Trackable Resource) - 1 GPU. *** User must request at least 1 GPU when submitting to the 'gpu' partition ***
- debug → MaxWall (Maximum Walltime) and MaxTRESPU (Maximum Trackable Resources Per User) - 04:00:00, 8 cores, 40GB
- standard-oc → MaxTRESPA (Maximum Trackable Resources Per Account) - 108 cores, 900GB
- class → Priority - 1000 *** This applies to each Slurm Account ending in 'class_root' ***
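When scanning several QOS at once, pipe-delimited output from `sacctmgr -P` is easier to filter than the default table. A sketch, assuming an illustrative column layout (real output has many more columns):

```shell
# Parse pipe-delimited QOS output into "name: per-user limit" lines.
# The here-doc stands in for: sacctmgr show qos -P format=Name,Priority,MaxTRESPU
awk -F'|' 'NR>1 {print $1": "$3}' <<'EOF'
Name|Priority|MaxTRESPU
debug|0|cpu=8,mem=40G
training|1000|cpu=4,mem=20G
EOF
```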
- "Cluster" Limits
*** The "cluster" limits are applied on each '_root' account ***
- sacctmgr list assoc cluster=greatlakes account=<slurm_account_root>
- sacctmgr list assoc cluster=greatlakes account=<slurm_account_root> format=Cluster%15,Account%20,GrpTRES%30 *** Where the "cluster" limits are ***
Current "Cluster" limits are:
- 500 cores, 5 GPUs, 3500GB per Root Account
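The GrpTRES field comes back as one comma-separated string; splitting it makes the limits easier to read. The string below mirrors the 500-core / 5-GPU / 3500GB root-account limit above:

```shell
# Split a GrpTRES string into one resource per line.
grptres="cpu=500,gres/gpu=5,mem=3500G"
echo "$grptres" | tr ',' '\n'
```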
- Slurm Account Limits
- sacctmgr list assoc cluster=greatlakes account=<slurm_account> | head -3 *** Limits are the first line below the headers ***
Important Slurm Accounts to know limits for:
- engin_root - 90 cores, 2 GPUs, 450GB
- lsa1 - 120 cores, no GPUs, 600GB
- lsa2 - 10 GPUs
- training - 72 cores, 10 GPUs, 360GB
- User Limits under Slurm Account
- sacctmgr list assoc cluster=greatlakes account=<slurm_account> | head -4 *** Limits are the second line below the headers, but will include a username ***
Important Slurm Accounts to know User Limits for:
- engin1 - 36 cores, 2 GPUs, 180GB
- lsa1 - 24 cores, 1 GPU, 120GB, 1440-minute GPU walltime (24 hours)
- training - MaxWall - 1 hour
1) View the queue for a specified partition
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -r <partition>
2) View the queue for a specified account
- squeue -A <slurm_account>
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -A <slurm_account>
3) View the queue for a specified user
- squeue -u <user>
- sacct -a -M greatlakes -s pending,running -o submit,user,account,priority,state,elapsed,timelimit -u <user>
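To see which pending jobs will likely start first, the output can be sorted by the priority column. A sketch using pipe-delimited (`-P`, parsable) output; the sample rows are fabricated:

```shell
# Sort pending jobs by priority, highest first. The here-doc stands in for
# pipe-delimited output like: sacct ... -o user,priority,state -P
sort -t'|' -k2,2nr <<'EOF'
alice|1200|PENDING
bob|4300|PENDING
carol|900|PENDING
EOF
```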
- Check home directory quota: home-quota
- Check scratch directory quota: scratch-quota <accountname>_root
- Check Turbo quota: /nfs/turbo/arcts-ops/Shared/Turbo/turbo2-bin/turbo-getsnaps -n <volume name>
Node state codes are shortened as required for the field size. These node states may be followed by a special character to identify state flags associated with the node. The following node suffixes and states are used:
| Suffix / State | Definition |
|---|---|
| * | The node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes). |
| ~ | The node is presently in a power saving mode (typically running at reduced frequency). |
| # | The node is presently being powered up or configured. |
| % | The node is presently being powered down. |
| $ | The node is currently in a reservation with a flag value of "maintenance". |
| @ | The node is pending reboot. |
| ALLOCATED | The node has been allocated to one or more jobs. |
| ALLOCATED+ | The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING. |
| COMPLETING | All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job's processes have terminated and the Slurm epilog program (if any) has terminated. |
| DOWN | The node is unavailable for use. Slurm can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state. If a node resumes normal operation, Slurm can automatically return it to service. |
| DRAINED | The node is unavailable for use per system administrator request. |
| DRAINING | The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes. Nodes enter this state per system administrator request. |
| FAIL | The node is expected to fail soon and is unavailable for use per system administrator request. |
| FAILING | The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request. |
| FUTURE | The node is currently not fully configured, but expected to be available at some point in the indefinite future for use. |
| IDLE | The node is not allocated to any jobs and is available for use. |
| MAINT | The node is currently in a reservation with a flag value of "maintenance". |
| REBOOT | The node is currently scheduled to be rebooted. |
| MIXED | The node has some of its CPUs ALLOCATED while others are IDLE. |
| PERFCTRS (NPC) | Network Performance Counters associated with this node are in use, rendering this node as not usable for any other jobs. |
| POWER_DOWN | The node is currently powered down and not capable of running any jobs. |
| POWERING_DOWN | The node is currently powering down and not capable of running any jobs. |
| POWER_UP | The node is currently in the process of being powered up. |
| RESERVED | The node is in an advanced reservation and not generally available. |
| UNKNOWN | The Slurm controller has just started and the node's state has not yet been determined. |
*** Possible scripts written in the future ***
- Sinfo
*** Possible partition_str: 'standard*', 'standard-oc', 'largemem', 'gpu', 'viz', 'debug' ***
- sinfo or sinfo --long
- sinfo -O partition:13,nodehost:13,nodes:9,statelong:13,socketcorethread:10,cpus:8,cpusstate:16,memory:11,allocmem:11,gres:15,reason | grep -F "<partition_str>" *** This will give better detail but will not print headers ***
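A quick way to summarize cluster health is to tally nodes per state. A sketch, assuming `sinfo -h -N -o "%T"` (one state per node, no header) as the input; the here-doc is illustrative:

```shell
# Count nodes in each state. Pipe real data in with: sinfo -h -N -o "%T"
sort <<'EOF' | uniq -c | sort -rn
allocated
allocated
idle
mixed
idle
EOF
```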
- Log onto a compute node
*** Must be from gl-build.arc-ts.umich.edu ***
- ssh <node>
- Find GPUs in use
- nvidia-smi
- nvidia-smi | grep '^| [ ]\+[0-9]'
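nvidia-smi also has a query interface that is easier to filter than its default table. A sketch listing only busy GPUs; the here-doc mimics `--query-gpu` output (index, utilization) and is fabricated:

```shell
# Show only GPUs with nonzero utilization. On a GPU node, pipe in:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
awk -F', ' '$2+0 > 0 {print "GPU "$1" busy: "$2}' <<'EOF'
0, 97 %
1, 0 %
2, 45 %
EOF
```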
These steps can be useful to diagnose a downed or otherwise non-functional compute node.
- SSH into the node in question.
- Run ps aux | grep slurmd and look at the output to see if the slurm daemon is running.
- If slurmd is not running, find someone with root access and have them start it.
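The check above can be wrapped in a small helper; `pgrep -x` matches the exact process name, which avoids the classic problem of `grep slurmd` matching its own grep process:

```shell
# Report whether a daemon is running, by exact process name.
check_daemon() {
  if pgrep -x "$1" >/dev/null 2>&1; then
    echo "$1 is running"
  else
    echo "$1 is NOT running"
  fi
}
check_daemon slurmd
```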
- SSH into the node in question.
- Run vim /etc/slurm-llnl/slurm.conf
- Look for lines starting with NodeName= and find the node in question.
- An individual compute node could be configured wrong. I once had a node stay in DRAINING state until I fixed its RealMemory value; I had it in bits when it was supposed to be in megabytes.
- Find someone with root access and have them fix the config file if you are certain it is wrong.
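To inspect one node's definition without opening the whole file, grep for its NodeName line. The config lines below are fabricated and gl3001 is a hypothetical hostname; note RealMemory is expected in megabytes, exactly the unit bug described above:

```shell
# Pull one node's definition from slurm.conf. On the node itself:
#   grep '^NodeName=<node>' /etc/slurm-llnl/slurm.conf
grep '^NodeName=gl3001' <<'EOF'
NodeName=gl3001 CPUs=36 RealMemory=184320 State=UNKNOWN
NodeName=gl3002 CPUs=36 RealMemory=184320 State=UNKNOWN
EOF
```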
- SSH into the node in question.
- Run journalctl -xe
- Type /slurmd and press the Enter key, then use n and N to browse through the lines containing slurmd.
- Any number of things could be in this log. It's up to you to interpret those messages and document your findings here if they prove useful.
These steps can be useful to diagnose a downed or otherwise non-functional compute node.
- Run scontrol show node <hostname>
- The output will contain lines like State=DOWN and Reason=Node unexpectedly rebooted [root@2019-06-21T15:12:40]
- State and Reason are useful for diagnosing a non-functional node. If you don't understand them, you can at least paste those lines into Slack and someone else will likely understand.
- If you somehow have root access, you can bring a DOWN node back up by running scontrol update nodename=<hostname> state=resume
- The state of that node will change to State=IDLE and Reason will be cleared.
If a user requests to be removed from our "mailing lists" (this happens quite often when maintenance updates are sent through Footprints), they can only be removed from the mailing list if we also remove their account.
First, get CONFIRMATION from the user that we can deactivate their Flux User Account.
This is because each user is required to be on these mailing lists so we can send maintenance updates and other news. So if they want to be taken off the mailing lists, they can't have a User Account.
If the user does not have a valid standing with the University, we will go ahead and remove their account without their consent.
Regardless of whether there is data in the user's home directory, archive it.
You must be on gl-build to do this:
[gl-build]$ sudo /bin/tar czf /nfs/locker/oldhomedirs/userstodelete/username.tar.gz /home/username
Switch to nyxb to deactivate the User Account by running the following command (the flag used looks like a lowercase "L", but it's a capital "i"):
[nyxb]$ sudo /opt/mam/bin/mam-modify-user -I username
Now delete the user by running the userdel command; the -r option says to delete the home directory (and mail files).
[nyxb]$ sudo /usr/sbin/userdel -r username
Run updateidmgr.sh and passsync.pl
[nyxb]$ sudo /opt/moab/scripts/updateidmgr.sh
[nyxb]$ sudo /usr/arcts/systems/scripts/passsync.pl
That should remove the user from Flux.
To reactivate a user, create the user as if they were brand new. All of the components needed for creation are needed for reactivation.
Check to see whether there is an archived home directory. If there is, make sure that you are in the root directory, because /home is part of the path for the home directory being restored.
[nyxb]$ cd /
[nyxb]$ sudo tar xzvf /nfs/locker/flux-support/userstodelete/username.tar.gz
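The reason for cd-ing to / first is that tar stores the home directory under the relative path home/<user> (GNU tar strips the leading '/'), so extraction lands back in /home only when run from the root. A self-contained demo; all files here are fabricated in a temp directory:

```shell
# Demonstrate that the archive members use the relative path home/<user>,
# which is why a restore must be run from the root directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/home/testuser"
echo "data" > "$tmp/home/testuser/file.txt"
tar -C "$tmp" -czf "$tmp/testuser.tar.gz" home/testuser
tar tzf "$tmp/testuser.tar.gz"   # lists members with relative paths
rm -rf "$tmp"                    # clean up the demo files
```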