slurm commands - raeker/ARC-Wiki-Test GitHub Wiki
This page is a work in progress. Relevant commands will be placed here, and the page will be reorganized once a more complete list is compiled.
Submit a job - sbatch <scriptname>
Submit an interactive job - salloc <options>
- A full list of options can be found in the man page for salloc
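As a rough sketch of what `sbatch` expects, the snippet below writes a minimal batch script and submits it if `sbatch` is on the PATH. The file name `hello.sbatch`, the job name, and the resource values are placeholders chosen for illustration, not site defaults; a real job will usually also need a partition and an account.

```shell
# Write a minimal batch script (file name and resource values are illustrative assumptions).
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out

# %j in the output file name expands to the job ID.
echo "Job ${SLURM_JOB_ID:-unknown} running on $(hostname)"
EOF

# Submit the script only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello.sbatch
fi
```

The same `#SBATCH` options can also be given on the command line, for example `sbatch --time=00:10:00 hello.sbatch`, in which case the command-line value takes precedence over the one in the script.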
Delete a job - scancel <jobID>
Delete all jobs for a user - scancel -u <user>
Cancel an indexed job in an array - scancel <jobID>_<index>
Show all queued jobs - squeue --all
- By partition - squeue -p <partition_name>
- By account - squeue -A <account_name>
- By user - squeue -u <user>
- By node - squeue -w <nodeID>
- By job - squeue -j <jobID>
Show job details - scontrol show job <jobID>
Show job resources - sacct -j <jobID> -l
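The long `sacct` listing is wide, so it can help to pick a few fields and post-process them. The sketch below is one possible approach: a hypothetical helper, `peak_rss_mib`, reads pipe-delimited `sacct` output and prints the largest MaxRSS value in MiB. It assumes the fourth field is MaxRSS with a K, M, or G suffix interpreted as powers of 1024; values without one of those suffixes are ignored. MaxRSS is typically empty on the allocation line and filled in on the step lines.

```shell
# Print the largest MaxRSS value, in MiB, from pipe-delimited sacct output on stdin.
# Assumption: field 4 is MaxRSS with a K, M, or G suffix (powers of 1024).
peak_rss_mib() {
    awk -F'|' '
        $4 != "" {
            value = $4 + 0                      # numeric part, e.g. 204800 from "204800K"
            unit = substr($4, length($4), 1)    # trailing unit letter
            if (unit == "K")      mib = value / 1024
            else if (unit == "M") mib = value
            else if (unit == "G") mib = value * 1024
            else next                           # unrecognized suffix: skip this line
            if (mib > peak) peak = mib
        }
        END { printf "%.1f\n", peak }
    '
}

# Usage on a cluster (-n drops the header, -P selects pipe-delimited output):
#   sacct -j <jobID> -n -P --format=JobID,State,Elapsed,MaxRSS | peak_rss_mib
```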
Hold a job - scontrol hold <jobID>
Release a held job - scontrol release <jobID>
Suspend a job - scontrol suspend <jobID>
Resume a suspended job - scontrol resume <jobID>
Display job on a node - squeue -w <nodeID>
Estimate when a job would run without submitting it - sbatch --test-only <scriptname>
List all pending jobs for a user - squeue -u <user> -t PENDING
- The -t option accepts any other job state in place of PENDING
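Filtering by state can be combined with a quick tally. As a small sketch, the hypothetical helper `count_by_state` below counts how many jobs are in each state, given one state name per line on standard input.

```shell
# Count jobs per state from a list of state names, one per line on stdin.
# Output is one "STATE=count" line per state, sorted by state name.
count_by_state() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Usage on a cluster (-h drops the header, -o %T prints the long state name):
#   squeue -h -u <user> -o %T | count_by_state
```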
Display jobs on a specific account - sacct -A <accountname>
Show all jobs for a user - sacct -u <user> OR squeue -u <user>
Show job with a specific state - sacct -s <state> (refer to the list below for a full list of job states)
List all jobs in a partition for a user - squeue -u <user> -p <partitionname>
Show the expected start time for a job - squeue --start -j <jobID>
Show all jobs on an account - squeue -A <accountname>
Show generic partition info - sinfo
Show specific partition info - scontrol show partition <partition_name> (To view all partitions leave off the partition name)
Show node details - scontrol show node <nodeID> (To view all nodes leave off the node id)
Show list of down nodes - sinfo -RlN (lower case L)
List the reasons for nodes being down - sinfo --list-reasons
Show resource usage for a running job - sstat -j <jobID> (not very accurate)
Show what jobs are running on a partition - sacct -r <partition_name> (job steps are listed by default; adding -X shows only the job allocations)
Show what jobs are running on a node - sacct -N <nodeID> (job steps are listed by default; adding -X shows only the job allocations)
Add a user to a billing account - sacctmgr add user <user> account=<account>
List users belonging to a billing account - sacctmgr show assoc -p account=<account>
List description and organization for account - sacctmgr show account <account>
List default account for user - sacctmgr show user <user>
List all accounts for a user - sacctmgr show assoc cluster=<cluster> user=<user>
Create account - sacctmgr add account <account> description="<description>" organization=<org> parent=<parent_account>
Add a new slurm user - sacctmgr create user name=<user> defaultaccount=<default_account>
Remove user from billing account - sacctmgr remove user <user> where account=<account> cluster=<cluster_name>
Show current usage of account resources - sreport cluster accountutilizationbyuser tree
Show each user's usage of the cluster - sreport cluster userutilizationbyaccount
Show top CPU usage for root accounts - sreport -T Billing -v cluster AccountUtilizationByUser | sort -n -k 4 | grep "_root" | tail -10
Job states:
- CA=CANCELLED
  - Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
- CD=COMPLETED
  - Job has terminated all processes on all nodes.
- CF=CONFIGURING
  - Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
- CG=COMPLETING
  - Job is in the process of completing. Some processes on some nodes may still be active.
- F=FAILED
  - Job terminated with non-zero exit code or other failure condition.
- NF=NODE_FAIL
  - Job terminated due to failure of one or more allocated nodes.
- PD=PENDING
  - Job is awaiting resource allocation.
- PR=PREEMPTED
  - Job terminated due to preemption.
- R=RUNNING
  - Job currently has an allocation.
- S=SUSPENDED
  - Job has an allocation, but execution has been suspended.
- TO=TIMEOUT
  - Job terminated upon reaching its time limit.
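When scripting around `squeue` or `sacct` output, the short codes can be translated back to their long names. The helper below is a minimal sketch covering only the codes listed on this page; the function name `state_name` is made up for illustration.

```shell
# Translate a short job state code into its long name (codes from the list above).
# Prints an error to stderr and returns 1 for any other code.
state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo "unknown state code: $1" >&2; return 1 ;;
    esac
}
```

For example, `state_name PD` prints `PENDING`, which can be passed to `squeue -u <user> -t`.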