Advanced Research Computing : Slurm Commands

This page is a work in progress. Relevant commands will be added here, and the page will be reorganized once a complete list has been compiled.

Jobs

Submit a job - sbatch <scriptname>
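
As a minimal sketch of what a submission might look like, the snippet below writes a small batch script and submits it only when sbatch is available. The file name hello_job.sh, the job name, and the resource values are illustrative assumptions, not site defaults.

```shell
#!/bin/bash
# Write a small batch script. The file name and all #SBATCH values are
# illustrative assumptions; adjust them to your site's partitions and limits.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out

echo "Running on $(hostname)"
EOF

# Submit only when sbatch is on the PATH (i.e. on a cluster login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh
else
    echo "sbatch not found; run this on a cluster login node" >&2
fi
```

The %j in the output file name is replaced by the job ID, so each run writes to its own log file.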

Submit an interactive job - salloc <options>

A full list of options can be found in the man page for salloc

Delete a job - scancel <jobID>

Delete all jobs for a user - scancel -u <user>

Cancel an indexed job in an array - scancel <jobID>_<index>
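
scancel needs the job ID that sbatch printed at submission time. The helper below is a small sketch for pulling that ID out of the sbatch output so it can be passed on to scancel; the function name job_id_from_output is hypothetical, not part of Slurm.

```shell
# Extract the job ID from sbatch output.
# Handles the default message ("Submitted batch job 12345") as well as
# --parsable output ("12345" or "12345;cluster").
job_id_from_output() {
    local out="$1"
    out="${out##* }"    # keep only the last space-separated word
    out="${out%%;*}"    # drop an optional ";cluster" suffix
    printf '%s\n' "$out"
}
```

On a cluster this might be used as jobid=$(job_id_from_output "$(sbatch <scriptname>)") followed by scancel "$jobid".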

Show all queued jobs - squeue --all

By partition - squeue -p <partition_name>
By account - squeue -A <account_name>
By user - squeue -u <user>
By node - squeue -w <nodeID>
By job - squeue -j <jobID>
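
The filters above can be combined with squeue's output formatting to get a quick summary. The sketch below counts a user's jobs per state code; the function name count_states is hypothetical, and the squeue call is skipped when the command is not installed.

```shell
# Summarise job state codes read from stdin, one code per line,
# printing each code with its count (e.g. "R=2").
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line; -o "%t" prints only the compact state code.
    squeue -h -u "$USER" -o "%t" | count_states
fi
```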

Show job details - scontrol show job <jobID>

Show job resources - sacct -j <jobID> -l
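
The -l output is very wide. As a sketch of a narrower alternative, the wrapper below asks sacct for a handful of fields only. The field names are standard sacct format fields, but the particular selection and the function name job_summary are assumptions made for illustration.

```shell
# Standard sacct field names; which ones to show is an illustrative choice.
SACCT_FIELDS="JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode"

# Print a compact accounting summary for one job ID.
job_summary() {
    # -P prints pipe-delimited output, which is easier to parse
    # than the default padded table.
    sacct -j "$1" --format="$SACCT_FIELDS" -P
}
```

On a cluster, job_summary <jobID> would print one line per job step, each with the selected fields.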

Hold a job - scontrol hold <jobID>

Release a held job - scontrol release <jobID>

Suspend a job - scontrol suspend <jobID>

Resume a suspended job - scontrol resume <jobID>

Display job on a node - squeue -w <nodeID>

To test a job to see when it will run - sbatch --test-only <scriptname>

List all pending jobs for a user - squeue -u <user> -t PENDING

Any other job state can be substituted for PENDING (see the list of job states below).

Display jobs on a specific account - sacct -A <accountname>

Show all jobs for a user - sacct -u <user> OR squeue -u <user>

Show job with a specific state - sacct -s <state> (refer to the list below for a full list of job states)

List all jobs in a partition for a user - squeue -u <user> -p <partitionname>

Show the expected start time for a job - squeue --start -j <jobID>

Show all jobs on an account - squeue -A <accountname>

Nodes and Partitions

Show generic partition info - sinfo

Show specific partition info - scontrol show partition <partition_name> (To view all partitions leave off the partition name)

Show node details - scontrol show node <nodeID> (To view all nodes leave off the node id)

Show list of down nodes - sinfo -RlN (lower case L)

List the reasons for nodes being down - sinfo --list-reasons

Show resource usage - sstat -j <jobID> (running jobs only; the figures are not very accurate)

Show what jobs are running on a partition - sacct -r <partition_name> -s RUNNING (adding -X hides job steps and shows allocations only)

Show what jobs are running on a node - sacct -N <nodeID> (adding -X hides job steps and shows allocations only)

Users and Accounts

Add a user to a billing account - sacctmgr add user <user> account=<account>

List users belonging to a billing account - sacctmgr show assoc -p account=<account>

List description and organization for account - sacctmgr show account <account>

List default account for user - sacctmgr show user <user>

List all accounts for a user - sacctmgr show assoc cluster=<cluster> user=<user>

Create account - sacctmgr add account <account> description="<description>" organization=<org> parent=<parent_account>

Add a new slurm user - sacctmgr create user name=<user> defaultaccount=<default_account>

Remove user from billing account - sacctmgr remove user <user> where account=<account> cluster=<cluster_name>

Show current usage of account resources - sreport cluster accountutilizationbyuser tree

Show each user's usage of the cluster - sreport cluster userutilizationbyaccount

Show top CPU usage for root accounts - sreport -T Billing -v cluster AccountUtilizationByUser | sort -n -k 4 | grep "_root" | tail -10

List of job states and the corresponding codes. You'll see these next to jobs when running squeue:

CA=CANCELLED
- Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

CD=COMPLETED
- Job has terminated all processes on all nodes.

CF=CONFIGURING
- Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).

CG=COMPLETING
- Job is in the process of completing. Some processes on some nodes may still be active.

F=FAILED
- Job terminated with non-zero exit code or other failure condition.

NF=NODE_FAIL
- Job terminated due to failure of one or more allocated nodes.

PD=PENDING
- Job is awaiting resource allocation.

PR=PREEMPTED
- Job terminated due to preemption.

R=RUNNING
- Job currently has an allocation.

S=SUSPENDED
- Job has an allocation, but execution has been suspended.

TO=TIMEOUT
- Job terminated upon reaching its time limit.
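
When scripting around squeue output, it can help to translate the compact codes above back into their full names. The function below is a sketch covering only the states listed on this page; the name state_name is hypothetical, and Slurm defines further states that are not handled here.

```shell
# Map a compact squeue state code to its full name.
# Covers only the codes listed on this page; anything else is reported
# as UNKNOWN with a non-zero exit status.
state_name() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CF) echo "CONFIGURING" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        NF) echo "NODE_FAIL" ;;
        PD) echo "PENDING" ;;
        PR) echo "PREEMPTED" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN"; return 1 ;;
    esac
}
```

For example, state_name PD prints PENDING, and the resulting name can be passed to squeue -t or sacct -s.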
