slurm commands - raeker/ARC-Wiki-Test GitHub Wiki

Advanced Research Computing : Slurm Commands

This page is a work in progress. Relevant commands will be added here, and the page will be reorganized once a complete list is compiled.

Jobs

Submit a job - sbatch <scriptname>
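As an illustration of what sbatch expects, the sketch below writes a minimal batch script to a file. The file name example_job.sh and every #SBATCH value (job name, task count, time limit, output pattern) are placeholders chosen for this example, not site defaults; many clusters also require an account or partition option.

```shell
#!/bin/bash
# Write a minimal batch script. The heredoc delimiter is quoted so that
# $(hostname) is evaluated when the job runs, not when the file is written.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=example_%j.out

echo "Running on $(hostname)"
EOF

# Print the submission command instead of running it, so this sketch
# is harmless on a machine without Slurm.
echo "Submit with: sbatch example_job.sh"
```

In the output file name, %j is replaced by the job ID when the job starts.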

Submit an interactive job - salloc <options>

  • A full list of options can be found in the man page for salloc

Delete a job - scancel <jobID>

Delete all jobs for a user - scancel -u <user>

Cancel an indexed job in an array - scancel <jobID>_<index>
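To cancel several array indices at once, the <jobID>_<index> form can be generated in a loop. The helper below is a hypothetical sketch that only prints the commands; the job ID 20 and the indices 4 and 6 are made-up values.

```shell
# Print (not run) one scancel command per array index.
# Usage: array_cancel_cmds <jobID> <index>...
array_cancel_cmds() (
    job_id=$1
    shift
    for idx in "$@"; do
        echo "scancel ${job_id}_${idx}"
    done
)

array_cancel_cmds 20 4 6
```

Slurm's job array documentation also describes a range form, scancel <jobID>_[1-3], which may be more convenient for contiguous indices.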

Show all queued jobs - squeue --all

  • By partition - squeue -p <partition_name>
  • By account - squeue -A <account_name>
  • By user - squeue -u <user>
  • By node - squeue -w <nodeID>
  • By job - squeue -j <jobID>

Show job details - scontrol show job <jobID>

Show job resources - sacct -j <jobID> -l
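The long listing from sacct is wide, so it can help to request only a few fields in a machine-readable form and filter them. The sketch below assumes the field list JobID,State,Elapsed; the function name not_completed is hypothetical, and the three input rows are made-up stand-ins for real sacct output.

```shell
# Print the ID and state of every job or step whose state is not COMPLETED.
# Input: pipe-delimited rows of JobID|State|Elapsed on stdin.
not_completed() {
    awk -F'|' '$2 != "COMPLETED" {print $1 " " $2}'
}

# On a cluster the input would come from sacct, for example:
#   sacct -j <jobID> -n -P --format=JobID,State,Elapsed | not_completed
# Made-up rows standing in for that output:
printf '1234|COMPLETED|00:10:02\n1234.batch|COMPLETED|00:10:02\n1234.0|FAILED|00:00:03\n' | not_completed
```

Here -n suppresses the header and -P selects pipe-delimited output without a trailing delimiter.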

Hold a job - scontrol hold <jobID>

Release a held job - scontrol release <jobID>

Suspend a job - scontrol suspend <jobID>

Resume a suspended job - scontrol resume <jobID>

Display jobs on a node - squeue -w <nodeID>

Test a job to see when it will run - sbatch --test-only <scriptname>

List all pending jobs for a user - squeue -u <user> -t PENDING

  • The -t option accepts any job state (see the list of job states at the end of this page)
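For a quick overview across states, the state column of squeue can be tallied with standard tools. This is a minimal sketch: count_states is a hypothetical helper, and the three input lines are made-up stand-ins for squeue output.

```shell
# Count jobs per state from a list of state names on stdin (one per line).
count_states() {
    sort | uniq -c | awk '{print $2 "=" $1}'
}

# On a cluster the input would come from squeue, for example:
#   squeue -u <user> -h -o "%T" | count_states
# Made-up states standing in for that output:
printf 'PENDING\nRUNNING\nRUNNING\n' | count_states
```

The -h option drops the header line and -o "%T" prints only the full state name of each job.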

Display jobs on a specific account - sacct -A <accountname>

Show all jobs for a user - sacct -u <user> OR squeue -u <user>

Show job with a specific state - sacct -s <state> (refer to the list below for a full list of job states)

List all jobs in a partition for a user - squeue -u <user> -p <partitionname>

Show the expected start time for a job - squeue --start -j <jobID>

Show all jobs on an account - squeue -A <accountname>

Nodes and Partitions

Show generic partition info - sinfo

Show specific partition info - scontrol show partition <partition_name> (To view all partitions, leave off the partition name)

Show node details - scontrol show node <nodeID> (To view all nodes, leave off the node ID)

Show list of down nodes - sinfo -RlN (lower case L)

List the reasons for nodes being down - sinfo --list-reasons

Show resource usage of a running job - sstat -j <jobID> (not very accurate)

Show what jobs are running on a partition - sacct -r <partition_name> (job steps are listed by default; adding -X limits the output to job allocations)

Show what jobs are running on a node - sacct -N <nodeID> (job steps are listed by default; adding -X limits the output to job allocations)

Users and Accounts

Add a user to a billing account - sacctmgr add user <user> account=<account>

List users belonging to a billing account - sacctmgr show assoc -p account=<account> 
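The association listing contains one row per association, so a user can appear more than once, and the account's own row has an empty user field. A minimal sketch for reducing it to a unique list of users follows; assoc_users is a hypothetical helper, the format= list is an assumption about which columns are requested, and the input rows are made up.

```shell
# Extract the unique, non-empty user names from pipe-delimited rows.
# Column 1 is assumed to be User (see the format= list in the comment below).
assoc_users() {
    awk -F'|' '$1 != "" {print $1}' | sort -u
}

# On a cluster the input would come from sacctmgr, for example:
#   sacctmgr -n -P show assoc account=<account> format=User,Account | assoc_users
# Made-up rows standing in for that output:
printf '|physics\nalice|physics\nbob|physics\nalice|physics\n' | assoc_users
```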

List description and organization for account - sacctmgr show account <account> 

List default account for user - sacctmgr show user <user> 

List all accounts for a user - sacctmgr show assoc cluster=<cluster> user=<user>

Create account - sacctmgr add account <account> description="<description>" organization=<org> parent=<parent_account>

Add a new Slurm user - sacctmgr create user name=<user> defaultaccount=<default_account>

Remove user from billing account - sacctmgr remove user <user> where account=<account> cluster=<cluster_name>

Show current usage of account resources - sreport cluster accountutilizationbyuser tree 

Show each user's usage of the cluster - sreport cluster userutilizationbyaccount 

Show top CPU usage for root accounts - sreport -T Billing -v cluster AccountUtilizationByUser | sort -n -k 4 | grep "_root" | tail -10

List of job states and the corresponding codes. You'll see these next to jobs when running squeue:

  • CA=CANCELLED
    • Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
  • CD=COMPLETED
    • Job has terminated all processes on all nodes.
  • CF=CONFIGURING
    • Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
  • CG=COMPLETING
    • Job is in the process of completing. Some processes on some nodes may still be active.
  • F=FAILED
    • Job terminated with non-zero exit code or other failure condition.
  • NF=NODE_FAIL
    • Job terminated due to failure of one or more allocated nodes.
  • PD=PENDING
    • Job is awaiting resource allocation.
  • PR=PREEMPTED
    • Job terminated due to preemption.
  • R=RUNNING
    • Job currently has an allocation.
  • S=SUSPENDED
    • Job has an allocation, but execution has been suspended.
  • TO=TIMEOUT
    • Job terminated upon reaching its time limit.
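When scripting around squeue output, the codes above can be translated back to their full names with a small lookup. The function below is a hypothetical helper covering only the states listed on this page.

```shell
# Map a squeue state code to its full state name.
# Returns non-zero for codes that are not in the list above.
state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo UNKNOWN; return 1 ;;
    esac
}

state_name PD
```

On a cluster, the codes could be read from squeue -h -o "%t", which prints the compact state code of each job.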
