Slurm Workload Manager

The task of a workload manager on an HPC system is to control the access to (compute) resources and distribute "work" to these resources.

Basic Workflow

After login, a user submits jobs from the login nodes to the workload manager. A job is an application together with a resource description for this concrete run of the application. The resource description tells the workload manager which resources the run requires, e.g., the number of compute nodes, the number of cores per compute node, the memory requirement, or the time limit. After submission, a job is placed in the scheduling queue and waits for the specified resources to become available. Once the resources are available, the job is placed on the allocated compute nodes and starts running. After job completion, the allocated resources become available again for the next job.

In summary, the workflow is as follows:

  1. Prepare your job on the login nodes
    • Compile your application or load the necessary software modules
    • Transfer input data to the cluster
    • Provide a resource description
  2. Submit job to the scheduler
  3. Wait until job is scheduled, i.e., drink a coffee
  4. Wait until job is finished, i.e., drink even more coffee
  5. Examine results

Interacting with the Slurm Workload Manager

A user can interact with the workload manager using several commands:

  1. sinfo: View information about nodes and partitions
  2. squeue: View information about jobs in the scheduling queue
  3. srun: Run parallel jobs
  4. sbatch: Submit a batch script
  5. scancel: Cancel jobs
  6. scontrol: View (and modify) configurations and workload manager state
  7. sacct: Display job history
  8. sacctmgr: View accounting information
  9. sprio: View the factors that comprise a job's priority

All these commands have a plethora of options. Information about these options can either be obtained by invoking the command with the --help option, e.g., sinfo --help, or by consulting the respective man pages, e.g., man sinfo. The man page can be exited by pressing q.

sinfo - Viewing information about nodes and partitions

The compute nodes of a cluster are grouped into logical collections, called partitions. The sinfo command can be used to view information about these partitions:

>$ sinfo

WORKQ        up 21-00:00:0      2 drain* compute-5-0-[24,28]
WORKQ        up 21-00:00:0      4   resv compute-2-0-[2-3,6,9]
WORKQ        up 21-00:00:0      9    mix compute-2-0-[11,13,20-23,26-27],compute-5-0-25
WORKQ        up 21-00:00:0     14  alloc compute-2-0-[1,4-5,7-8,10,12,14-19,25]
WORKQ        up 21-00:00:0     10   idle compute-2-0-24,compute-5-0-[17-23,26-27]
TEST         up      30:00      1  down* test-2-0-43
EPIC         up 7-00:00:00      6    mix compute-3-0-[1-4,7-8]
EPIC         up 7-00:00:00      1  alloc compute-3-0-5
EPIC2        up 7-00:00:00     17    mix compute-4-0-[1-15,17-18]
EPIC2        up 7-00:00:00      1  alloc compute-4-0-16
EPIC2        up 7-00:00:00      1   idle compute-4-0-19
EPICALL      up 7-00:00:00     23    mix compute-3-0-[1-4,7-8],compute-4-0-[1-15,17-18]
EPICALL      up 7-00:00:00      2  alloc compute-3-0-5,compute-4-0-16
EPICALL      up 7-00:00:00      1   idle compute-4-0-19
V100         up      30:00      2    mix compute-5-0-[3-4]
V100         up      30:00      3   idle compute-5-0-[1-2,5]
V100-IDI     up 7-00:00:00      5    mix compute-5-0-[3-4,33-35]
V100-IDI     up 7-00:00:00      3   idle compute-5-0-[1-2,5]
STORAGE      up 1-00:00:00      2   idle idun-samba[1-2]

It prints the following columns:

  1. PARTITION: The partition name.
  2. AVAIL: Availability of a partition. A partition can either be up or down.
  3. TIMELIMIT: Maximum time limit for any user job in days-hours:minutes:seconds.
  4. NODES: Number of nodes in the partition that are in the respective state.
  5. STATE: State of the respective nodes:
    • idle: The nodes are not running any jobs and are available for use.
    • alloc: All CPUs of the nodes are running jobs.
    • mix: The nodes have some of their CPUs allocated for jobs, while others are idle.
    • resv: The nodes are part of a reservation and are not available for use.
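    • drain: The nodes accept no new jobs, usually at the request of a system administrator.
    • down: The nodes are unavailable for use, e.g., due to a failure.
    • A trailing * (as in drain* or down*) indicates that the nodes are not responding.
  6. NODELIST: The names of the nodes in this partition that are in the respective state.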
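
Because sinfo prints one row per partition and state, the columns described above are easy to post-process. The following is a minimal sketch, not an official tool: the helper name idle_nodes_per_partition is made up for illustration, and it assumes the six-column layout shown above. It sums the NODES column of all rows whose STATE is exactly idle. Note that nodes shared between partitions (e.g., EPIC2 and EPICALL) are counted once per partition.

```shell
# Hypothetical helper: sum the NODES column (4th field) of all rows whose
# STATE column (5th field) is exactly "idle", grouped by partition.
# Reads sinfo-style rows on stdin:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_nodes_per_partition() {
    awk '$5 == "idle" { idle[$1] += $4 }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Demo on a few rows copied from the listing above.
# On a cluster this would be: sinfo --noheader | idle_nodes_per_partition
idle_nodes_per_partition <<'EOF'
WORKQ        up 21-00:00:0     14  alloc compute-2-0-[1,4-5,7-8,10,12,14-19,25]
WORKQ        up 21-00:00:0     10   idle compute-2-0-24,compute-5-0-[17-23,26-27]
EPIC2        up 7-00:00:00      1   idle compute-4-0-19
V100         up      30:00      2    mix compute-5-0-[3-4]
V100         up      30:00      3   idle compute-5-0-[1-2,5]
EOF
```

For the sample rows, the sketch reports 1 idle node for EPIC2, 3 for V100, and 10 for WORKQ.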

squeue - Viewing information about jobs in the scheduling queue

After a job is submitted, it is placed in the scheduling queue. The squeue command enables a user to view information about jobs in this queue:

>$ squeue --long
681513_[966-999%30 EPICALL,V  flatspin   johannj  PENDING        0:00   3-00:00:00      1  (JobArrayTaskLimit)
            682821      EPIC      bash  andrhaal  PENDING        0:00      1:30:00      1  (Resources)
            682503     WORKQ   wca_mix  sondresc  RUNNING     5:43:09   1-16:00:00      1  compute-2-0-5
            681674     WORKQ  complexi  davidore  RUNNING    23:09:50   3-00:00:00      1  compute-2-0-26
            681673     WORKQ  complexi  davidore  RUNNING    23:10:02   3-00:00:00      1  compute-2-0-27
            681949     WORKQ      rank   johannj  RUNNING    17:34:15   7-00:00:00      1  compute-2-0-15
            682767   EPICALL  gpu-CNN-  halvorbm  RUNNING     1:32:20   7-00:00:00      1  compute-3-0-5
          680648_3     WORKQ  tmp.OsU1  kjkuiper  RUNNING  1-23:22:47   5-00:00:00      1  compute-2-0-11
            682498     WORKQ  Converge   corinnn  RUNNING     5:45:42   1-00:00:00      1  compute-2-0-4
            682490     WORKQ  Converge   corinnn  RUNNING     5:50:23   1-00:00:00      1  compute-2-0-1
          680637_3     WORKQ  tmp.8WPu  kjkuiper  RUNNING  2-00:11:42   5-00:00:00      1  compute-5-0-25
            681669     WORKQ  field500  andrbols  RUNNING    23:12:53  10-00:00:00      1  compute-2-0-27
        681513_809   EPICALL  flatspin   johannj  RUNNING     4:51:17   3-00:00:00      1  compute-3-0-4
        681513_874   EPICALL  flatspin   johannj  RUNNING     2:18:46   3-00:00:00      1  compute-4-0-6
        681513_965  V100-IDI  flatspin   johannj  RUNNING     1:01:38   3-00:00:00      1  compute-5-0-34
        681513_955  V100-IDI  flatspin   johannj  RUNNING     1:04:08   3-00:00:00      1  compute-5-0-33
        681513_953   EPICALL  flatspin   johannj  RUNNING     1:06:08   3-00:00:00      1  compute-4-0-13
        681513_937  V100-IDI  flatspin   johannj  RUNNING     1:24:09   3-00:00:00      1  compute-5-0-35
        681513_935  V100-IDI  flatspin   johannj  RUNNING     1:29:39   3-00:00:00      1  compute-5-0-35
        681513_922   EPICALL  flatspin   johannj  RUNNING     1:30:07   3-00:00:00      1  compute-4-0-7
        681513_930  V100-IDI  flatspin   johannj  RUNNING     1:30:07   3-00:00:00      1  compute-5-0-35
        681513_933  V100-IDI  flatspin   johannj  RUNNING     1:30:07   3-00:00:00      1  compute-5-0-3
            681476   EPICALL   CDD-Gr4  jimeling  RUNNING  1-02:57:28   4-00:00:00      1  compute-4-0-16
            682678     WORKQ  RE_Di_Ni  jimeling  RUNNING     2:46:54   3-00:00:00      1  compute-2-0-25
            681577     WORKQ  chnTur_3   janniks  RUNNING  1-00:10:39  14-00:00:00      1  compute-2-0-8
            682823     WORKQ  nu_gamma   julihag  RUNNING       32:14     23:59:00      4  compute-2-0-[20-23]
            682668      EPIC  Rainbow     tylerm  RUNNING     3:11:10     16:00:00      1  compute-3-0-7
            682664      EPIC  Wavenet     tylerm  RUNNING     3:23:10     10:00:00      2  compute-3-0-[2-3]
            682765     WORKQ  xygradpa  siljekre  RUNNING     1:41:46      5:00:00      1  compute-2-0-19
            679107     EPIC2  detectio   aminabo  RUNNING  3-20:33:26   7-00:00:00      1  compute-4-0-18
            681461     EPIC2     kitti   giusepb  RUNNING  1-04:19:34   7-00:00:00      1  compute-4-0-2
            678241     EPIC2    hrnet1   aminabo  RUNNING  4-04:54:52   7-00:00:00      1  compute-4-0-10
            678238     EPIC2     depth   aminabo  RUNNING  4-05:37:28   7-00:00:00      1  compute-4-0-8
          677767_1     WORKQ      MCrc    pierom  RUNNING  5-04:00:23   9-00:00:00      1  compute-2-0-27
          677297_1     WORKQ      MCfn    pierom  RUNNING  5-17:02:14  12-00:00:00      1  compute-2-0-27
          677297_7     WORKQ      MCfn    pierom  RUNNING  5-17:02:14  12-00:00:00      1  compute-2-0-27

It prints the following columns:

  1. JOBID: The id of the job. It uniquely identifies a job.
  2. PARTITION: The partition the job was submitted to.
  3. NAME: Job name.
  4. USER: The user that submitted the job.
  5. STATE: The state a job is currently in:
    • RUNNING: The job is running.
    • PENDING: The job is waiting for resources.
    • COMPLETED: The job finished on all compute nodes.
    • FAILED: The job terminated with a failure.
    • CANCELLED: The job was explicitly cancelled by the user or system administrator.
    • TIMEOUT: The job reached its time limit and was terminated.
    • ...
  6. TIME: The job's time spent running.
  7. TIME_LIMIT: The job's time limit.
  8. NODES: The number of nodes allocated to the job.
  9. NODELIST(REASON):
    • If the job is RUNNING: It shows the allocated nodes of the job.
    • If the job is PENDING: It shows the reason why the job is waiting for execution:
      • Resources: The job is waiting for resources to become available.
      • PartitionTimeLimit: The job's specified time limit exceeds the partition's time limit.
      • JobLaunchFailure: The job could not be launched.
      • ...
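
To see at a glance why jobs are waiting, the STATE column and the last column of the --long listing can be combined. The sketch below is only an illustration: the helper name pending_reasons is an assumption, and it relies on the nine-column layout shown above, so job names containing spaces would shift the columns.

```shell
# Hypothetical helper: print the job id and the wait reason of every
# PENDING job. Reads "squeue --long"-style rows on stdin; the state is
# the 5th field and the reason is the last field, wrapped in parentheses.
pending_reasons() {
    awk '$5 == "PENDING" {
             reason = $NF
             gsub(/[()]/, "", reason)   # strip the surrounding parentheses
             print $1, reason
         }'
}

# Demo on rows copied from the listing above.
# On a cluster this would be: squeue --long --noheader | pending_reasons
pending_reasons <<'EOF'
681513_[966-999%30 EPICALL,V  flatspin   johannj  PENDING        0:00   3-00:00:00      1  (JobArrayTaskLimit)
            682821      EPIC      bash  andrhaal  PENDING        0:00      1:30:00      1  (Resources)
            682503     WORKQ   wca_mix  sondresc  RUNNING     5:43:09   1-16:00:00      1  compute-2-0-5
EOF
```

For the sample rows, this prints the two pending jobs with the reasons JobArrayTaskLimit and Resources.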

srun - Running a simple job

Let's submit our first job:

>$ cat first-job

#!/bin/bash
echo "This is my first job..."
sleep 10
echo "meh, we are done!"

>$ srun --partition=WORKQ first-job

srun: job 682787 queued and waiting for resources
srun: job 682787 has been allocated resources
This is my first job...
meh, we are done!

The srun command takes the resource description, the executable, and the arguments to the executable (not present in the example above) on the command line. The submitted job then receives an id (682787), waits for resources to become available, and finally executes the application.

The resource description in the above example consists only of the following statement:

  • --partition=<partitionname>: The partition from which the compute nodes for the application are selected.

The srun command enables users to run jobs, but has the following limitations:

  1. It prints the output of the application directly to the terminal and blocks it until the job finishes, preventing the user from doing further work.
  2. The resource description must be specified on the command line at every job submission.

sbatch - Submitting a batch script

The sbatch command takes a batch script as an argument, transfers it to the workload manager, and immediately returns:

>$ cat

#!/bin/bash
#SBATCH --partition=WORKQ
#SBATCH --account=support
#SBATCH --nodes=2
#SBATCH --cores=1
#SBATCH --mem=20000
#SBATCH --time=1-01:00:00

module purge
module load foss/2019b

mpirun hostname

>$ sbatch

Submitted batch job 695434

>$ ls  slurm-695434.out

>$ cat slurm-695434.out

The above example submits the script using sbatch. After submission, the command returns immediately with the message Submitted batch job 695434, enabling the user to perform further tasks. The job now waits in the scheduling queue with id 695434 for resources. Once the job is running, its output is written to the file slurm-695434.out.
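
Scripts that submit jobs often need the job id, for example to name files or to pass it to other Slurm commands. A minimal sketch of extracting it from the confirmation message follows; the helper name job_id_from_sbatch is made up for illustration. Alternatively, sbatch --parsable prints the job id without the surrounding text.

```shell
# Hypothetical helper: extract the job id from sbatch's confirmation
# message ("Submitted batch job <id>"), read from stdin.
job_id_from_sbatch() {
    awk '$1 == "Submitted" && $2 == "batch" && $3 == "job" { print $4 }'
}

# Demo on the message shown above.
# On a cluster this would be: jobid=$(sbatch <script> | job_id_from_sbatch)
jobid=$(echo "Submitted batch job 695434" | job_id_from_sbatch)
echo "output file: slurm-${jobid}.out"
```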

In contrast to srun, the resource description for the application is provided in the batch script itself using #SBATCH directives. It consists of the following statements:

  1. --partition=<partitionname>: The partition from which the compute nodes are selected.
  2. --account=<accountname>: The account that is credited with the utilized resources.
  3. --nodes=<value>: The number of nodes required for the job.
  4. --cores=<value>: The number of cores per node required for the job.
  5. --mem=<value>: The memory required per node, in megabytes (MB) unless a unit suffix is given.
  6. --time=<days>-<hours>:<minutes>:<seconds>: The time limit of the job.

After the resource description, the script clears all environment modules and then loads the foss/2019b module. Finally, mpirun hostname runs on the allocated compute nodes, printing the host name of each compute node.
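
Time values in the <days>-<hours>:<minutes>:<seconds> notation are awkward to compare directly, e.g., a job's time limit against a partition's limit. The following sketch converts such a value to seconds. The function name slurm_time_to_seconds is an assumption, and only the forms that appear in this document are handled (days-hours:minutes:seconds, hours:minutes:seconds, and minutes:seconds); other forms are rejected.

```shell
# Hypothetical helper: convert a Slurm time value to seconds.
# Supported forms: D-HH:MM:SS, HH:MM:SS, and MM:SS.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; has_days = 0; t = $0
        dash = index(t, "-")
        if (dash > 0) {                 # optional leading "<days>-"
            has_days = 1
            days = substr(t, 1, dash - 1)
            t = substr(t, dash + 1)
        }
        n = split(t, f, ":")
        if (n == 3)                    secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2 && !has_days)  secs = f[1] * 60 + f[2]
        else                           exit 1   # unsupported form
        print days * 86400 + secs
    }' || { echo "unsupported time format: $1" >&2; return 1; }
}

slurm_time_to_seconds 1-01:00:00   # time limit of the batch script above
slurm_time_to_seconds 30:00        # a 30-minute limit
```

The two calls print 90000 and 1800, respectively.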

scancel - Cancelling jobs

A running or pending job can be cancelled using scancel <jobid>:

>$ scancel 695434

It is also possible to cancel multiple jobs at once:

>$ scancel --state=PENDING --user=<username> --partition=WORKQ

This will cancel all pending jobs belonging to user <username> in partition WORKQ.
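
Since a filtered scancel can remove many jobs at once, it can help to look at the exact command before running it. The wrapper below is only a sketch: the function name cancel_pending, the DRY_RUN variable, and the user name alice are made up for illustration. By default it prints the scancel command instead of executing it.

```shell
# Hypothetical wrapper: cancel all pending jobs of a user in a partition.
# With DRY_RUN=1 (the default here) the command is only printed.
cancel_pending() {
    cp_user=$1
    cp_partition=$2
    # Rebuild the positional parameters as the scancel filter options.
    set -- --state=PENDING --user="$cp_user" --partition="$cp_partition"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "scancel $*"
    else
        scancel "$@"
    fi
}

# Prints the command that would be executed for the hypothetical user alice.
cancel_pending alice WORKQ
```

Running the function with DRY_RUN=0 on a cluster would then execute the printed command.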

scontrol - Viewing and modifying configurations

The scontrol command can be used to view and modify different Slurm configurations, including job, node, partition, reservation, and overall system configuration.

Viewing/Modifying job configurations

Let's assume we submit the following job script:

>$ cat test.slurm

#!/bin/bash
#SBATCH --partition=WORKQ
#SBATCH --account=support
#SBATCH --nodes=1
#SBATCH --cores=1
#SBATCH --time=30-00:00:00
#SBATCH --mem=1GB

>$ sbatch test.slurm

Submitted batch job 709391

After submission, we notice that the job is not running; the reason given is PartitionTimeLimit:

>$ squeue -u <username>

709391     WORKQ test.slu <username> PD       0:00      1 (PartitionTimeLimit)

Looking at the job's properties, we see that our time limit is set to 30 days while the WORKQ partition's time limit is 21 days:

>$ scontrol show job 709391

JobId=709391 JobName=test.slurm
   UserId=<username>(22700) GroupId=fidi(13730) MCS_label=N/A
   Priority=3112878 Nice=0 Account=support QOS=highest
   JobState=PENDING Reason=PartitionTimeLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2020-02-28T09:08:44 EligibleTime=2020-02-28T09:08:44
   StartTime=2020-02-28T09:08:44 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=WORKQ AllocNode:Sid=idun-login1:142678
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

>$ sinfo

WORKQ        up 21-00:00:0      7    mix compute-2-0-[13,26],compute-5-0-[17-18,22,24-25]

Instead of cancelling the job and resubmitting it, we can adjust its properties as long as it is pending:

>$ scontrol update jobid=<jobid> timelimit=1-00:00:00

>$ squeue -u <username>

709391     WORKQ test.slu <username>  R       0:12      1 compute-2-0-26

Slurm permits adjusting many job properties. The most useful are:

  • account: The account of a job.
  • arraytaskthrottle: The throttle value of an array job.
  • dependency: The job's dependencies on other jobs, e.g., that it may start only after another job has completed.
  • excnodelist: A list of nodes that are excluded from job execution.
  • features: The node features (constraints) that the job requires.
  • gres: The generic resources of a job, e.g. GPUs.
  • minmemorynode: The job's memory per node.
  • jobname: The job's name.
  • numcpus: The job's number of CPUs.
  • numnodes: The job's number of nodes.
  • partition: The job's partition.
  • reqnodelist: The job's list of required nodes.
  • timelimit: The job's time limit.
  • ...
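
The output of scontrol show job is a sequence of Key=Value pairs, which makes single properties easy to extract in scripts. The following is a rough sketch under the assumption that the value of interest contains no spaces; the helper name scontrol_field is made up for illustration.

```shell
# Hypothetical helper: print the value of one Key=Value pair from
# "scontrol show ..." output read on stdin. $1 is the key to look up.
# Returns a non-zero status if the key is not present.
scontrol_field() {
    tr -s ' ' '\n' | awk -v key="$1" '
        index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            found = 1
            exit
        }
        END { exit !found }'
}

# Demo on lines copied from the job listing above.
# On a cluster this would be: scontrol show job <jobid> | scontrol_field Reason
scontrol_field Reason <<'EOF'
JobId=709391 JobName=test.slurm
   JobState=PENDING Reason=PartitionTimeLimit Dependency=(null)
   RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
EOF
```

For the sample lines, the lookup prints PartitionTimeLimit. Looking up TimeLimit instead would print 30-00:00:00.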

Viewing node, partition, or reservation configurations

The configuration of individual nodes, partitions, or reservations can be viewed as follows:

>$ scontrol show node compute-2-0-1

NodeName=compute-2-0-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.83
   NodeAddr=compute-2-0-1 NodeHostName=compute-2-0-1 Version=16.05
   OS=Linux RealMemory=128656 AllocMem=0 FreeMem=123092 Sockets=20 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=219844 Weight=2000 Owner=N/A MCS_label=N/A
   BootTime=2019-10-16T12:11:54 SlurmdStartTime=2020-01-31T17:14:46
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

>$ scontrol show partition WORKQ

   AllowGroups=nits,itea_lille AllowAccounts=ALL AllowQos=ALL
   AllocNodes=idun-login3,idun-login2,idun-login1 Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF
   State=UP TotalCPUs=876 TotalNodes=39 SelectTypeParameters=NONE

>$ scontrol show reservation

ReservationName=res-nv-ikj2 StartTime=2020-03-11T10:00:00 EndTime=2020-03-11T13:00:00 Duration=03:00:00
   Nodes=compute-2-0-[1-5] NodeCnt=5 CoreCnt=100 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   Users=(null) Accounts=nv-ikj Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=tem-workshop StartTime=2020-03-02T07:00:00 EndTime=2020-03-02T20:00:00 Duration=13:00:00
   Nodes=compute-3-0-[1-3] NodeCnt=3 CoreCnt=108 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   Users=(null) Accounts=share-nv-fys-tem,nv-fys-tem Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

sacct - Displaying job history

The sacct command can be used to browse the job history:

>$ sacct -X --user=<username> --starttime=2020-02-01

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
644314       Healthy-4+      WORKQ    support        160  COMPLETED      0:0
644618       Healthy-4+      WORKQ    support        160  COMPLETED      0:0
644689       test.slurm      WORKQ   training          1  COMPLETED      0:0
652143       Healthy-4+      WORKQ    support        160 CANCELLED+      0:0
652144       Healthy-4+      WORKQ    support        160  COMPLETED      0:0
671493      WORKQ    support          1  COMPLETED      0:0
671494          Rscript      WORKQ       test          1     FAILED      2:0
671495      WORKQ share-iv-+          1     FAILED      2:0
671496      WORKQ share-iv-+          1     FAILED      1:0
671501      WORKQ share-iv-+          1     FAILED      1:0
671502      WORKQ share-iv-+          1     FAILED      1:0
671503      WORKQ share-iv-+          1     FAILED      1:0
671509      WORKQ share-iv-+          1     FAILED      1:0
671513      WORKQ share-iv-+          1     FAILED      1:0
677141       Healthy-4+      WORKQ    support        160  COMPLETED      0:0
680570             bash      WORKQ       test          1 CANCELLED+      0:0
680571             bash      WORKQ       test          1  COMPLETED      0:0
682786        first-job      WORKQ       test          1     FAILED     13:0
682787        first-job      WORKQ       test          1  COMPLETED      0:0
682789        first-job      WORKQ       test          1 CANCELLED+      0:0
695434       second-jo+      WORKQ    support          2  COMPLETED      0:0
700306       second-jo+      WORKQ    support          2 CANCELLED+      0:0
703551        first-job      WORKQ       test          1  COMPLETED      0:0
703555       second-jo+      WORKQ    support          2  COMPLETED      0:0
703628       test.slurm   V100-IDI    support          1     FAILED      1:0
703629       test.slurm   V100-IDI    support          1  COMPLETED      0:0
703630       test.slurm   V100-IDI    support          1     FAILED      1:0
703631       test.slurm   V100-IDI    support          1     FAILED      1:0
703632       test.slurm   V100-IDI    support          1     FAILED      1:0
703633       test.slurm   V100-IDI    support          1  COMPLETED      0:0
703674       test.slurm   V100-IDI    support          1  COMPLETED      0:0
709391       test.slurm      WORKQ    support          1  COMPLETED      0:0

The above example prints all jobs from user <username> after the 1st of February 2020. The sacct command provides many useful options:

  • --user=: Select only jobs from specified users.
  • --starttime=: Select only jobs after the specified time.
  • --endtime=: Select only jobs that finished before the specified time.
  • --partition=: Select only jobs from specified partitions.
  • --nodelist=: Select only jobs that ran on any of the specified nodes.
  • --parsable: Print output in a parsable format, with fields delimited by |.
  • --helpformat: Print the fields that can be specified in the --format option.
  • --format=: A comma separated list of fields that should be printed.
  • -X: Show only cumulative statistics for each job, not for its individual job steps.

If we are only interested in the job id, user, account, and node list of a job, then we could restrict the printing as follows:

>$ sacct -X -u reissman --starttime=2020-02-01 --format=jobid,user,account,nodelist%50

       JobID      User    Account                                           NodeList
------------ --------- ---------- --------------------------------------------------
644314        reissman    support                        compute-2-0-[4,10-12,15-18]
644618        reissman    support                             compute-5-0-[17-23,25]
644689        reissman   training                                      compute-2-0-9
652143        reissman    support                compute-2-0-[7-8,10,13,15,17,20-21]
652144        reissman    support                compute-2-0-[7-8,10,13,15,17,20-21]
671493        reissman    support                                     compute-2-0-27
671494        reissman       test                                     compute-2-0-27
671495        reissman share-iv-+                                     compute-2-0-26
671496        reissman share-iv-+                                     compute-2-0-26
671501        reissman share-iv-+                                     compute-2-0-26
671502        reissman share-iv-+                                     compute-2-0-26
671503        reissman share-iv-+                                     compute-2-0-26
671509        reissman share-iv-+                                     compute-2-0-26
671513        reissman share-iv-+                                     compute-2-0-26
677141        reissman    support                compute-2-0-[5,8,12,15,17,20,24-25]
680570        reissman       test                                     compute-2-0-23
680571        reissman       test                                     compute-2-0-22
682786        reissman       test                                     compute-2-0-27
682787        reissman       test                                     compute-2-0-27
682789        reissman       test                                      None assigned
695434        reissman    support                                  compute-2-0-[4-5]
700306        reissman    support                                  compute-2-0-[4-5]
703551        reissman       test                                     compute-2-0-26
703555        reissman    support                                compute-2-0-[19-20]
703628        reissman    support                                      None assigned
703629        reissman    support                                      compute-5-0-5
703630        reissman    support                                      None assigned
703631        reissman    support                                      None assigned
703632        reissman    support                                      None assigned
703633        reissman    support                                     compute-5-0-35
703674        reissman    support                                     compute-5-0-35
709391        reissman    support                                     compute-2-0-26
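
For scripted analysis of the job history, the delimited output of sacct is easier to process than the fixed-width table. The sketch below assumes the input was produced with --noheader, --parsable2, and --format=jobid,jobname,state,exitcode, so that each row has four |-separated fields. The helper name unsuccessful_jobs is made up, and the sample rows are taken from the tables above.

```shell
# Hypothetical helper: list all jobs whose state is not COMPLETED.
# Expects rows of the form  jobid|jobname|state|exitcode  on stdin.
unsuccessful_jobs() {
    awk -F'|' '$3 != "COMPLETED" { print $1, $3, $4 }'
}

# Demo on rows from the tables above, rewritten in the |-delimited layout.
# On a cluster this would be:
#   sacct -X --noheader --parsable2 --format=jobid,jobname,state,exitcode | unsuccessful_jobs
unsuccessful_jobs <<'EOF'
682786|first-job|FAILED|13:0
682787|first-job|COMPLETED|0:0
703628|test.slurm|FAILED|1:0
709391|test.slurm|COMPLETED|0:0
EOF
```

For the sample rows, this lists jobs 682786 and 703628 together with their state and exit code.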

sacctmgr - Displaying accounting information

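The sacctmgr command displays Slurm's accounting information, such as the accounts a user is associated with. For example, sacctmgr show associations user=<username> lists the accounts, partitions, and limits associated with a user. Modifying accounting information is generally reserved for system administrators.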

sprio - Displaying job priority factors

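The sprio command shows the factors that make up the scheduling priority of pending jobs. Depending on the cluster's priority configuration, these include, e.g., the job's age, fair-share, job size, partition, and QOS. For example, sprio --long lists all factors for every pending job, and sprio --jobs=<jobid> restricts the output to a single job.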