Practical commands

Basic SLURM commands

sinfo - general system state information

First we determine what partitions exist on the system, what nodes they include, and general system state. This information is provided by the sinfo command.

The * after a partition name indicates that this is the default partition for submitted jobs. The nodes of a partition can be in different states, such as idle (free), mix (partly allocated), alloc (fully allocated), drain or down. The information about each partition may be split over more than one line so that nodes in different states can be identified.

A * appended to a value in the STATE column marks nodes that are not responding.

login01:~$ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  testing      up      30:00      2   idle login[01-02]
  gpu          up 2-00:00:00      4    mix n[141-143,148]
  gpu          up 2-00:00:00      1  alloc n144
  gpu          up 2-00:00:00      3   idle n[145-147]
  short*       up 1-00:00:00     22 drain* n[014-021,026-031,044-051]
  short*       up 1-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  short*       up 1-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  short*       up 1-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]
  medium       up 2-00:00:00     22 drain* n[014-021,026-031,044-051]
  medium       up 2-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  medium       up 2-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  medium       up 2-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]
  long         up 4-00:00:00     22 drain* n[014-021,026-031,044-051]
  long         up 4-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  long         up 4-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  long         up 4-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]

The sinfo command has many options to easily let you view the information of interest to you in whatever format you prefer.

See the man page or type sinfo --help for more information.
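
As one illustration of post-processing this table, the sketch below sums the NODES column per partition for a single state. The function name count_nodes_by_state is an assumption made for this example, and the function relies on the default column layout shown above (partition in column 1, node count in column 4, state in column 5).

```shell
#!/bin/bash
# Sum the NODES column of sinfo's default table per partition for one state.
# Reads the table on stdin; $1 is the state to count (default: idle).
# count_nodes_by_state is a hypothetical helper, not a SLURM command.
count_nodes_by_state() {
  awk -v state="${1:-idle}" '
    NR > 1 && $5 == state {
      sub(/\*$/, "", $1)      # drop the default-partition marker
      nodes[$1] += $4
    }
    END { for (p in nodes) print p, nodes[p] }
  ' | sort
}
```

It would be called as `sinfo | count_nodes_by_state idle`. The state is matched exactly, so non-responding nodes reported as `drain*` are not counted as `drain`.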


squeue - information about submitted jobs

Next we determine what jobs exist on the system using the squeue command.

login01:~$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  16048     short xgboost_    user1 PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
  15739     short test3232    user2 PD       0:00      2 (Priority)
  15365     short   DHAI-b    user1 PD       0:00      1 (Priority)
  15349       gpu     gpu8     test  R       0:00      1 n141

The JOBID field shows the job ID assigned by SLURM. You can work with this number in your SLURM scripts via the $SLURM_JOB_ID variable.

The PARTITION field shows the partition in which the job is running or queued.

The NAME field shows the job name specified by the user.

The USER field shows the username of the person who submitted the job.

The ST field gives the job state. The most common job states are:

  • running - abbreviated R
  • pending - abbreviated PD

The TIME field shows how long the job has been running, in the format days-hours:minutes:seconds.

The NODES field shows the number of allocated nodes.

The NODELIST(REASON) field indicates where the job is running or the reason it is still pending. Typical reasons for pending jobs are:

  • Resources (waiting for resources to become available)
  • Priority (queued behind a higher priority job).

The squeue command has many options to easily let you view the information of interest to you in whatever format you prefer. The most common options include viewing jobs of a specific user (-u) and/or jobs running on a specific node (-w).

login01:~$ squeue -u user_1
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  120387      long   long_job   user_1  R    1:14:18      1 n008
  120396     short  short_job   user_1  R       0:34      2 n[024-025]

login01:~$ squeue -w n001
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  107491         long jif3d_mt   user_2  R 3-13:28:20      1 n001
  108441_76      long long_job   user_1  R 2-06:50:21      1 n001
  108441_82      long long_job   user_1  R 2-06:50:21      1 n001
  120398         short     test   user_3  R       1:36     1 n001
  120379         short     test   user_3  R       5:23     1 n001
  120333         short     test   user_3  R      13:39     1 n001
  120318         short     test   user_3  R      18:00     1 n001
  120272         short     test   user_3  R      28:04     1 n001
  120242         short     test   user_3  R      35:13     1 n001

See the man page for more information or type squeue --help.
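
For a quick overview of a long queue, the states can be tallied. The sketch below assumes one state name per line on stdin, which `squeue -h -o "%T"` provides (`-h` suppresses the header, `%T` prints the state in its long form). The function name summarize_states is hypothetical.

```shell
#!/bin/bash
# Count jobs per state. Expects one state name per line on stdin,
# for example from: squeue -h -u "$USER" -o "%T"
# summarize_states is a hypothetical helper, not a SLURM command.
summarize_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}
```

Piping the squeue call from the comment into summarize_states would print lines of the form `PENDING 2`.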


srun - run parallel jobs

It is possible to create a resource allocation and launch the tasks for a job step in a single command line using the srun command. Depending upon the MPI implementation used, MPI jobs may also be launched in this manner. For instance, srun -N 4 -l /bin/hostname executes /bin/hostname on four nodes (-N 4) and includes task numbers in the output (-l). The resources you request must fit within the limits of the chosen partition: if you specify --partition=short and --time=2-00:00:00, you'll get an error because the requested time exceeds the one-day limit of that partition. The simplest form, shown below, starts an interactive shell with default resources:

login01:~$ srun --pty /bin/bash

This way you can tailor your request to fit both the needs of your job and the limits of the partitions.

login01:~$ srun --partition=short --export=ALL --nodes=1 --ntasks=8 --cpus-per-task=4 --mem=128G --time=02:00:00 /bin/bash
login01:~$ srun --partition=gpu --export=ALL --nodes=1 --ntasks=16 --gres=gpu:1 --cpus-per-task=1 --mem=64G --time=02:00:00 /bin/bash

See the man page for more information or type srun --help.
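
To check a requested time against a partition's TIMELIMIT before submitting, both values can be converted to seconds. The sketch below is a minimal version of that comparison. The function names slurm_time_to_seconds and fits_limit are assumptions of this example, and it handles the forms minutes, minutes:seconds, hours:minutes:seconds and days-hours[:minutes[:seconds]].

```shell
#!/bin/bash
# Convert a SLURM time string to seconds (hypothetical helper).
slurm_time_to_seconds() {
  local t=$1 days=0 h=0 m=0 s=0 a b c
  if [[ $t == *-* ]]; then
    days=${t%%-*}; t=${t#*-}
    IFS=: read -r h m s <<< "$t"              # days-hours[:minutes[:seconds]]
  else
    IFS=: read -r a b c <<< "$t"
    if [[ -n $c ]]; then h=$a; m=$b; s=$c     # hours:minutes:seconds
    elif [[ -n $b ]]; then m=$a; s=$b         # minutes:seconds
    else m=$a                                 # minutes
    fi
  fi
  # 10# forces base 10 so that values such as "08" are not read as octal
  echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

# Succeed if the requested time ($1) does not exceed the limit ($2).
fits_limit() {
  (( $(slurm_time_to_seconds "$1") <= $(slurm_time_to_seconds "$2") ))
}
```

With the limits from the sinfo listing, `fits_limit 02:00:00 1-00:00:00` succeeds for the short partition, while `fits_limit 2-00:00:00 1-00:00:00` fails.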


sbatch - submit parallel jobs

A more common mode of operation is to submit a script for later execution with the sbatch command. In this example the script sbatch_submit.sh is submitted to nodes n066 and n067 (--nodelist "n[066-067]", note the use of a node range expression), where the subsequent job steps will spawn four tasks per node with 4 CPUs each. The output will appear in the file stdout.<SLURM_JOBID>.out ("--output stdout.%J.out"). The script also contains an embedded time limit for the job (--time).

login01:~$ cat sbatch_submit.sh
  #!/bin/bash
  #SBATCH --account=<project_name>
  #SBATCH --partition=short
  #SBATCH --time=01:00:00
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4
  #SBATCH --cpus-per-task=4
  #SBATCH --mem=64G
  #SBATCH --nodelist=n[066-067]
  #SBATCH --output stdout.%J.out 
  #SBATCH --error stderr.%J.out 


  ## End of sbatch section
  ## Commands to be executed during the run of the script

login01:~$ sbatch sbatch_submit.sh
  Submitted batch job 38793

Other options can be supplied as desired by using a prefix of “#SBATCH” followed by the option at the beginning of the script (before any commands to be executed in the script).

Alternatively, options can be provided to sbatch on the command line:

Submitting jobs with sbatch

login01:~$ cat sbatch_submit.sh
 #!/bin/bash
 #SBATCH --account=<project_name>
 #SBATCH --partition=short

 ## End of sbatch section
 ## Commands to be executed during the run of the script

login01:~$ sbatch --nodes 2 --nodelist "n[066-067]" --ntasks-per-node=4 --cpus-per-task=4 --mem=64G --output stdout.%J.out --error stderr.%J.out sbatch_submit.sh
  Submitted batch job 38794

Options supplied on the command line override any options specified within the script.

See the man page for more information or type sbatch --help.
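
Scripts that submit jobs often need the new job ID, for instance to chain a follow-up job. The sketch below extracts it from the "Submitted batch job" message shown above. The function name job_id_from_sbatch and the script next_step.sh mentioned in the comments are assumptions of this example.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation message on stdin.
# job_id_from_sbatch is a hypothetical helper, not a SLURM command.
job_id_from_sbatch() {
  awk '/^Submitted batch job/ { print $4 }'
}

# Possible use on the login node (next_step.sh is a placeholder name):
#   jobid=$(sbatch sbatch_submit.sh | job_id_from_sbatch)
#   sbatch --dependency=afterok:"$jobid" next_step.sh
```

As an alternative to parsing the message, `sbatch --parsable` prints the job ID without the surrounding text.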


scancel - terminate running jobs

The command scancel is used to signal or cancel jobs, job arrays or job steps. A job or job step can only be signaled by the owner of that job or root. If an attempt is made by an unauthorized user to signal a job or job step, an error message will be printed and the job will not be terminated.

login01:~$ scancel --user <username>

Jobs can generally be cancelled using the job name and/or the SLURM job ID.

login01:~$ scancel --name "test_job"
#OR
login01:~$ scancel 666

scancel can also be used to cancel all your jobs matching a specific criterion, e.g. state or partition.

login01:~$ scancel --state PENDING --user <username>

An arbitrary number of jobs or job steps may be signaled using job specification filters or a space separated list of specific job and/or job step IDs. If the job ID of a job array is specified with an array ID value then only that job array element will be cancelled. If the job ID of a job array is specified without an array ID value then all job array elements will be cancelled. While a heterogeneous job is in a PENDING state, only the entire job can be cancelled rather than its individual components.

See the man page for more information or type scancel --help.
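
To cancel only some elements of a job array, their IDs can be passed as a space separated list in the jobid_arrayindex form seen in the squeue listing (e.g. 108441_76). The sketch below generates such a list for a contiguous index range. The function name array_element_ids is hypothetical.

```shell
#!/bin/bash
# Print the IDs of array elements first..last of a job array, one per line.
# $1 = array job ID, $2 = first index, $3 = last index.
# array_element_ids is a hypothetical helper, not a SLURM command.
array_element_ids() {
  local job=$1 first=$2 last=$3 i
  for (( i = first; i <= last; i++ )); do
    printf '%s_%d\n' "$job" "$i"
  done
}

# Possible use: scancel $(array_element_ids 108441 76 82)
```

Printing the list first and checking it before handing it to scancel is a cheap safeguard against cancelling the wrong elements.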


Other SLURM commands

seff - job efficiency report

This command can be used to find the job efficiency report for jobs which have completed and left the queue. If you run this command while the job is still in the R (running) state, it might report incorrect information.

The seff utility will help you track the CPU/Memory efficiency. The command is invoked as:

login01:~$ seff <jobid>

Jobs with different CPU/Memory efficiency
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 32
  CPU Utilized: 41-01:38:14
  CPU Efficiency: 99.64% of 41-05:09:44 core-walltime
  Job Wall-clock time: 1-11:19:38
  Memory Utilized: 2.73 GB
  Memory Efficiency: 2.13% of 128.00 GB
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 16
  CPU Utilized: 14:24:49
  CPU Efficiency: 23.72% of 2-12:46:24 core-walltime
  Job Wall-clock time: 03:47:54
  Memory Utilized: 193.04 GB
  Memory Efficiency: 75.41% of 256.00 GB
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 87-16:58:22
  CPU Efficiency: 86.58% of 101-07:16:16 core-walltime
  Job Wall-clock time: 1-13:59:19
  Memory Utilized: 212.39 GB
  Memory Efficiency: 82.96% of 256.00 GB

The following report illustrates a very inefficient job in terms of both CPU and memory usage (below 4%): the user essentially wasted 4 hours of computation while occupying a full node and its 64 cores.

login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 00:08:33
  CPU Efficiency: 3.55% of 04:00:48 core-walltime
  Job Wall-clock time: 00:08:36
  Memory Utilized: 55.84 MB
  Memory Efficiency: 0.05% of 112.00 GB
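
The CPU efficiency in these reports is the CPU time used divided by the core-walltime, i.e. the number of cores multiplied by the wall-clock time. The sketch below redoes that arithmetic for values copied from a report. The function names to_seconds and cpu_efficiency are assumptions of this example, and the times must be given as [D-]HH:MM:SS.

```shell
#!/bin/bash
# Convert [D-]HH:MM:SS to seconds (hypothetical helper).
to_seconds() {
  local t=$1 d=0 h m s
  if [[ $t == *-* ]]; then d=${t%%-*}; t=${t#*-}; fi
  IFS=: read -r h m s <<< "$t"
  # 10# forces base 10 so that values such as "08" are not read as octal
  echo $(( 10#$d * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# $1 = "CPU Utilized", $2 = total number of cores, $3 = "Job Wall-clock time".
# Prints the CPU efficiency in percent with two decimals.
cpu_efficiency() {
  local used total
  used=$(to_seconds "$1")
  total=$(( $2 * $(to_seconds "$3") ))
  LC_ALL=C awk -v u="$used" -v t="$total" 'BEGIN { printf "%.2f\n", 100 * u / t }'
}
```

For the second report above, `cpu_efficiency 14:24:49 16 03:47:54` prints 23.72, matching the value reported by seff.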

sacct - job accounting information

The sacct command can be used to display status information about a user's historical jobs, based on the user's name and/or SLURM job ID. By default, sacct will only bring up information about the user's jobs from the current day. With the --starttime flag the command will look further back, to the given date:

login01:~$ sacct --user=<username> --starttime=YYYY-MM-DD

The --format flag can be used to choose the command output (full list of variables can be found with the --helpformat flag):

login01:~$ sacct --user=<username> --starttime=YYYY-MM-DD --jobs=<job-id> --format=var_1,var_2, ...

sacct format variable names

Variable       Description
Account        The account the job ran under.
AveCPU         Average (system + user) CPU time of all tasks in job.
AveRSS         Average resident set size of all tasks in job.
AveVMSize      Average virtual memory size of all tasks in job.
CPUTime        Formatted (elapsed time * CPU) count used by a job or step.
Elapsed        The job's elapsed time, formatted as DD-HH:MM:SS.
ExitCode       The exit code returned by the job script or salloc.
JobID          The ID of the job.
JobName        The name of the job.
MaxRSS         Maximum resident set size of all tasks in job.
MaxVMSize      Maximum virtual memory size of all tasks in job.
MaxDiskRead    Maximum number of bytes read by all tasks in the job.
MaxDiskWrite   Maximum number of bytes written by all tasks in the job.
ReqCPUS        Requested number of CPUs.
ReqMem         Requested amount of memory.
ReqNodes       Requested number of nodes.
NCPUS          The number of CPUs used in a job.
NNodes         The number of nodes used in a job.
User           The username of the person who ran the job.
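
When the output is meant for a script rather than for reading, a delimited format is easier to process. The sketch below filters pipe-delimited records for jobs that did not end as COMPLETED. It assumes the field order JobID|JobName|State|Elapsed; State is a sacct field not listed in the table above, and the function name unsuccessful_jobs is hypothetical.

```shell
#!/bin/bash
# Print "JobID State" for every job that did not finish as COMPLETED.
# Expects lines "JobID|JobName|State|Elapsed" on stdin, for example from:
#   sacct --user="$USER" --starttime=YYYY-MM-DD -X -n -P \
#         --format=JobID,JobName,State,Elapsed
# (-X: allocations only, -n: no header, -P: pipe-delimited output)
unsuccessful_jobs() {
  awk -F'|' '$3 != "COMPLETED" { print $1, $3 }'
}
```

Jobs that are still running or pending would also be listed, since the filter only excludes the COMPLETED state.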

sprojects - view projects information

This command displays information about projects available to a user and project details, such as available allocations, shared directories and members of the project team.

The sprojects script shows the available Slurm accounts (projects) for the selected user ID. If no user is specified (with -u), the script displays the information for the current user.

Show available accounts for the current user

user1@login01:~$ sprojects 
   The following slurm accounts are available for user user1:
   p70-23-t

The -a option forces the script to display just the allocations (in core-hours or GPU-hours) as SPENT/AWARDED.

Show all available allocations for the current user

login01:~$ sprojects -a 
   +=================+=====================+
   |     Project     |     Allocations     |
   +-----------------+---------------------+
   | p70-23-t        | CPU:      10/50000  |
   |                 | GPU:       0/12500  |
   +=================+=====================+

With -f option the script will display more details (including available allocations).

Show full info for the current user

login01:~$ sprojects -f 
   +=================+=========================+============================+=====================+
   |     Project     |       Allocations       |      Shared storages       |    Project users    |
   +-----------------+-------------------------+----------------------------+---------------------+
   | p371-23-1       | CPU:    182223/500000   | /home/projects/p371-23-1   | user1               |
   |                 | GPU:       542/1250     | /scratch/p371-23-1         | user2               |
   |                 |                         |                            | user3               |
   +-----------------+-------------------------+----------------------------+---------------------+
   | p81-23-t        | CPU:     50006/50000    | /home/projects/p81-23-t    | user1               |
   |                 | GPU:       766/781      | /scratch/p81-23-t          | user2               |
   +-----------------+-------------------------+----------------------------+---------------------+
   | p70-23-t        | CPU:    485576/5000000  | /home/projects/p70-23-t    | user1               |
   |                 | GPU:       544/31250    | /scratch/p70-23-t          | user2               |
   |                 |                         |                            | user4               |
   |                 |                         |                            | user5               |
   |                 |                         |                            | user6               |
   |                 |                         |                            | user7               |
   +=================+=========================+============================+=====================+
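
Since allocations are printed as SPENT/AWARDED, the remaining budget is simply the difference of the two numbers. A minimal sketch of that calculation follows. The function name allocation_remaining is an assumption of this example and is not part of sprojects.

```shell
#!/bin/bash
# Print the remaining allocation for a value in SPENT/AWARDED form.
# allocation_remaining is a hypothetical helper, not part of sprojects.
allocation_remaining() {
  local spent=${1%%/*} awarded=${1#*/}
  echo $(( awarded - spent ))
}
```

For the table above, `allocation_remaining 182223/500000` prints 317777, while `allocation_remaining 50006/50000` prints -6, i.e. the CPU allocation of project p81-23-t is already exhausted.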

sprio - jobs scheduling priority information

Demand for HPC resources typically surpasses supply, so a method that establishes the order in which jobs run has to be implemented. By default, the scheduler allocates on a simple "first-in, first-out" (FIFO) basis. However, the application of rules and policies can change the priority of a job, which is expressed to the scheduler as a number. The sprio command can be used to view the priorities (and their components) of waiting jobs.

Sorting all waiting jobs by their priority

login01:~$ sprio -S -y
  JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION
  36386 ncpu            3777          0          1       2679         99       1000
  36387 ncpu            3777          0          0       2679         99       1000
  36339 ncpu            2910          0         25       1786         99       1000
  36388 ncpu            2885          0          0       1786         99       1000
  36389 ncpu            2885          0          0       1786         99       1000
  36390 ncpu            2885          0          0       1786         99       1000

See the slurm documentation page for more information or type sprio --help.
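
With multifactor priority, the PRIORITY value is the sum of the weighted components printed to its right. The sketch below recomputes that sum from the listing so the two numbers can be compared. It assumes the column layout shown above (components start at column 4), and the function name component_sums is hypothetical.

```shell
#!/bin/bash
# For each job print: JOBID, reported PRIORITY, sum of the component columns.
# Reads sprio output in the layout shown above on stdin.
# component_sums is a hypothetical helper, not a SLURM command.
component_sums() {
  awk 'NR > 1 {
    sum = 0
    for (i = 4; i <= NF; i++) sum += $i   # SITE, AGE, FAIRSHARE, JOBSIZE, PARTITION
    print $1, $3, sum
  }'
}
```

For job 36339 this prints `36339 2910 2910`. For the first two jobs of the listing the sum differs from the reported priority by one or two units, presumably because the displayed components are rounded.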

sshare - list shares of associations

This command displays fairshare information based on the hierarchical account structure. In our case we will use it to determine the fairshare factor used in the job priority calculation. Since the fairshare factor also depends on the account (i.e. the user's project), we have to specify the account too.

In this case we know that our user1 has access to the project called "p70-23-t". Therefore we can display the fairshare factor (shown here in the last column) as follows:

login01:~$ sshare -A p70-23-t 
  Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
  -------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
  p70-23-t                                 1    0.333333   122541631      0.364839            
  p70-23-t               user1             1    0.111111     4798585      0.039159   0.263158 

You can display all project accounts available to you using the sprojects command.

See the slurm documentation for more information or type sshare --help.
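
If only the fairshare factor itself is needed, it can be picked out of the table. The sketch below prints the last column of the row belonging to a given user. It assumes the layout shown above (user name in the second column), and the function name fairshare_of is hypothetical.

```shell
#!/bin/bash
# Print the FairShare value (last column) of the row for user $1.
# Reads sshare output in the layout shown above on stdin.
# fairshare_of is a hypothetical helper, not a SLURM command.
fairshare_of() {
  awk -v user="$1" '$2 == user { print $NF }'
}

# Possible use: sshare -A p70-23-t | fairshare_of user1
```

The account summary row has no user name, so its second column holds the RawShares number and does not match a regular user name.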


salloc - allocate resources and spawn a shell

The salloc command serves to allocate resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node), for later use. After you submit the salloc command, the terminal is blocked until the allocation is granted. The session then still remains on the login node; only commands launched with srun are executed on the allocated compute nodes. A task sent with srun can start immediately, since the resources are already allocated.

login01:~$ hostname
  login01.devana.local

login01:~$ salloc --nodes=1 --ntasks-per-node=4 --mem-per-cpu=2G --time=01:00:00
  salloc: Pending job allocation 63752579
  salloc: job 63752579 queued and waiting for resources
  salloc: job 63752579 has been allocated resources
  salloc: Granted job allocation 63752579

login01:~$ hostname
  login01.devana.local

login01:~$ srun hostname
  n007

Note that salloc starts a shell on the login node, not on the allocated node.

See the man page for more information or type salloc --help.
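
Because the prompt looks the same before and after salloc, it is easy to lose track of whether the current shell holds an allocation. The shell started by salloc has SLURM_JOB_ID set, which the sketch below uses. The function name allocation_status is an assumption of this example.

```shell
#!/bin/bash
# Report whether the current shell runs inside a SLURM allocation,
# based on the SLURM_JOB_ID variable that salloc sets for its shell.
# allocation_status is a hypothetical helper, not a SLURM command.
allocation_status() {
  if [[ -n ${SLURM_JOB_ID:-} ]]; then
    echo "job ${SLURM_JOB_ID}"
  else
    echo "none"
  fi
}
```

In the session above it would print `none` before the salloc call and `job 63752579` afterwards.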


sattach - signal and attach to running jobs

The sattach command allows you to connect the standard input, output, and error streams of a running job step to your current terminal session.

login01:~$ sattach 12345.5
   [...output of your job...]
n007:~$ [Ctrl-C]
login01:~$

Press Ctrl-C to detach from the current session. Please note that you have to give the job ID as well as the step ID. In most cases, simply append .0 to your job ID.

See the man page for more information or type sattach --help.


sbcast - transfer file to local disk on the node

Sometimes, it might be beneficial to copy the executable to a local path on the compute nodes allocated to the job, instead of loading it onto the compute nodes from a slow file system such as the home.

Users can copy the executable to the compute nodes before the actual computation using the sbcast command or the srun --bcast flag. Making the executable available local to the compute node, e.g. in /tmp could speed up the job startup time compared to running executables from a network file system.

n007:~$ sbcast exe_on_slow_fs /tmp/${USER}_exe_filename
n007:~$ srun /tmp/${USER}_exe_filename

File permissions

Make sure to choose a temporary file name unique to your computation (e.g. include your username with the variable $USER), or you may receive permission denied errors if trying to overwrite someone else's files.

There is no real downside to broadcasting the executable with Slurm, so it can be used with jobs at any scale, especially if you experience timeout errors associated with MPI_Init(). Besides the executable, you can also sbcast other large files, such as input files, shared libraries, etc. It would be easier to create a tar file to sbcast, then untar on the compute nodes before the actual srun instead of sbcasting multiple individual files.

See the man page for more information or type sbcast --help.
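
The tar-based approach described above could be sketched as follows. Only the packing step is written as a function; the sbcast and srun calls are shown as comments because they only work inside a job. The function name make_bundle, the file names and the target paths are assumptions of this example.

```shell
#!/bin/bash
# Pack the given files into one archive so that a single sbcast call can
# ship everything. The archive name includes the user name to keep it
# unique. Prints the archive path. make_bundle is a hypothetical helper.
make_bundle() {
  local archive="${TMPDIR:-/tmp}/${USER:-user}_bundle.tar"
  tar -cf "$archive" "$@" && echo "$archive"
}

# Possible use inside a batch script (file names are placeholders):
#   bundle=$(make_bundle my_exe input.dat)
#   sbcast "$bundle" "/tmp/${USER}_bundle.tar"
#   srun --ntasks-per-node=1 mkdir -p "/tmp/${USER}_run"
#   srun --ntasks-per-node=1 tar -xf "/tmp/${USER}_bundle.tar" -C "/tmp/${USER}_run"
#   srun "/tmp/${USER}_run/my_exe"
```

Unpacking with one task per node avoids several tasks on the same node extracting the archive into the same directory at once.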


sstat - display resources utilized by a job

The sstat command allows users to easily pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as such:

login01:~$ sstat --jobs=<jobid>

Showing information about running job

login01:~$ sstat --jobs=<jobid>
  JobID         MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot

  152295.0          2884M           n143              0   2947336K    253704K       n143          0    253704K       11         n143              0         11   00:06:04       n143          0   00:06:04        1     10.35M       Unknown       Unknown       Unknown              0     29006427            n143               0     29006427     11096661             n143                0     11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+

By default, sstat pulls up significantly more information than is usually needed. To remedy this, we can use the --format flag to choose what we want in our output. Some of the available variables are listed in the table further below.

Showing formatted information about running job

login01:~$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
  JobID          NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
  ------------ -------- -------------------- ---------- ---------- ---------- ----------
  152295.0            1                 n143 183574492K 247315988K    118664K    696216K

If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:

#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...

# launch job steps
srun <command to run> # that would be step 0
srun <command to run> # that would be step 1

The main metrics you may be interested in are listed below.

Variable    Description
avecpu      Average CPU time of all tasks in job.
averss      Average resident set size of all tasks.
avevmsize   Average virtual memory of all tasks in a job.
jobid       The ID of the job.
maxrss      Maximum resident set size of all tasks in job.
maxvmsize   Maximum virtual memory size of all tasks in job.
ntasks      Number of tasks in a job.

A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the slurm documentation on sstat.
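
The memory values reported by sstat carry a unit suffix (e.g. 253704K or 2884M), which makes them awkward to compare. The sketch below converts such a value to MiB. It assumes that the suffixes K, M and G are powers of 1024 and that a value without suffix is in bytes. The function name to_mib is hypothetical.

```shell
#!/bin/bash
# Convert a memory value with optional K/M/G suffix to MiB (two decimals).
# Assumption: suffixes are 1024-based; no suffix means bytes.
# to_mib is a hypothetical helper, not a SLURM command.
to_mib() {
  LC_ALL=C awk -v v="$1" 'BEGIN {
    unit = substr(v, length(v))
    num  = substr(v, 1, length(v) - 1)
    if      (unit == "K") factor = 1 / 1024
    else if (unit == "M") factor = 1
    else if (unit == "G") factor = 1024
    else { num = v; factor = 1 / (1024 * 1024) }
    printf "%.2f\n", num * factor
  }'
}
```

For example, `to_mib 2048K` prints 2.00 and `to_mib 2884M` prints 2884.00.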


scontrol - administrative tool

The scontrol command can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration. It can also be used by system administrators to make configuration changes. A couple of examples are shown below.

Long partition information

login01:~$ scontrol show partitions long
  PartitionName=long
    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
    AllocNodes=ALL Default=NO QoS=N/A
    DefaultTime=4-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
    MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
    Nodes=n[001-140]
    PriorityJobFactor=0 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
    OverTimeLimit=NONE PreemptMode=OFF
    State=UP TotalCPUs=8960 TotalNodes=140 SelectTypeParameters=NONE
    JobDefaults=(null)
    DefMemPerCPU=4000 MaxMemPerNode=UNLIMITED
    TRES=cpu=8960,mem=35000G,node=140,billing=8960
    TRESBillingWeights=CPU=1.0,Mem=0.256G

Node information

login01:~$ scontrol show node n148
  NodeName=n148 Arch=x86_64 CoresPerSocket=32
    CPUAlloc=1 CPUEfctv=64 CPUTot=64 CPULoad=1.04
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=gpu:A100-SXM4-40GB:4
    NodeAddr=n148 NodeHostName=n148 Version=22.05.7
    OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022
    RealMemory=256000 AllocMem=64000 FreeMem=67242 Sockets=2 Boards=1
    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=gpu
    BootTime=2023-09-06T10:29:48 SlurmdStartTime=2023-09-18T14:25:33
    LastBusyTime=2023-09-18T14:02:52
    CfgTRES=cpu=64,mem=250G,billing=64,gres/gpu=4
    AllocTRES=cpu=1,mem=62.50G,gres/gpu=1
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

See the man page for more information or type scontrol --help.
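
The scontrol output consists of Key=Value pairs, so a single setting can be pulled out with standard text tools. The sketch below prints the value of one key. The function name scontrol_field is an assumption of this example, and values that contain spaces (such as the OS line) are cut at the first space.

```shell
#!/bin/bash
# Print the value of the first "Key=Value" pair with the given key.
# Reads scontrol show output on stdin; $1 is the key name.
# scontrol_field is a hypothetical helper, not a SLURM command.
scontrol_field() {
  tr -s ' ' '\n' |
    awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Possible use: scontrol show partitions long | scontrol_field MaxTime
```

Applied to the partition listing above, the call in the comment would print 4-00:00:00, the time limit of the long partition.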