Practical commands

Basic SLURM commands

sinfo - general system state information

First we determine what partitions exist on the system, what nodes they include, and general system state. This information is provided by the sinfo command.

The * in the partition name indicates that it is the default partition for submitted jobs. The STATE column shows the state of the listed nodes, e.g. idle (available), mix (partially allocated), alloc (fully allocated) or drain (removed from service). The information about each partition may be split over more than one line so that nodes in different states can be identified.

An asterisk appended to a state in the STATE column (e.g. drain*) marks nodes that are not responding.

login01:~$ sinfo
  PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
  testing      up      30:00      2   idle login[01-02]
  gpu          up 2-00:00:00      4    mix n[141-143,148]
  gpu          up 2-00:00:00      1  alloc n144
  gpu          up 2-00:00:00      3   idle n[145-147]
  short*       up 1-00:00:00     22 drain* n[014-021,026-031,044-051]
  short*       up 1-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  short*       up 1-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  short*       up 1-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]
  medium       up 2-00:00:00     22 drain* n[014-021,026-031,044-051]
  medium       up 2-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  medium       up 2-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  medium       up 2-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]
  long         up 4-00:00:00     22 drain* n[014-021,026-031,044-051]
  long         up 4-00:00:00     10    mix n[001-002,025,052,058,067,073,079,081,105]
  long         up 4-00:00:00     86  alloc n[003-008,012-013,022-024,032-033,036-043,053-057,059-066,068-072,074,077-078,080,082-094,097-099,102-104,106-116,119-127,131,135-136,140]
  long         up 4-00:00:00     22   idle n[009-011,034-035,075-076,095-096,100-101,117-118,128-130,132-134,137-139]

The sinfo command has many options to easily let you view the information of interest to you in whatever format you prefer.

See the man page or type sinfo --help for more information.
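Because plain-text output like this is easy to post-process, a quick shell pipeline can answer questions such as "how many idle nodes does each partition have?". The sketch below runs on a captured sample (on the cluster you would pipe `sinfo -h`, which suppresses the header, straight into awk); the sample values are illustrative:

```shell
# Sample of headerless `sinfo -h` output (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).
sinfo_sample='short*  up 1-00:00:00  22 drain* n[014-021,026-031,044-051]
short*  up 1-00:00:00  22   idle n[009-011,034-035]
medium  up 2-00:00:00  22   idle n[009-011,034-035]'

# Sum the NODES column (field 4) for rows whose STATE (field 5) is "idle".
echo "$sinfo_sample" | awk '$5 == "idle" { idle[$1] += $4 } END { for (p in idle) print p, idle[p] }'
```

On the live system, `sinfo -h | awk ...` with the same awk program gives the current counts.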

squeue - information about submitted jobs

Next we determine what jobs exist on the system using the squeue command.

login01:~$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  16048     short xgboost_    user1 PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
  15739     short test3232    user2 PD       0:00      2 (Priority)
  15365     short   DHAI-b    user1 PD       0:00      1 (Priority)
  15349       gpu     gpu8     test  R       0:00      1 n141


The JOBID field shows the job's SLURM ID. You can reference this number in your SLURM scripts via the $SLURM_JOB_ID variable.


The PARTITION field shows the partition the job is running in.


The NAME field shows the job name specified by the user.


The USER field shows the username of the person who submitted the job.


The ST field gives the job state. The most common job states are:

  • running state - an abbreviation R
  • pending state - an abbreviation PD


The TIME field shows how long a job has been running, in the format days-hours:minutes:seconds.


The NODES field shows the number of allocated nodes.


The NODELIST(REASON) field indicates where the job is running or the reason it is still pending. Typical reasons for pending jobs are:

  • Resources (waiting for resources to become available)
  • Priority (queued behind a higher-priority job)

The squeue command has many options to easily let you view the information of interest to you in whatever format you prefer. The most common options include viewing jobs of a specific user (-u) and/or jobs running on a specific node (-w).

login01:~$ squeue -u user1
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  120387      long   long_job   user_1  R    1:14:18      1 n008
  120396     short  short_job   user_1  R       0:34      2 n[024-025]

login01:~$ squeue -w n001
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  107491         long jif3d_mt   user_2  R 3-13:28:20      1 n001
  108441_76      long long_job   user_1  R 2-06:50:21      1 n001
  108441_82      long long_job   user_1  R 2-06:50:21      1 n001
  120398         short     test   user_3  R       1:36     1 n001
  120379         short     test   user_3  R       5:23     1 n001
  120333         short     test   user_3  R      13:39     1 n001
  120318         short     test   user_3  R      18:00     1 n001
  120272         short     test   user_3  R      28:04     1 n001
  120242         short     test   user_3  R      35:13     1 n001

See the man page for more information or type squeue --help.
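As a small worked example of filtering this output, the pipeline below counts a user's jobs per state. It is shown on a captured sample of headerless `squeue -h` output, so the field positions match the columns described above; the sample values are illustrative:

```shell
# Sample of `squeue -h` output: JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON).
squeue_sample='16048 short xgboost_ user1 PD 0:00 1 (Priority)
15739 short test3232 user1 PD 0:00 2 (Priority)
15349 gpu   gpu8     user1  R 0:10 1 n141'

# Tally the ST column (field 5): running vs. pending jobs.
echo "$squeue_sample" | awk '{ count[$5]++ } END { for (s in count) print s, count[s] }' | sort
```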

srun - run parallel jobs

It is possible to create a resource allocation and launch the tasks for a job step in a single command line using the srun command. Depending upon the MPI implementation used, MPI jobs may also be launched in this manner. For example, the following command executes /bin/hostname on four nodes (-N 4) and includes task numbers in the output (-l):

login01:~$ srun -N 4 -l /bin/hostname

srun can also start an interactive shell on a compute node:

login01:~$ srun --pty /bin/bash

Explicit options let you tailor the request to fit both the needs of your job and the limits of the partitions. Note that the request must respect those limits: for example, specifying --partition=short together with --time=2-00:00:00 produces an error, because the requested time exceeds the one-day limit of that partition.

login01:~$ srun --partition=short --export=ALL --nodes=1 --ntasks=8 --cpus-per-task=4 --mem=128G --time=02:00:00 --pty /bin/bash
login01:~$ srun --partition=gpu --export=ALL --nodes=1 --ntasks=16 --gres=gpu:1 --cpus-per-task=1 --mem=64G --time=02:00:00 --pty /bin/bash

See the man page for more information or type srun --help.

sbatch - submit parallel jobs

The more common mode of operation is to submit a script for later execution with the sbatch command. In this example the script is submitted to nodes n066 and n067 (--nodelist=n[066-067], note the use of a node range expression), on which the subsequent job steps will spawn four tasks per node with 4 CPUs each. The output will appear in the file stdout.<SLURM_JOB_ID>.out (--output stdout.%J.out). The time limit for the job is embedded within the script itself.

login01:~$ cat <job_script>
  #!/bin/bash
  #SBATCH --account=<project_name>
  #SBATCH --partition=short
  #SBATCH --time=01:00:00
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4
  #SBATCH --cpus-per-task=4
  #SBATCH --mem=64G
  #SBATCH --nodelist=n[066-067]
  #SBATCH --output stdout.%J.out 
  #SBATCH --error stderr.%J.out 

  ## End of sbatch section
  ## Commands to be executed during the run of the script

login01:~$ sbatch <job_script>
  Submitted batch job 38793

Other options can be supplied as desired by using a prefix of “#SBATCH” followed by the option at the beginning of the script (before any commands to be executed in the script).

Alternatively, options can be provided to sbatch on the command line:

Submitting jobs with sbatch

login01:~$ cat <job_script>
 #!/bin/bash
 #SBATCH --account=<project_name>
 #SBATCH --partition=short

 ## End of sbatch section
 ## Commands to be executed during the run of the script

login01:~$ sbatch --nodes=2 --nodelist="n[066-067]" --ntasks-per-node=4 --cpus-per-task=4 --mem=64G --output stdout.%J.out --error stderr.%J.out <job_script>
  Submitted batch job 38794

Options supplied on the command line override any options specified within the script.
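Putting the pieces together, a minimal complete job script could look like the sketch below. The account value is a placeholder and the echo line stands in for the real workload; note that #SBATCH directives are ordinary comments to the shell, so Slurm reads them while bash ignores them:

```shell
#!/bin/bash
#SBATCH --account=<project_name>   # placeholder: your project account
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --output=stdout.%J.out
#SBATCH --error=stderr.%J.out

## End of sbatch section; commands executed during the run follow.
## $SLURM_JOB_ID is set by Slurm inside a job; outside one it is empty.
echo "job ${SLURM_JOB_ID:-interactive} starting on $(hostname)"
```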

See the man page for more information or type sbatch --help.

scancel - terminate running jobs

The command scancel is used to signal or cancel jobs, job arrays or job steps. A job or job step can only be signaled by the owner of that job or root. If an attempt is made by an unauthorized user to signal a job or job step, an error message will be printed and the job will not be terminated.

login01:~$ scancel --user <username>

Jobs can generally be cancelled using the job's name and/or its SLURM job ID.

login01:~$ scancel --name "test_job"
login01:~$ scancel 666

scancel can also be used to cancel all your jobs matching a given attribute, e.g. state or partition:

login01:~$ scancel --state PENDING --user <username>

An arbitrary number of jobs or job steps may be signaled using job specification filters or a space separated list of specific job and/or job step IDs. If the job ID of a job array is specified with an array ID value then only that job array element will be cancelled. If the job ID of a job array is specified without an array ID value then all job array elements will be cancelled. While a heterogeneous job is in a PENDING state, only the entire job can be cancelled rather than its individual components.
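When cancelling many jobs at once, a cautious pattern is to generate the scancel commands first and review them before executing anything. A sketch using a captured list of pending job IDs (as produced by, e.g., squeue -h -t PD -u <username> -o %A; the IDs here are illustrative); piping the generated lines to sh would actually run them:

```shell
# Pending job IDs, as captured from squeue (illustrative values).
pending_ids='16048
15739
15365'

# Emit one scancel command per ID; review the list, then pipe it to sh to execute.
for id in $pending_ids; do
  echo "scancel $id"
done
```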

See the man page for more information or type scancel --help.

Other SLURM commands

seff - job accounting information

This command can be used to produce an efficiency report for jobs that have completed and exited the queue. If you run it while the job is still in the R (running) state, it may report incorrect information.

The seff utility will help you track the CPU/Memory efficiency. The command is invoked as:

login01:~$ seff <jobid>

Jobs with different CPU/Memory efficiency
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 32
  CPU Utilized: 41-01:38:14
  CPU Efficiency: 99.64% of 41-05:09:44 core-walltime
  Job Wall-clock time: 1-11:19:38
  Memory Utilized: 2.73 GB
  Memory Efficiency: 2.13% of 128.00 GB
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 16
  CPU Utilized: 14:24:49
  CPU Efficiency: 23.72% of 2-12:46:24 core-walltime
  Job Wall-clock time: 03:47:54
  Memory Utilized: 193.04 GB
  Memory Efficiency: 75.41% of 256.00 GB
login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 87-16:58:22
  CPU Efficiency: 86.58% of 101-07:16:16 core-walltime
  Job Wall-clock time: 1-13:59:19
  Memory Utilized: 212.39 GB
  Memory Efficiency: 82.96% of 256.00 GB

The following report illustrates a job with very poor CPU and memory efficiency (both below 4%): the user mobilized a full 64-core node but left it almost entirely idle, essentially wasting roughly four core-hours of computation.

login01:~$ seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 00:08:33
  CPU Efficiency: 3.55% of 04:00:48 core-walltime
  Job Wall-clock time: 00:08:36
  Memory Utilized: 55.84 MB
  Memory Efficiency: 0.05% of 112.00 GB
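The percentages seff prints can be reproduced by hand: CPU efficiency is CPU Utilized divided by the core-walltime, with both durations converted to seconds. A sketch using the figures from the first report above (the to_seconds helper is ours, not part of Slurm):

```shell
# Convert a Slurm duration ([DD-]HH:MM:SS or MM:SS) to seconds.
to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4   # DD-HH:MM:SS
    else if (NF == 3) print $1*3600 + $2*60 + $3              # HH:MM:SS
    else              print $1*60 + $2                        # MM:SS
  }'
}

used=$(to_seconds 41-01:38:14)       # CPU Utilized from the first report
corewall=$(to_seconds 41-05:09:44)   # core-walltime from the same report
awk -v u="$used" -v cw="$corewall" 'BEGIN { printf "%.2f%%\n", 100*u/cw }'
# prints 99.64%, matching the reported CPU Efficiency
```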

sacct - job accounting information

The sacct command displays status information about a user's historical jobs, selected by username and/or SLURM job ID. By default sacct only reports the user's jobs from the current day. With the --starttime flag the command looks further back, to the given date:

login01:~$ sacct --user=<username> --starttime=YYYY-MM-DD

The --format flag can be used to choose the command output (full list of variables can be found with the --helpformat flag):

login01:~$ sacct --user=<username> --starttime=YYYY-MM-DD --jobs=<job-id> --format=var_1,var_2, ...
sacct format variable names

  Variable      Description
  Account       The account the job ran under.
  AveCPU        Average (system + user) CPU time of all tasks in the job.
  AveRSS        Average resident set size of all tasks in the job.
  AveVMSize     Average virtual memory size of all tasks in the job.
  CPUTime       Formatted (Elapsed time * CPU count) used by a job or step.
  Elapsed       The job's elapsed time, formatted as DD-HH:MM:SS.
  ExitCode      The exit code returned by the job script or salloc.
  JobID         The ID of the job.
  JobName       The name of the job.
  MaxRSS        Maximum resident set size of all tasks in the job.
  MaxVMSize     Maximum virtual memory size of all tasks in the job.
  MaxDiskRead   Maximum number of bytes read by all tasks in the job.
  MaxDiskWrite  Maximum number of bytes written by all tasks in the job.
  ReqCPUS       Requested number of CPUs.
  ReqMem        Requested amount of memory.
  ReqNodes      Requested number of nodes.
  NCPUS         The number of CPUs used in the job.
  NNodes        The number of nodes used in the job.
  User          The username of the person who ran the job.
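For post-processing, the --parsable2 flag prints pipe-delimited fields without trailing delimiters. The sketch below flags job steps whose MaxRSS exceeds 1 GiB, using a captured sample (the job IDs, names and sizes are illustrative; in this sample MaxRSS carries sacct's K suffix for KiB):

```shell
# Sample of `sacct --parsable2 --format=JobID,JobName,Elapsed,MaxRSS` output.
sacct_sample='JobID|JobName|Elapsed|MaxRSS
1001|prep|00:02:11|5120K
1002|train|03:47:54|202415104K
1003|post|00:00:40|880K'

# Skip the header row, strip the K suffix, and report steps above 1 GiB (1048576 KiB).
echo "$sacct_sample" | awk -F'|' 'NR > 1 { rss = $4; sub(/K$/, "", rss);
  if (rss + 0 > 1048576) print $1, $2, $4 }'
```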

sprojects - view projects information

This command displays information about projects available to a user and project details, such as available allocations, shared directories and members of the project team.

The sprojects script shows the available Slurm accounts (projects) for the selected user ID. If no user is specified (with -u), the script displays the information for the current user.

Show available accounts for the current user

user1@login01:~$ sprojects 
   The following slurm accounts are available for user user1:

The -a option makes the script display just the allocations (in core-hours or GPU-hours), formatted as SPENT/AWARDED.

Show all available allocations for the current user

login01:~$ sprojects -a 
   |     Project     |     Allocations     |
   | p70-23-t        | CPU:      10/50000  |
   |                 | GPU:       0/12500  |

With the -f option the script displays more details (including available allocations).

Show full info for the current user

login01:~$ sprojects -f 
   |     Project     |       Allocations       |      Shared storages       |    Project users    |
   | p371-23-1       | CPU:    182223/500000   | /home/projects/p371-23-1   | user1               |
   |                 | GPU:       542/1250     | /scratch/p371-23-1         | user2               |
   |                 |                         |                            | user3               |
   | p81-23-t        | CPU:     50006/50000    | /home/projects/p81-23-t    | user1               |
   |                 | GPU:       766/781      | /scratch/p81-23-t          | user2               |
   | p70-23-t        | CPU:    485576/5000000  | /home/projects/p70-23-t    | user1               |
   |                 | GPU:       544/31250    | /scratch/p70-23-t          | user2               |
   |                 |                         |                            | user4               |
   |                 |                         |                            | user5               |
   |                 |                         |                            | user6               |
   |                 |                         |                            | user7               |

sprio - jobs scheduling priority information

Demand for HPC resources typically surpasses supply, so a method which establishes the order in which jobs can run has to be implemented. By default, the scheduler allocates on a simple "first-in, first-out" (FIFO) basis. However, rules and policies can change the priority of a job, which is expressed to the scheduler as a number. The sprio command can be used to view the priorities (and their components) of waiting jobs.

Sorting all waiting jobs by their priority

login01:~$ sprio -S -y
  36386 ncpu            3777          0          1       2679         99       1000
  36387 ncpu            3777          0          0       2679         99       1000
  36339 ncpu            2910          0         25       1786         99       1000
  36388 ncpu            2885          0          0       1786         99       1000
  36389 ncpu            2885          0          0       1786         99       1000
  36390 ncpu            2885          0          0       1786         99       1000

See the slurm documentation page for more information or type sprio --help.

sshare - list shares of associations

This command displays fairshare information based on the hierarchical account structure. Here we use it to determine the fairshare factor used in the job priority calculation. Since the fairshare factor also depends on the account (i.e. the user's project), the account has to be specified too.

In this case we know that our user1 has access to the project called "p70-23-t". We can therefore display the fairshare factor (shown here in the last column) as follows:

login01:~ $ sshare -A p70-23-t 
  Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
  -------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
  p70-23-t                                 1    0.333333   122541631      0.364839            
  p70-23-t               user1             1    0.111111     4798585      0.039159   0.263158 

You can display all project accounts available to you using the sprojects command.

See the slurm documentation for more information or type sshare --help.

salloc - allocate resources and spawn a shell

The salloc command allocates resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node), for later use. After the salloc command is submitted, the terminal blocks until the allocation is granted. The session then still resides on the login node; only commands launched with srun execute on the allocated compute nodes. Tasks sent with srun can run immediately, since the resources are already allocated.

login01~$ hostname
  login01

login01~$ salloc --nodes=1 --ntasks-per-node=4 --mem-per-cpu=2G --time=01:00:00
  salloc: Pending job allocation 63752579
  salloc: job 63752579 queued and waiting for resources
  salloc: job 63752579 has been allocated resources
  salloc: Granted job allocation 63752579

login01~$ hostname
  login01

login01~$ srun hostname

Note that salloc starts the shell on the login node, not on the allocated node.

See the man page for more information or type salloc --help.

sattach - signal and attach to running jobs

The sattach command allows you to connect the standard input, output, and error streams of a running job step to your current terminal session.

login01:~$ sattach 12345.5
   [...output of your job...]
n007:~$ [Ctrl-C]

Press Ctrl-C to detach from the current session. Note that you must give the job ID together with the job step ID. In most cases, simply append .0 to your job ID.

See the man page for more information or type sattach --help.

sbcast - transfer file to local disk on the node

Sometimes it can be beneficial to copy the executable to a local path on the compute nodes allocated to the job, instead of loading it onto the compute nodes from a slow file system such as the home directory.

Users can copy the executable to the compute nodes before the actual computation using the sbcast command or the srun --bcast flag. Making the executable available locally on the compute node, e.g. in /tmp, can speed up job startup compared to running executables from a network file system.

n007:~$ sbcast exe_on_slow_fs /tmp/${USER}_exe_filename
n007:~$ srun /tmp/${USER}_exe_filename

File permissions

Make sure to choose a temporary file name unique to your computation (e.g. include your username with the variable $USER), or you may receive permission denied errors if trying to overwrite someone else's files.

There is no real downside to broadcasting the executable with Slurm, so it can be used with jobs at any scale, especially if you experience timeout errors associated with MPI_Init(). Besides the executable, you can also sbcast other large files, such as input files, shared libraries, etc. It is easier to create a tar file to sbcast and untar it on the compute nodes before the actual srun, instead of sbcasting multiple individual files.
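A sketch of that tar-based workflow is shown below. The bundling half runs anywhere; the broadcast half is guarded with command -v so the snippet can be tried off-cluster, and the directory and file names are illustrative:

```shell
# Bundle the inputs (and executable) into a single archive to broadcast.
mkdir -p demo_job && echo "input data" > demo_job/input.txt
tar czf "${USER:-user}_bundle.tgz" demo_job

# On the cluster (inside a job), broadcast the archive and unpack it on every node.
if command -v sbcast >/dev/null 2>&1; then
  sbcast "${USER:-user}_bundle.tgz" "/tmp/${USER}_bundle.tgz"
  srun tar xzf "/tmp/${USER}_bundle.tgz" -C /tmp
fi
```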

See the man page for more information or type sbcast --help.

sstat - display resources utilized by a job

The sstat command allows users to easily pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as such:

login01:~$ sstat --jobs=<jobid>

Showing information about running job

login01:~$ sstat --jobs=<jobid>
  JobID         MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
  ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
  152295.0          2884M           n143              0   2947336K    253704K       n143          0    253704K       11         n143              0         11   00:06:04       n143          0   00:06:04        1     10.35M       Unknown       Unknown       Unknown              0     29006427            n143               0     29006427     11096661             n143                0     11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+

By default, sstat pulls up significantly more information than is typically needed. To remedy this, we can use the --format flag to choose the output fields. Some of these variables are listed in the table below:

Showing formatted information about running job

login01:~$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
  JobID          NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
  ------------ -------- -------------------- ---------- ---------- ---------- ----------
  152295.0            1                 n143 183574492K 247315988K    118664K    696216K

If you do not run any srun commands, no job steps are created and these metrics will not be available for your job. Your batch scripts should follow this format:

# set environment up
module load ...

# launch job steps
srun <command to run> # that would be step 1
srun <command to run> # that would be step 2

The main metrics you may be interested in are listed below.

  Variable   Description
  avecpu     Average CPU time of all tasks in the job.
  averss     Average resident set size of all tasks in the job.
  avevmsize  Average virtual memory size of all tasks in the job.
  jobid      The ID of the job.
  maxrss     Maximum resident set size of all tasks in the job.
  maxvmsize  Maximum virtual memory size of all tasks in the job.
  ntasks     Number of tasks in the job.

A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the slurm documentation on sstat.

scontrol - administrative tool

The scontrol command can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration. It can also be used by system administrators to make configuration changes. A couple of examples are shown below.

Long partition information
login01:~$ scontrol show partition long
    PartitionName=long
    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
    AllocNodes=ALL Default=NO QoS=N/A
    DefaultTime=4-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
    MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
    PriorityJobFactor=0 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
    OverTimeLimit=NONE PreemptMode=OFF
    State=UP TotalCPUs=8960 TotalNodes=140 SelectTypeParameters=NONE
    DefMemPerCPU=4000 MaxMemPerNode=UNLIMITED
Node information
login01:~$ scontrol show node n148
  NodeName=n148 Arch=x86_64 CoresPerSocket=32 
    CPUAlloc=1 CPUEfctv=64 CPUTot=64 CPULoad=1.04
    NodeAddr=n148 NodeHostName=n148 Version=22.05.7
    OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 
    RealMemory=256000 AllocMem=64000 FreeMem=67242 Sockets=2 Boards=1
    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    BootTime=2023-09-06T10:29:48 SlurmdStartTime=2023-09-18T14:25:33
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

See the man page for more information or type scontrol --help.
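Because scontrol prints space-separated Key=Value pairs, individual fields are easy to extract in the shell. A sketch on a captured sample of the node output above; on the cluster you would pipe the real scontrol show node output through the same filter:

```shell
# Extract a single field from scontrol's Key=Value output (captured sample shown;
# on the cluster: scontrol show node n148 piped through the same tr/awk filter).
node_info='NodeName=n148 Arch=x86_64 CoresPerSocket=32
RealMemory=256000 AllocMem=64000 FreeMem=67242 Sockets=2'

# Split pairs onto their own lines, then match the wanted key.
echo "$node_info" | tr ' ' '\n' | awk -F= '$1 == "FreeMem" { print $2 }'
```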