Job states

Job Status and Reason Codes

The squeue command reports the status of active jobs using state and reason codes. Job state codes describe a job’s current state in the queue (e.g. pending, completed). Job reason codes describe why the job is in its current state.

The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.

Job State Codes

Status         Code  Explanation
CANCELLED      CA    The job was explicitly cancelled by the user or system administrator.
COMPLETED      CD    The job has completed successfully.
COMPLETING     CG    The job is finishing but some processes are still active.
DEADLINE       DL    The job terminated on deadline.
FAILED         F     The job terminated with a non-zero exit code or another failure condition.
NODE_FAIL      NF    The job terminated due to failure of one or more allocated nodes.
OUT_OF_MEMORY  OOM   The job experienced an out-of-memory error.
PENDING        PD    The job is waiting for resource allocation. It will eventually run.
PREEMPTED      PR    The job was terminated because of preemption by another job.
RUNNING        R     The job is currently allocated to a node and is running.
SUSPENDED      S     A running job has been stopped with its cores released to other jobs.
STOPPED        ST    A running job has been stopped with its cores retained.
TIMEOUT        TO    The job terminated upon reaching its time limit.

A full list of these job state codes can be found in the squeue documentation or the sacct documentation.
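To get a quick overview of how many of your jobs are in each state, the output of squeue can be piped through a small helper. The following is only a sketch: count_states is a hypothetical helper name, and the sample input is made up so that the snippet runs without Slurm. On a login node, the squeue options -h (no header), -u (user) and -o "%T" (full state name) would supply the input.

```shell
#!/bin/bash
# count_states (hypothetical helper): reads one state name per line on stdin
# and prints "<STATE> <count>", sorted by state name.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a login node this would typically be:
#   squeue -h -u "$USER" -o "%T" | count_states
# Made-up states stand in for real squeue output here.
printf 'PENDING\nRUNNING\nPENDING\nCOMPLETING\n' | count_states
```

With the sample input above, the helper reports one COMPLETING job, two PENDING jobs and one RUNNING job.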

Job Reason Codes

Reason Code              Explanation
Priority                 One or more higher priority jobs are in the queue. Your job will eventually run.
Dependency               This job is waiting for a dependent job to complete and will run afterwards.
Resources                The job is waiting for resources to become available and will eventually run.
InvalidAccount           The job’s account is invalid. Cancel the job and rerun with the correct account.
InvalidQOS               The job’s QoS is invalid. Cancel the job and rerun with the correct QoS.
QOSGrpCpuLimit           All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
QOSGrpMaxJobsLimit       Maximum number of jobs for your job’s QoS has been met; job will run eventually.
QOSGrpNodeLimit          All nodes assigned to your job’s specified QoS are in use; job will run eventually.
PartitionCpuLimit        All CPUs assigned to your job’s specified partition are in use; job will run eventually.
PartitionMaxJobsLimit    Maximum number of jobs for your job’s partition has been met; job will run eventually.
PartitionNodeLimit       All nodes assigned to your job’s specified partition are in use; job will run eventually.
AssociationCpuLimit      All CPUs assigned to your job’s specified association are in use; job will run eventually.
AssociationMaxJobsLimit  Maximum number of jobs for your job’s association has been met; job will run eventually.
AssociationNodeLimit     All nodes assigned to your job’s specified association are in use; job will run eventually.

A full list of these job reason codes can be found in Slurm’s documentation.
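Most reason codes in the table only mean that the job is waiting, but the two invalid account and QoS codes require you to cancel and resubmit. The sketch below filters for those. The helper name jobs_needing_action is hypothetical, the job IDs are made up, and the reason names are assumed to be spelled InvalidAccount and InvalidQOS as in Slurm's own reason list. On a login node, squeue with -t PENDING and -o "%i %r" (job ID and reason) would supply the input.

```shell
#!/bin/bash
# jobs_needing_action (hypothetical helper): reads "<jobid> <reason>" lines
# on stdin and prints the IDs of jobs that will not start without user action.
jobs_needing_action() {
    awk '$2 == "InvalidAccount" || $2 == "InvalidQOS" { print $1 }'
}

# On a login node this would typically be:
#   squeue -h -u "$USER" -t PENDING -o "%i %r" | jobs_needing_action
# Made-up job IDs and reasons stand in for real squeue output here.
printf '101 Priority\n102 InvalidAccount\n103 Resources\n104 InvalidQOS\n' \
    | jobs_needing_action
```

With the sample input, only jobs 102 and 104 are listed; the other two are simply waiting their turn.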

Running Job Statistics Metrics

The sstat command lets users pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as follows:

login01:~$ sstat --jobs=<jobid>

Showing information about running job

login01:~$ sstat --jobs=<jobid>
  JobID         MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
  ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
  152295.0          2884M           n143              0   2947336K    253704K       n143          0    253704K       11         n143              0         11   00:06:04       n143          0   00:06:04        1     10.35M       Unknown       Unknown       Unknown              0     29006427            n143               0     29006427     11096661             n143                0     11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+

By default, sstat prints significantly more information than is usually needed. To remedy this, we can use the --format flag to choose which fields appear in the output. Some of the available fields are listed in the table further below.

Showing formatted information about running job

login01:~$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
  JobID          NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
  ------------ -------- -------------------- ---------- ---------- ---------- ----------
  152295.0            1                 n143 183574492K 247315988K    118664K    696216K
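The memory columns in the outputs above carry a unit suffix (for example 253704K or 2884M), which makes them awkward to compare directly. The following sketch converts such values to MiB. The helper name to_mib is hypothetical, and the conversion assumes that the suffixes are 1024-based and that a value without a suffix is in bytes.

```shell
#!/bin/bash
# to_mib (hypothetical helper): convert a memory value with a K, M, G or T
# suffix to MiB, printed with one decimal place.
# Assumption: units are 1024-based; a bare number is treated as bytes.
to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0     # awk keeps the leading digits and ignores the suffix
        if      (unit == "K") num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit == "T") num *= 1024 * 1024
        else if (unit != "M") num /= 1024 * 1024
        printf "%.1f\n", num
    }'
}

# Values taken from the MaxRSS and MaxVMSize columns of the first listing.
to_mib 253704K   # prints 247.8
to_mib 2884M     # prints 2884.0
```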

If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:

#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...

# launch job steps
srun <command to run> # first job step (<jobid>.0)
srun <command to run> # second job step (<jobid>.1)

The main metrics you may be interested in reviewing are listed below.

Variable   Description
avecpu     Average CPU time of all tasks in a job.
averss     Average resident set size of all tasks in a job.
avevmsize  Average virtual memory of all tasks in a job.
jobid      The ID of the job.
maxrss     Maximum resident set size of all tasks in a job.
maxvmsize  Maximum virtual memory size of all tasks in a job.
ntasks     Number of tasks in a job.

A full list of the fields available in sstat can be printed with the --helpformat flag or found in the Slurm documentation on sstat.

Past Job Statistics Metrics

The sacct command allows users to pull up status information about past jobs. This command is very similar to sstat, but it is used on jobs that have previously run on the system instead of currently running jobs.

login01:~$ sacct [-X] --jobs=<jobid> [--format=metric1,...]
# OR, for a user, optionally between a start and an end date
login01:~$ sacct [-X] -u $USER  [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...]
# OR, for an account
login01:~$ sacct [-X] -A <account> [--format=metric1,...]

Use -X to show only the statistics relevant to the job allocation itself, not taking job steps into consideration.
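A common use of sacct is to find past jobs that did not finish cleanly. The sketch below assumes pipe-delimited input as produced by the sacct options -n (no header) and -P (parsable output), with the fields JobID, State and ExitCode. The helper name unsuccessful_jobs, the job IDs and the start date are made up for illustration.

```shell
#!/bin/bash
# unsuccessful_jobs (hypothetical helper): reads "JobID|State|ExitCode" lines
# on stdin and prints the job ID and state of every job that is neither
# completed nor still pending or running.
unsuccessful_jobs() {
    awk -F'|' '$2 !~ /^(COMPLETED|RUNNING|PENDING)/ { print $1, $2 }'
}

# On a login node this would typically be (the date is only an example):
#   sacct -X -n -P -u "$USER" -S 2024-01-01 --format=JobID,State,ExitCode \
#       | unsuccessful_jobs
# Made-up records stand in for real sacct output here.
printf '152295|COMPLETED|0:0\n152296|FAILED|1:0\n152297|TIMEOUT|0:0\n' \
    | unsuccessful_jobs
```

With the sample records, jobs 152296 (FAILED) and 152297 (TIMEOUT) are reported. The -X option keeps the list to one line per job rather than one per job step.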

The main metrics you may be interested in reviewing are listed below.

Variable      Description
account       Account the job ran under.
avecpu        Average CPU time of all tasks in the job.
averss        Average resident set size of all tasks in the job.
cputime       Formatted (elapsed time * CPU count) used by a job or step.
elapsed       The job’s elapsed time, formatted as DD-HH:MM:SS.
exitcode      The exit code returned by the job script or salloc.
jobid         The ID of the job.
jobname       The name of the job.
maxdiskread   Maximum number of bytes read by all tasks in the job.
maxdiskwrite  Maximum number of bytes written by all tasks in the job.
maxrss        Maximum resident set size of all tasks in the job.
ncpus         Number of allocated CPUs.
nnodes        The number of nodes used in a job.
ntasks        Number of tasks in a job.
priority      Slurm priority.
qos           Quality of service.
reqcpus       Required number of CPUs.
reqmem        Required amount of memory for a job.
reqtres       Required Trackable RESources (TRES).
user          Username.

A full list of the fields available in sacct can be printed with the --helpformat flag or found in the Slurm documentation on sacct.
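The elapsed and cputime fields are related: CPU time is the elapsed time multiplied by the number of CPUs. To do such arithmetic yourself, the formatted elapsed time first has to be converted to seconds. The following is a minimal sketch; elapsed_to_seconds is a hypothetical helper name, it assumes the input has the form [DD-]HH:MM:SS or MM:SS, and the CPU count of 4 is made up.

```shell
#!/bin/bash
# elapsed_to_seconds (hypothetical helper): convert an elapsed time of the
# form [DD-]HH:MM:SS (or MM:SS) to a number of seconds.
elapsed_to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3) secs = p[1] * 3600 + p[2] * 60 + p[3]
        else        secs = p[1] * 60 + p[2]
        print days * 86400 + secs
    }'
}

elapsed_to_seconds 00:06:04      # prints 364
elapsed_to_seconds 1-02:03:04    # prints 93784

# CPU time in seconds for a job that ran 00:06:04 on 4 CPUs (made-up count).
echo $(( $(elapsed_to_seconds 00:06:04) * 4 ))   # prints 1456
```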

Created by: Marek Štekláč