Job states
Job Status and Reason Codes
The squeue command reports the status of active jobs using state and reason codes. Job state codes describe a job’s current state in the queue (e.g. pending, completed). Job reason codes describe why the job is in its current state.
The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.
Job State Codes
Status | Code | Explanation |
---|---|---|
CANCELLED | CA | The job was explicitly cancelled by the user or a system administrator. |
COMPLETED | CD | The job has completed successfully. |
COMPLETING | CG | The job is finishing but some processes are still active. |
DEADLINE | DL | The job terminated on its deadline. |
FAILED | F | The job terminated with a non-zero exit code or other failure condition. |
NODE_FAIL | NF | The job terminated due to the failure of one or more allocated nodes. |
OUT_OF_MEMORY | OOM | The job experienced an out-of-memory error. |
PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR | The job was terminated because of preemption by another job. |
RUNNING | R | The job is currently allocated to a node and is running. |
SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
STOPPED | ST | A running job has been stopped with its cores retained. |
TIMEOUT | TO | The job terminated upon reaching its time limit. |
A full list of job state codes can be found in the squeue documentation or the sacct documentation.
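As a quick illustration of working with state codes, the sketch below counts your jobs per state. The helper name count_states is made up for this example; the squeue options used are -h (no header), -u (user) and -o "%T" (long state name). Treat it as a starting point rather than a site-recommended command.

```shell
# Hypothetical helper: count jobs per state.
# Reads one state name per line on stdin and prints STATE=count.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Only query the scheduler when squeue is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line; %T prints the long state name (e.g. PENDING).
    squeue -h -u "${USER:-$(id -un)}" -o "%T" | count_states
fi
```

Using "%t" instead of "%T" would print the compact codes (PD, R, ...) from the table above.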
Job Reason Codes
Reason Code | Explanation |
---|---|
Priority | One or more higher-priority jobs are in the queue. Your job will eventually run. |
Dependency | This job is waiting for a job it depends on to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job’s account is invalid. Cancel the job and resubmit with the correct account. |
InvalidQOS | The job’s QoS is invalid. Cancel the job and resubmit with the correct QoS. |
QOSGrpCpuLimit | All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS has been reached; job will run eventually. |
QOSGrpNodeLimit | All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
PartitionMaxJobsLimit | The maximum number of jobs for your job’s partition has been reached; job will run eventually. |
PartitionNodeLimit | All nodes assigned to your job’s specified partition are in use; job will run eventually. |
AssociationCpuLimit | All CPUs assigned to your job’s specified association are in use; job will run eventually. |
AssociationMaxJobsLimit | The maximum number of jobs for your job’s association has been reached; job will run eventually. |
AssociationNodeLimit | All nodes assigned to your job’s specified association are in use; job will run eventually. |
A full list of these Job Reason Codes can be found in Slurm’s documentation.
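The reasons listed above fall into two groups: those where the job simply waits, and those where you need to cancel and resubmit. A minimal sketch of that distinction follows. The function classify_reason and its three labels are invented for this example, and InvalidQOS is written the way Slurm prints it. Reasons not covered in the table are reported as unknown rather than guessed.

```shell
# Hypothetical helper: classify a pending reason using only the table above.
classify_reason() {
    case "$1" in
        InvalidAccount|InvalidQOS)
            echo "action-needed" ;;   # cancel and resubmit with valid settings
        Priority|Dependency|Resources|QOSGrp*Limit|Partition*Limit|Association*Limit)
            echo "waiting" ;;         # the job will run eventually
        *)
            echo "unknown" ;;         # not covered here; check the Slurm docs
    esac
}

# Only query the scheduler when squeue is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    # -t PENDING restricts to pending jobs; %i is the job ID, %r the reason.
    squeue -h -u "${USER:-$(id -un)}" -t PENDING -o "%i %r" |
        while read -r id reason; do
            echo "$id $reason $(classify_reason "$reason")"
        done
fi
```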
Running Job Statistics Metrics
The sstat command lets users pull up status information about their currently running jobs, including CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). The sstat command can be invoked as follows:
login01:~$ sstat --jobs=<jobid>
Showing information about running job
login01:~$ sstat --jobs=<jobid>
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
152295.0 2884M n143 0 2947336K 253704K n143 0 253704K 11 n143 0 11 00:06:04 n143 0 00:06:04 1 10.35M Unknown Unknown Unknown 0 29006427 n143 0 29006427 11096661 n143 0 11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+ fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+ fs/disk=0 energy=0,fs/di+
By default, sstat prints far more information than is usually needed. To narrow the output, use the --format flag to select the fields you want, as in the example below. Some of the available variables are listed in the table later in this section.
Showing formatted information about running job
login01:~$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
JobID NTasks Nodelist MaxRSS MaxVMSize AveRSS AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
152295.0 1 n143 183574492K 247315988K 118664K 696216K
If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:
#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...
# launch job steps
srun <command to run> # that would be step 0
srun <command to run> # that would be step 1
The main metrics you may be interested in reviewing are listed below.
Variable | Description |
---|---|
avecpu | Average CPU time of all tasks in the job. |
averss | Average resident set size of all tasks in the job. |
avevmsize | Average virtual memory size of all tasks in the job. |
jobid | The ID of the job. |
maxrss | Maximum resident set size of all tasks in the job. |
maxvmsize | Maximum virtual memory size of all tasks in the job. |
ntasks | Number of tasks in the job. |
A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the Slurm documentation on sstat.
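Memory fields such as MaxRSS are printed with a unit suffix (K and M both appear in the example output above), which makes values hard to compare at a glance. The sketch below normalizes them to MiB. The helper to_mib is made up for this example and assumes the suffixes are powers of 1024; the job ID is the one from the example output and should be replaced with your own. The sstat options used are -n (no header), -P (pipe-delimited output), --format and -j.

```shell
# Hypothetical helper: convert a K/M/G-suffixed value to MiB (one decimal).
# Assumption: the suffixes denote powers of 1024.
to_mib() {
    case "$1" in
        *K) awk -v v="${1%K}" 'BEGIN { printf "%.1f\n", v / 1024 }' ;;
        *M) awk -v v="${1%M}" 'BEGIN { printf "%.1f\n", v }' ;;
        *G) awk -v v="${1%G}" 'BEGIN { printf "%.1f\n", v * 1024 }' ;;
        *)  echo "unknown unit: $1" >&2; return 1 ;;
    esac
}

# Job ID taken from the example output; replace with one of your running jobs.
jobid=152295

# Only query the scheduler when sstat is available on this machine.
if command -v sstat >/dev/null 2>&1; then
    sstat -n -P --format=JobID,MaxRSS,AveRSS -j "$jobid" |
        while IFS='|' read -r step maxrss averss; do
            echo "$step MaxRSS=$(to_mib "$maxrss")MiB AveRSS=$(to_mib "$averss")MiB"
        done
fi
```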
Past Job Statistics Metrics
The sacct command lets users pull up status information about past jobs. It is very similar to sstat, but is used on jobs that have previously run on the system rather than on currently running jobs.
login01:~$ sacct [-X] --jobs=<jobid> [--format=metric1,...]
# OR, for a user, optionally between a start and end date
login01:~$ sacct [-X] -u $USER [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...]
# OR, for an account
login01:~$ sacct [-X] -A <account> [--format=metric1,...]
Use -X to show only the statistics relevant to the job allocation itself, without taking job steps into consideration.
The main metrics you may be interested in reviewing are listed below.
Variable | Description |
---|---|
account | Account the job ran under. |
avecpu | Average CPU time of all tasks in the job. |
averss | Average resident set size of all tasks in the job. |
cputime | Elapsed time multiplied by the number of CPUs used by a job or step, formatted as a time. |
elapsed | The job’s elapsed time, formatted as DD-HH:MM:SS. |
exitcode | The exit code returned by the job script or salloc. |
jobid | The ID of the job. |
jobname | The name of the job. |
maxdiskread | Maximum number of bytes read by all tasks in the job. |
maxdiskwrite | Maximum number of bytes written by all tasks in the job. |
maxrss | Maximum resident set size of all tasks in the job. |
ncpus | Number of allocated CPUs. |
nnodes | Number of nodes used by the job. |
ntasks | Number of tasks in the job. |
priority | Slurm priority. |
qos | Quality of service. |
reqcpus | Requested number of CPUs. |
reqmem | Requested amount of memory for the job. |
reqtres | Requested Trackable RESources (TRES). |
user | Username. |
A full list of variables that specify data handled by sacct can be found with the --helpformat flag or by visiting the Slurm documentation on sacct.
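The elapsed field is convenient to read but awkward to compare or sum. As a rough sketch, the helper below converts a DD-HH:MM:SS or HH:MM:SS value to seconds. The name elapsed_to_seconds is invented for this example, and the job ID placeholder must be replaced with a real one. The sacct options used are -X, -n (no header), -P (pipe-delimited output), --format and -j.

```shell
# Hypothetical helper: convert [DD-]HH:MM:SS (or MM:SS) to seconds.
elapsed_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        s = 0
        if (NF == 4)      { s = $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        else if (NF == 3) { s = $1 * 3600 + $2 * 60 + $3 }
        else if (NF == 2) { s = $1 * 60 + $2 }
        print s
    }'
}

# Placeholder job ID; replace with one of your past jobs.
jobid="<jobid>"

# Only query the accounting database when sacct is available and a real ID is set.
if command -v sacct >/dev/null 2>&1 && [ "$jobid" != "<jobid>" ]; then
    sacct -X -n -P --format=JobID,Elapsed -j "$jobid" |
        while IFS='|' read -r id elapsed; do
            echo "$id ran for $(elapsed_to_seconds "$elapsed") seconds"
        done
fi
```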