Job States
Job Status and Reason Codes
The squeue command provides detailed information about an active job’s status, including job state codes and job reason codes.
- Job state codes describe a job’s current state in the queue (e.g., pending, completed, running).
- Job reason codes explain why a job is in a specific state.
The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.
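As a rough illustration, the state and reason columns can be requested explicitly through squeue's `-o` (format) option. This is a minimal sketch that assumes `squeue` is on the PATH of a Slurm login node; the column widths and the `state_reason_format` variable name are arbitrary choices for this example, not site defaults.

```shell
#!/bin/bash
# Hypothetical format string: job ID, partition, job name, compact state code,
# full state name, elapsed time, and the reason code.
state_reason_format='%.10i %.9P %.20j %.2t %.12T %.10M %r'

# Only call squeue where it is actually installed (i.e. on the cluster).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}" -o "$state_reason_format"
else
    echo "squeue not found; run this on a Slurm login node" >&2
fi
```

Here `%t` prints the compact state code, `%T` the full state name, and `%r` the reason code.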
Job State Codes
Status | Code | Explanation |
---|---|---|
CANCELLED | CA | The job was explicitly cancelled by the user or a system administrator. |
COMPLETED | CD | The job has completed successfully. |
COMPLETING | CG | The job is finishing, but some processes are still active. |
DEADLINE | DL | The job was terminated because it reached its deadline. |
FAILED | F | The job terminated with a non-zero exit code or other failure condition. |
NODE_FAIL | NF | The job terminated due to the failure of one or more allocated nodes. |
OUT_OF_MEMORY | OOM | The job experienced an out-of-memory error. |
PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR | The job was terminated because of preemption by another job. |
RUNNING | R | The job is currently allocated to a node and is running. |
SUSPENDED | S | A running job has been stopped, with its cores released to other jobs. |
STOPPED | ST | A running job has been stopped, with its cores retained. |
TIMEOUT | TO | The job was terminated upon reaching its time limit. |
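When scripting around squeue output, it can help to group the compact codes from the table above. The following is a minimal sketch; the function name `classify_state` and the five group labels are this example's own invention, not Slurm terminology.

```shell
#!/bin/bash
# Map a compact state code from the table above to a coarse group.
# The grouping is an assumption made for illustration only.
classify_state() {
    case "$1" in
        PD)                     echo "pending"    ;;
        R|CG)                   echo "running"    ;;
        S|ST)                   echo "paused"     ;;
        CD)                     echo "completed"  ;;
        CA|DL|F|NF|OOM|PR|TO)   echo "terminated" ;;
        *)                      echo "unknown"    ;;
    esac
}

classify_state PD     # prints: pending
classify_state OOM    # prints: terminated
```

Codes that are not listed in the table fall through to `unknown` rather than being guessed at.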
For more information, see the squeue documentation or type squeue --help, and/or see the sacct documentation or type sacct --help.
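Once a job has left the queue, sacct can report its final state and exit code. The sketch below assumes job accounting is enabled on the cluster; the job ID `12345` is a placeholder and `split_exit_code` is a hypothetical helper. sacct prints its ExitCode field as `exit code:signal`, which the helper splits apart.

```shell
#!/bin/bash
# Split a sacct ExitCode value such as "1:0" into its two parts:
# the exit code before the colon and the signal number after it.
split_exit_code() {
    local exit_code="${1%%:*}"
    local signal="${1##*:}"
    echo "exit=${exit_code} signal=${signal}"
}

split_exit_code "1:0"    # prints: exit=1 signal=0

# On a cluster, the value would come from sacct. 12345 is a placeholder job ID.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 12345 --format=JobID,JobName,State,ExitCode
fi
```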
Job debugging
If the job terminated with a non-zero exit code, it can help to add the -x flag to the shebang line of the job script. With -x, bash prints each command before running it, which makes it easier to see where the script failed.
#!/bin/bash -x
#SBATCH ...
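To see what the trace looks like, the following minimal sketch runs a one-line script under `bash -x` outside of Slurm and captures only the trace. The `trace` variable name is arbitrary. In a batch job, the same lines would typically appear in the job's output file alongside the script's normal output.

```shell
#!/bin/bash
# Run a one-line script under `bash -x` and keep only the trace,
# which bash writes to stderr (the script's own stdout is discarded).
trace=$(bash -x -c 'echo hello' 2>&1 >/dev/null)
printf '%s\n' "$trace"    # prints: + echo hello
```

Each traced command is prefixed with `+`, so trace lines are easy to tell apart from regular output.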
Job Reason Codes
Reason Code | Explanation |
---|---|
Priority | One or more higher-priority jobs are ahead of this job in the queue. Your job will eventually run. |
Dependency | This job is waiting for a job it depends on to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job’s account is invalid. Cancel the job and resubmit it with the correct account. |
InvalidQOS | The job’s QoS is invalid. Cancel the job and resubmit it with the correct QoS. |
QOSGrpCpuLimit | All CPUs assigned to your job’s specified QoS are in use; the job will run eventually. |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS has been reached; the job will run eventually. |
QOSGrpNodeLimit | All nodes assigned to your job’s specified QoS are in use; the job will run eventually. |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use; the job will run eventually. |
PartitionMaxJobsLimit | The maximum number of jobs for your job’s partition has been reached; the job will run eventually. |
PartitionNodeLimit | All nodes assigned to your job’s specified partition are in use; the job will run eventually. |
AssociationCpuLimit | All CPUs assigned to your job’s specified association are in use; the job will run eventually. |
AssociationMaxJobsLimit | The maximum number of jobs for your job’s association has been reached; the job will run eventually. |
AssociationNodeLimit | All nodes assigned to your job’s specified association are in use; the job will run eventually. |
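The table above separates reasons that resolve on their own from the two that need user action. A minimal sketch of that distinction follows; `reason_action` and its return labels are hypothetical names for this example, and `InvalidQOS` is written the way Slurm spells that reason.

```shell
#!/bin/bash
# Decide whether a pending reason calls for user action, following the
# explanations in the table above. Unlisted reasons are reported as unknown.
reason_action() {
    case "$1" in
        InvalidAccount|InvalidQOS)
            echo "fix-and-resubmit" ;;
        Priority|Dependency|Resources)
            echo "wait" ;;
        QOSGrpCpuLimit|QOSGrpMaxJobsLimit|QOSGrpNodeLimit)
            echo "wait" ;;
        PartitionCpuLimit|PartitionMaxJobsLimit|PartitionNodeLimit)
            echo "wait" ;;
        AssociationCpuLimit|AssociationMaxJobsLimit|AssociationNodeLimit)
            echo "wait" ;;
        *)
            echo "unknown" ;;
    esac
}

reason_action Resources        # prints: wait
reason_action InvalidAccount   # prints: fix-and-resubmit
```

The reason names are matched explicitly rather than with wildcards, so a reason that is not in the table is never silently treated as "wait".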
For a full list of job state and reason codes, refer to the Slurm documentation.