SLURM overview

SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Useful commands

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.
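For example, to read the full manual page for sbatch or print a quick summary of its options:

    man sbatch
    sbatch --help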

The most used commands

  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

  • srun is used to submit a job for execution or initiate job steps in real time, with a wide variety of options to specify resource requirements (processor count, memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job’s node allocation.

  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks; a complete example follows this list.

  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
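
As a sketch of how these commands fit together, the following submits and monitors a small batch job. The partition name, resource counts, and time limit are placeholders; the values accepted on a given cluster are site-specific.

    #!/bin/bash
    #SBATCH --job-name=test_job
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=00:10:00
    #SBATCH --partition=standard   # "standard" is a placeholder partition name

    # Launch the parallel step on the allocated resources
    srun hostname

Saved as job.sh (an example file name), the script would be submitted and tracked with:

    sbatch job.sh       # submit the script; prints the assigned job ID
    squeue -u $USER     # show your pending and running jobs
    scancel 123456      # cancel the job with the example ID 123456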

Other commands

  • seff is used to obtain an efficiency report (CPU and memory utilization) for jobs that have completed and left the queue; see the example after this list.

  • sacct is used to report job or job step accounting information about active or completed jobs.

  • sprojects is used to view the user's projects and their remaining CPU and GPU allocations.

  • sprio is used to display a detailed view of the components affecting a job's priority.

  • sshare displays detailed information about fairshare usage on the cluster. Note that this is only viable when using the priority/multifactor plugin.

  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

  • sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.

  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.

  • sstat is used to get information about the resources utilized by a running job or job step.

  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
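
Below is a brief sketch of how the inspection and accounting commands above might be used once a job ID is known (123456 is an arbitrary example ID):

    sstat -j 123456                                                # resources used by a running job
    sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State    # accounting record of the job
    seff 123456                                                    # efficiency summary after the job completes
    scontrol show job 123456                                       # detailed scheduler view of the job

For interactive work, salloc can be used in the same way, for example salloc --ntasks=4 --time=00:30:00 to obtain an allocation and spawn a shell, followed by srun commands inside that shell.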