SLURM: Cluster Management and Job Scheduling System¶
SLURM (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, and highly scalable job scheduling system designed for managing compute resources in both small and large Linux clusters. SLURM operates without requiring kernel modifications, making it relatively self-contained.
It has three core functions:
- Resource Allocation: SLURM allocates exclusive and/or non-exclusive access to compute nodes for a defined period, allowing users to perform work.
- Job Execution and Monitoring: It provides a framework to start, execute, and monitor jobs (often parallel jobs) on the allocated nodes.
- Resource Arbitration: SLURM arbitrates contention for cluster resources by managing a queue of pending jobs.
Useful Commands¶
SLURM provides a rich set of commands for managing resources and jobs. You can access the man pages for more detailed documentation or use the --help
flag with any command for a brief summary of options.
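For example, to read the full manual page for sbatch or get a quick summary of squeue options:

```bash
man sbatch        # full manual page for the sbatch command
squeue --help     # brief summary of squeue options
```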
Most Used Commands¶
- sinfo: Displays the state of partitions and nodes managed by SLURM. It includes filtering, sorting, and formatting options.
- squeue: Shows the status of jobs or job steps. The default output shows running jobs in priority order, followed by pending jobs.
- srun: Submits a job or starts a job step in real time. It allows specifying resource requirements such as processor count, memory, and disk space, and supports both sequential and parallel job steps.
- sbatch: Submits a job script for later execution. The script typically contains one or more srun commands to launch parallel tasks; a sample script follows this list.
- scancel: Cancels a pending or running job or job step. It can also send arbitrary signals to the processes associated with a job.
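The sketch below shows a minimal batch script and the commands used to submit, monitor, and cancel it. The partition name, resource sizes, program name (./my_program), and job ID are placeholders; adjust them to your cluster and workload.

```bash
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue output
#SBATCH --partition=batch         # placeholder partition name
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=4                # number of tasks (processes)
#SBATCH --cpus-per-task=1         # CPU cores per task
#SBATCH --mem=4G                  # memory per node
#SBATCH --time=00:30:00           # wall-clock limit (HH:MM:SS)

# srun launches the parallel tasks inside the allocation
srun ./my_program
```

A typical workflow from the command line then looks like this:

```bash
sinfo                 # check the state of partitions and nodes
sbatch my_job.sh      # submit the script above; prints the job ID
squeue -u $USER       # show your pending and running jobs
scancel 123456        # cancel the job with ID 123456
```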
Additional Commands¶
- seff: Generates a job efficiency report for jobs that have completed and exited the queue (example invocations of several of these commands follow this list).
- sacct: Reports job or job step accounting information for active or completed jobs.
- sprojects: Displays information about user projects and the remaining CPU and GPU allocation.
- sprio: Provides a detailed view of the factors affecting a job's priority.
- sshare: Displays information about fairshare usage in the cluster. Note: this is only applicable when using the priority/multifactor plugin.
- salloc: Allocates resources for a job in real time and spawns a shell. This shell is then used to execute srun commands to launch parallel tasks.
- sattach: Attaches standard input, output, and error streams, as well as signal capabilities, to a running job or job step.
- sbcast: Transfers files from local disk to local disk on the nodes allocated to a job. This is particularly useful for diskless compute nodes.
- sstat: Displays resource utilization information for running jobs or job steps.
- scontrol: An administrative tool used to view and modify SLURM states. Some scontrol commands require root user permissions.
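The following illustrative invocations use a placeholder job ID (123456); the availability of some tools (for example seff and sprojects) and the exact output fields depend on the site's configuration.

```bash
seff 123456                       # efficiency report for a completed job
sacct -j 123456 \
      --format=JobID,JobName,Elapsed,State,ExitCode   # accounting record
sprio -j 123456                   # priority factors for a pending job
sstat -j 123456                   # live utilization of a running job
scontrol show job 123456          # detailed job configuration and state

# Interactive allocation: 1 node, 2 tasks, 1 hour, then a shell for srun
salloc --nodes=1 --ntasks=2 --time=01:00:00

# Inside an allocation: copy a file to local disk on every allocated node
sbcast data.in /tmp/data.in
```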
Important Notes¶
SLURM Commands Are Case-Sensitive
Always use the correct command and option syntax. Run man <command> or <command> --help for full details.
Best Practices for Efficient Job Management
- Request only the resources you need (CPU, RAM, GPU) to avoid long queue times.
- Use job arrays to run multiple jobs efficiently (see the sketch after this list).
- Check job status regularly with squeue and optimize your scripts.
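As an illustration of job arrays, the sketch below runs ten independent tasks from a single submission; the program name (./process_input), input file naming, and resource values are assumptions to adapt to your own workload.

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=1-10              # ten tasks with indices 1..10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:15:00

# Each array task receives its own index via SLURM_ARRAY_TASK_ID
srun ./process_input input_${SLURM_ARRAY_TASK_ID}.dat
```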
For further details on SLURM job submission strategies, visit our Job Submission Guide.