SLURM: Cluster Management and Job Scheduling System

SLURM (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, and highly scalable job scheduling system designed for managing compute resources in both small and large Linux clusters. SLURM operates without requiring kernel modifications, making it relatively self-contained.

It has three core functions:

  1. Resource Allocation: SLURM allocates exclusive and/or non-exclusive access to compute nodes for a defined period, allowing users to perform work.

  2. Job Execution and Monitoring: It provides a framework to start, execute, and monitor jobs (often parallel jobs) on the allocated nodes.

  3. Resource Arbitration: SLURM arbitrates contention for cluster resources by managing a queue of pending jobs.
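
As a minimal sketch of how these three functions appear in everyday use (the partition name standard is an assumption; run sinfo to see the partitions available on your cluster):

    # Resource allocation: request 1 node with 4 CPUs for 30 minutes
    salloc --partition=standard --nodes=1 --cpus-per-task=4 --time=00:30:00

    # Job execution and monitoring: launch a job step inside the allocation
    srun hostname

    # Resource arbitration: while resources are contended, the request waits in the queue
    squeue -u $USER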

Useful Commands

SLURM provides a rich set of commands for managing resources and jobs. You can access the man pages for more detailed documentation or use the --help flag with any command for a brief summary of options.

Most Used Commands

  • sinfo: Displays the state of partitions and nodes managed by SLURM. It includes filtering, sorting, and formatting options.

  • squeue: Shows the status of jobs or job steps. The default output shows running jobs in priority order, followed by pending jobs.

  • srun: Used to submit a job or start a job step in real-time. It allows specifying resource requirements such as processor count, memory, disk space, etc. It supports sequential and parallel job steps.

  • sbatch: Submits a job script for later execution. The script typically contains one or more srun commands to launch parallel tasks.

  • scancel: Cancels a pending or running job or job step. It can also send arbitrary signals to processes associated with the job.
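
The following illustrative invocations show typical uses of these five commands; the script name my_job.sh and the job ID 12345 are placeholders:

    # Show the state of partitions and nodes
    sinfo

    # Show your own running and pending jobs
    squeue -u $USER

    # Run a short parallel job step interactively
    srun --ntasks=2 --time=00:10:00 hostname

    # Submit a batch script for later execution
    sbatch my_job.sh

    # Cancel a job by its job ID
    scancel 12345

A minimal batch script for sbatch might look like the following sketch; the job name, resource amounts, and the srun command line are assumptions to adapt to your workload:

    #!/bin/bash
    #SBATCH --job-name=example        # name shown in squeue
    #SBATCH --ntasks=4                # number of parallel tasks
    #SBATCH --time=00:10:00           # wall-clock time limit
    #SBATCH --mem-per-cpu=1G          # memory per allocated CPU

    # Launch the parallel tasks on the allocated resources
    srun hostname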

Additional Commands

  • seff: Generates a job efficiency report for jobs that have completed and exited the queue.

  • sacct: Reports job or job step accounting information for active or completed jobs.

  • sprojects: Displays information about user projects and the remaining CPU and GPU allocation.

  • sprio: Provides a detailed view of the factors affecting a job's priority.

  • sshare: Displays information about fairshare usage in the cluster. Note: This is only applicable when using the priority/multifactor plugin.

  • salloc: Allocates resources for a job in real-time and spawns a shell. This shell is used to execute srun commands to launch parallel tasks.

  • sattach: Allows attaching standard input, output, and error streams, as well as signal capabilities, to a running job or job step.

  • sbcast: Transfers a file from local disk to the local disks of the nodes allocated to a job. This is particularly useful for diskless compute nodes.

  • sstat: Displays resource utilization information for running jobs or job steps.

  • scontrol: An administrative tool used to view and modify SLURM states. Some scontrol commands require root user permissions.
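
A few illustrative calls to these additional commands; the job ID 12345, the step ID 12345.0, and the file input.dat are placeholders:

    # Accounting and efficiency report for a completed job
    sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State
    seff 12345

    # Live resource usage of a running job step
    sstat -j 12345.0 --format=JobID,AveCPU,MaxRSS

    # Priority factors and fairshare information
    sprio -j 12345
    sshare -u $USER

    # Detailed state of a job (some scontrol subcommands need elevated privileges)
    scontrol show job 12345

    # Attach terminal I/O to a running job step
    sattach 12345.0

    # From inside a job allocation: stage input.dat onto the local disk of every allocated node
    sbcast input.dat /tmp/input.dat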

Important Notes

SLURM Commands Are Case-Sensitive

SLURM commands and their options are case-sensitive, so always use the correct command and option syntax. Run man <command> or <command> --help for full details.

Best Practices for Efficient Job Management

  • Request only the resources you need (CPU, RAM, GPU) to avoid long queue times.
  • Use job arrays to run many similar jobs efficiently (see the sketch after this list).
  • Check job status regularly with squeue and optimize your scripts.
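
As an illustration of the job array recommendation, a sketch of an array job script; the program ./process_data and the input file naming scheme are assumptions for illustration:

    #!/bin/bash
    #SBATCH --job-name=array_example
    #SBATCH --array=1-10              # ten array tasks with indices 1..10
    #SBATCH --ntasks=1
    #SBATCH --time=00:05:00
    #SBATCH --mem-per-cpu=500M

    # Each array task selects its own input file via its array index
    srun ./process_data input_${SLURM_ARRAY_TASK_ID}.dat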

For further details on SLURM job submission strategies, visit our Job Submission Guide.

Created by: marek.steklac