
GPU jobs

GPU Partitions

In order to access the resources (nodes) where GPU cards are installed, your job has to be submitted into one of the following partitions:

  • testing (one NVIDIA A100, very short/testing runs)
  • gpu (up to four NVIDIA A100s, production runs)
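
To check the current state and time limits of these partitions, you can use the standard Slurm sinfo command, for example (a minimal sketch, output omitted here; the exact listing depends on the cluster configuration):

demovic@login02 ~ > sinfo -p gpu,testing

The output lists, for each partition, its availability, time limit and the nodes it contains.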

Specific parameters of GPU/hybrid jobs

Option               Description
--partition=gpu      Request a GPU partition for your job.
-G ?                 Allocate the given number of GPUs for the job.
--mem-per-gpu=?GB    Set the memory requirement per allocated GPU.

Please read man sbatch for more options.
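
The same options can also be passed directly on the sbatch command line instead of in the job script header, for instance (a minimal sketch; the script name and memory value are only illustrative):

demovic@login02 ~ > sbatch --partition=gpu -G 1 --mem-per-gpu=16GB my_job.sh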

GPU job example

As an example, let's look at this minimalistic GPU batch job script launching a CUDA-compiled application on two GPU cards:

demovic@login02 jacobi > cat run.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH -G 2
#SBATCH -o output.txt
#SBATCH -e output.txt

module load cuda/12.0.1

./jacobi -nx 45000 -ny 45000 -niter 10000

Note that since we do not specify a project account, the job will run under the user's default account. Both stderr and stdout are redirected to a single file called output.txt.
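
If you want to charge the job to a specific project and keep stdout and stderr in separate files, the script header can be extended as follows (a sketch; the account name p123-example is only a placeholder):

#!/bin/bash
#SBATCH -p gpu
#SBATCH -G 2
#SBATCH -A p123-example   # project account to charge (placeholder name)
#SBATCH -o stdout.txt     # standard output file
#SBATCH -e stderr.txt     # standard error file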

GPU usage monitoring

You can use the nvidia-smi tool to display information about GPU utilization by your applications. To do that, you have to log into the specific node where the application is running.

In this example, we launch the above-mentioned script and find out that it is running on node n142:

demovic@login02 jacobi > sbatch run.sh 
sbatch: slurm_job_submit: Set partition to: gpu
sbatch: slurm_job_submit: Job's time limit was set to partition limit of 2880 minutes.
Submitted batch job 41029

demovic@login02 jacobi > squeue -u demovic
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             41029       gpu   run.sh  demovic  R       0:04      1 n142
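
The node list can also be obtained with the standard scontrol command, using the job ID reported by sbatch (a minimal sketch):

demovic@login02 jacobi > scontrol show job 41029 | grep -i nodelist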

Then we can connect to the node and run nvidia-smi:

demovic@login02 jacobi > ssh n142
Last login: Tue Oct  3 11:41:43 2023 from login01.devana.local
demovic@n142 ~ > nvidia-smi 
Wed Oct  4 09:59:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   42C    P0   249W / 400W |   8179MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   44C    P0   237W / 400W |   8179MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     40679      C   ./jacobi_test                    8146MiB |
|    1   N/A  N/A     40679      C   ./jacobi_test                    8146MiB |
+-----------------------------------------------------------------------------+

We can thus verify that the job is fully utilizing both allocated GPUs. Please note that you can directly access only the nodes where your application is running. When the job finishes, your SSH connection to the node will be terminated as well.
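
If you want to watch the utilization continuously rather than taking a single snapshot, nvidia-smi can repeat its report at a fixed interval; the second form prints a compact CSV summary (a minimal sketch, the 5-second interval is just an example):

demovic@n142 ~ > nvidia-smi -l 5
demovic@n142 ~ > nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5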

Created by: Filip Holka