GPU jobs
GPU Partitions¶
To access the resources (nodes) with GPU cards installed, your job has to be submitted to one of the following partitions:
- testing (one NVIDIA A100, very short/testing runs)
- gpu (up to four NVIDIA A100s, production runs)
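For instance, a short test run can be sent to the testing partition directly on the command line; the script name below is only a placeholder:

```bash
# Submit a short test job to the testing partition (script name is illustrative)
sbatch --partition=testing my_gpu_test.sh
```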
Specific parameters of GPU/hybrid jobs¶
| Option | Description |
|---|---|
| `--partition=gpu` | Request a GPU partition for your job. |
| `-G ?` | Allocate the given number of GPUs for the job. |
| `--mem-per-gpu=?GB` | Set the specific memory requirement per GPU. |
Please read `man sbatch` for more options.
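Put together, a job script header requesting GPU resources might look like the following sketch (the values are illustrative, not recommendations):

```bash
#!/bin/bash
#SBATCH --partition=gpu      # run in the GPU partition
#SBATCH -G 1                 # allocate one GPU for the job
#SBATCH --mem-per-gpu=16GB   # request 16 GB of host memory per allocated GPU
```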
GPU job example¶
As an example, let's look at this minimal GPU batch job script launching a CUDA-compiled application on 2 GPU cards:
login01:~$ cat gpu_run.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH -G 2
#SBATCH -o output.txt
#SBATCH -e output.txt
module load cuda/12.0.1
./jacobi -nx 45000 -ny 45000 -niter 10000
Note that since we are not specifying a project account, the job will run under the user's default account. Both stderr and stdout are redirected to a single file called output.txt.
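If you do want to charge the job to a specific project, you can add the `--account` (`-A`) sbatch directive to the script; the project name below is a placeholder:

```bash
#SBATCH --account=<project-id>   # placeholder; replace with your project account
```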
GPU Usage Monitoring¶
You can use the `nvidia-smi` tool to display information about GPU utilization by your applications. To do that, you have to log into the specific node running the application.
In this example, we launch the above-mentioned script and find out that it is running on node n142:
login01:~$ sbatch gpu_run.sh
sbatch: slurm_job_submit: Set partition to: gpu
sbatch: slurm_job_submit: Job's time limit was set to partition limit of 2880 minutes.
Submitted batch job 41029
login01:~$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
41029 gpu run.sh user R 0:04 1 n142
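If you only need the node name, `squeue` can also print it directly; the job ID below refers to the job submitted above:

```bash
# Print only the node(s) allocated to job 41029, without the header line
squeue -j 41029 -h -o "%N"
```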
Then we can connect to the node and run `nvidia-smi`.
Nvidia-smi command and GPU usage
login01:~$ ssh n142
Last login: Tue Oct 3 11:41:43 2023 from login01.devana.local
n142:~$ nvidia-smi
Wed Oct 4 09:59:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 249W / 400W | 8179MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:31:00.0 Off | 0 |
| N/A 44C P0 237W / 400W | 8179MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 40679 C ./jacobi_test 8146MiB |
| 1 N/A N/A 40679 C ./jacobi_test 8146MiB |
+-----------------------------------------------------------------------------+
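For continuous monitoring, `nvidia-smi` can also report selected metrics periodically in CSV form; the field list below is just one possible choice:

```bash
# Refresh GPU utilization and memory figures every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5
```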
Restricted SSH access
Please note that you can directly access only the nodes where your application is running. When the job is finished, your connection will be terminated as well.