GPU jobs
GPU Partitions¶
To access the resources (nodes) with GPU cards installed, your job has to be submitted to one of the following partitions:
- testing (one NVIDIA A100, very short/testing runs)
- gpu (up to four NVIDIA A100s, production runs)
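For instance, a short test run can be sent to the testing partition directly on the command line; the script name below is only a placeholder:

```bash
# Submit a short test job to the testing partition (script name is illustrative)
sbatch --partition=testing my_gpu_test.sh
```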
Specific parameters of GPU/hybrid jobs¶
| Option | Description |
|---|---|
| `--partition=gpu` | Request a GPU partition for your job. |
| `-G ?` | Allocate the given number of GPUs for the job. |
| `--mem-per-gpu=?GB` | Set the specific memory requirement per GPU. |
Please read `man sbatch` for more options.
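Put together, a job script header requesting GPU resources might look like the following sketch (the values are illustrative, not recommendations):

```bash
#!/bin/bash
#SBATCH --partition=gpu      # run in the GPU partition
#SBATCH -G 1                 # allocate one GPU for the job
#SBATCH --mem-per-gpu=16GB   # request 16 GB of host memory per allocated GPU
```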
GPU job example¶
As an example, let's look at this minimal GPU batch job script launching a CUDA-compiled application on 2 GPU cards:
login01:~$ cat gpu_run.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH -G 2
#SBATCH -o output.txt
#SBATCH -e output.txt
module load cuda/12.0.1
./jacobi -nx 45000 -ny 45000 -niter 10000
Note that since we are not specifying a project account, the job will run under the user's default account. Both stderr and stdout are redirected to a single file called output.txt.
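If you do want to charge the job to a specific project, you can add the `--account` (`-A`) sbatch directive to the script; the project name below is a placeholder:

```bash
#SBATCH --account=<project-id>   # placeholder; replace with your project account
```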
GPU Usage Monitoring¶
You can use the `nvidia-smi` tool to display information about GPU utilization by your applications. To do that, you have to log into the specific node running the application.
In this example, we launch the above-mentioned script and find out that it is running on node n142:
login01:~$ sbatch gpu_run.sh
sbatch: slurm_job_submit: Set partition to: gpu
sbatch: slurm_job_submit: Job's time limit was set to partition limit of 2880 minutes.
Submitted batch job 41029
login01:~$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
41029 gpu run.sh user R 0:04 1 n142
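If you only need the node name, `squeue` can also print it directly; the job ID below refers to the job submitted above:

```bash
# Print only the node(s) allocated to job 41029, without the header line
squeue -j 41029 -h -o "%N"
```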
Then we can connect to the node and run `nvidia-smi`.
Nvidia-smi command and GPU usage
login01:~$ ssh n142
Last login: Tue Oct 3 11:41:43 2023 from login01.devana.local
n142:~$ nvidia-smi
Wed Oct 4 09:59:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 249W / 400W | 8179MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:31:00.0 Off | 0 |
| N/A 44C P0 237W / 400W | 8179MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 40679 C ./jacobi_test 8146MiB |
| 1 N/A N/A 40679 C ./jacobi_test 8146MiB |
+-----------------------------------------------------------------------------+
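For continuous monitoring, `nvidia-smi` can also report selected metrics periodically in CSV form; the field list below is just one possible choice:

```bash
# Refresh GPU utilization and memory figures every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5
```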
Restricted SSH access
Please note that you can directly access only the nodes where your application is running. When the job is finished, your connection will be terminated as well.