Nvidia CUDA Compiler¶
Nvidia CUDA Compiler driver (NVCC) is a proprietary, LLVM-based compiler by Nvidia intended for use with CUDA code, which contains parts for both the CPU and the GPU. NVCC separates these two parts and sends the host code (the part that will run on the CPU) to a C/C++ compiler (GCC, Intel C++ Compiler, or Microsoft Visual C++ Compiler), while the device code (the part that will run on the GPU) is compiled by NVCC itself.
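To illustrate this split, the short sketch below (illustrative code, not specific to this cluster) contains both parts in one source file: the __global__ kernel is device code compiled by NVCC for the GPU, while main() is host code forwarded to the host C++ compiler.
#include <cstdio>

// Device code: compiled by NVCC for the GPU
__global__ void hello_kernel() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: forwarded by NVCC to the host C++ compiler (e.g. GCC)
int main() {
    hello_kernel<<<1, 4>>>();   // launch one block of 4 GPU threads
    cudaDeviceSynchronize();    // wait for the kernel to finish
    return 0;
}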
Versions¶
The following versions of CUDA are currently available. You can load a selected version as a module with one of the following commands:
module load CUDA/11.3.1
module load CUDA/11.4.1
module load CUDA/12.0.0
Compiling with CUDA¶
CUDA code can be compiled directly on the login nodes, so users do not have to use compute nodes with GPU acceleration for the compilation. CUDA code is compiled with the NVCC compiler. A thorough description of the NVCC command options can be found on the Nvidia website dedicated to the CUDA compiler.
login01:~$ nvcc --version
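For example, after loading one of the CUDA modules listed above, a single source file can be compiled directly on the login node (the file name saxpy.cu is illustrative; the -arch=sm_80 flag targets the compute capability 8.0 of the A100 GPUs described below):
login01:~$ ml CUDA/12.0.0
login01:~$ nvcc -arch=sm_80 saxpy.cu -o saxpy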
The CUDA toolkit comes with a large set of examples that cover both elementary and intermediate compilation codes. To test the compilation, users should copy these examples to their home directory. To compile an example, change into its directory and run the make command to start the compilation.
login01:~$ mkdir nvcc-tests
login01:~$ cp -r /storage-apps/easybuild/software/CUDA/<version>/samples/ ~/nvcc-tests/
login01:~$ cd nvcc-tests/1_Utilities/deviceQuery
login01:~/nvcc-tests/1_Utilities/deviceQuery$ make
To run the code, the user can use an interactive Slurm session to get access to a node with an Nvidia GPU and execute the binary file:
login01:~/nvcc-tests/1_Utilities/deviceQuery$ srun --job-name=nvcc-tests --partition=ngpu -G 1 --pty bash
login01:~/nvcc-tests/1_Utilities/deviceQuery$ ml CUDA
login01:~/nvcc-tests/1_Utilities/deviceQuery$ ./deviceQuery
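Alternatively, the binary can be submitted as a non-interactive batch job. A minimal sketch, assuming the same ngpu partition (the job name and time limit are illustrative):
#!/bin/bash
#SBATCH --job-name=nvcc-tests
#SBATCH --partition=ngpu
#SBATCH -G 1
#SBATCH --time=00:10:00

ml CUDA
./deviceQuery
The script is then submitted with sbatch from the example directory.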
The expected output of the deviceQuery example executed on a node with an Nvidia A100-SXM4-40GB GPU is:
CUDA output
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-40GB"
CUDA Driver Version / Runtime Version 12.0 / 11.4
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40370 MBytes (42331013120 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 23 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
Code Example¶
The following is a CUDA-based code for the addition of two vectors. To test it, copy the code to cuda-vector-addition.cu and compile it with nvcc cuda-vector-addition.cu -o cuda-vector-addition. To execute the code, open an interactive Slurm session on a GPU-accelerated node, as described above.
CUDA vector addition code
#include <stdio.h>
#include <stdlib.h>

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// Add vectors on GPU
__global__ void add_gpu(int *x, int *y, int *z, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                      // guard against out-of-range threads
        z[index] = x[index] + y[index];
}

// Add vectors on CPU
void add_cpu(int *x, int *y, int *z, int n) {
    for (int j = 0; j < n; j++)
        z[j] = x[j] + y[j];
}

// Generate vectors
void random_ints(int *x, int n) {
    for (int j = 0; j < n; j++)
        x[j] = rand() % 10000;          // random number between 0 and 9999
}

// Compare vectors on CPU
int compare_ints(int *x, int *y, int n) {
    int pass = 0;
    for (int j = 0; j < n; j++) {
        if (x[j] != y[j]) {
            printf("Values differ at location %d, with values of %d and %d\n", j, x[j], y[j]);
            pass = 1;
        }
    }
    if (pass == 0)
        printf("Success\n");
    else
        printf("Fail\n");
    return pass;
}

int main(void) {
    int *x, *y, *z;                     // CPU copies of x, y, z
    int *dev_x, *dev_y, *dev_z;         // GPU copies of x, y, z
    int size = N * sizeof(int);         // space for N integers

    // Allocate GPU copies of dev_x, dev_y, dev_z
    cudaMalloc((void**)&dev_x, size);
    cudaMalloc((void**)&dev_y, size);
    cudaMalloc((void**)&dev_z, size);

    // Allocate CPU copies of x, y, z
    x = (int*)malloc(size);
    y = (int*)malloc(size);
    z = (int*)malloc(size);

    // Fill input vectors with random integer numbers
    random_ints(x, N);
    random_ints(y, N);

    // Copy inputs to GPU
    cudaMemcpy(dev_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, size, cudaMemcpyHostToDevice);

    // Launch add_gpu() kernel; round the block count up so all N elements are covered
    add_gpu<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_x, dev_y, dev_z, N);

    // Copy GPU result back to CPU copy of z
    cudaMemcpy(z, dev_z, size, cudaMemcpyDeviceToHost);

    // Compare with CPU results
    int *z_h = (int*)malloc(size);
    add_cpu(x, y, z_h, N);
    compare_ints(z, z_h, N);

    // Clean CPU memory
    free(x);
    free(y);
    free(z);
    free(z_h);

    // Clean GPU memory
    cudaFree(dev_x);
    cudaFree(dev_y);
    cudaFree(dev_z);

    return 0;
}
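Compiled and executed as described above, the program compares the GPU result with the CPU reference and prints Success when they match. A possible session (the job name is illustrative):
login01:~$ ml CUDA
login01:~$ nvcc cuda-vector-addition.cu -o cuda-vector-addition
login01:~$ srun --job-name=cuda-vec-add --partition=ngpu -G 1 ./cuda-vector-addition
Success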