Nvidia CUDA Compiler¶
Nvidia CUDA Compiler driver (NVCC) is a proprietary, LLVM-based compiler by Nvidia intended for use with CUDA code, which contains parts for both the CPU and the GPU. NVCC separates these two parts and sends the host code (the part that will run on the CPU) to a C/C++ compiler (GCC, Intel C++ Compiler, or Microsoft Visual C++ Compiler), while the device code (the part that will run on the GPU) is compiled by NVCC itself.
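To illustrate this split, the short sketch below (illustrative code, not specific to this cluster) contains both parts in one source file: the __global__ kernel is device code compiled by NVCC for the GPU, while main() is host code forwarded to the host C++ compiler.
#include <cstdio>

// Device code: compiled by NVCC for the GPU
__global__ void hello_kernel() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: forwarded by NVCC to the host C++ compiler (e.g. GCC)
int main() {
    hello_kernel<<<1, 4>>>();   // launch one block of 4 GPU threads
    cudaDeviceSynchronize();    // wait for the kernel to finish
    return 0;
}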
Versions¶
The following versions of CUDA are currently available. You can load a selected version as a module with one of the following commands:
module load CUDA/11.3.1
module load CUDA/11.4.1
module load CUDA/12.0.0
Compiling with CUDA¶
CUDA code can be compiled directly on the login nodes, so users do not have to use compute nodes with GPU acceleration for the compilation. CUDA code is compiled with the NVCC compiler. A thorough description of the NVCC command options can be found on the Nvidia website dedicated to the CUDA compiler.
login01:~$ nvcc --version
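For example, after loading one of the CUDA modules listed above, a single source file can be compiled directly on the login node (the file name saxpy.cu is illustrative; the -arch=sm_80 flag targets the compute capability 8.0 of the A100 GPUs described below):
login01:~$ ml CUDA/12.0.0
login01:~$ nvcc -arch=sm_80 saxpy.cu -o saxpy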
The CUDA toolkit comes with a large set of examples that cover both elementary and intermediate compilation codes. To test the compilation, users should copy these examples to their home directory. To compile an example, change into its directory and run the make command to start the compilation.
login01:~$ mkdir nvcc-tests
login01:~$ cp -r /storage-apps/easybuild/software/CUDA/<version>/samples/ ~/nvcc-tests/
login01:~$ cd nvcc-tests/1_Utilities/deviceQuery
login01:~/nvcc-tests/1_Utilities/deviceQuery$ make
To run the code, the user can use an interactive Slurm session to get access to a node with an Nvidia GPU and execute the binary file:
login01:~/nvcc-tests/1_Utilities/deviceQuery$ srun --job-name=nvcc-tests --partition=ngpu -G 1 --pty bash
login01:~/nvcc-tests/1_Utilities/deviceQuery$ ml CUDA
login01:~/nvcc-tests/1_Utilities/deviceQuery$ ./deviceQuery
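Alternatively, the binary can be submitted as a non-interactive batch job. A minimal sketch, assuming the same ngpu partition (the job name and time limit are illustrative):
#!/bin/bash
#SBATCH --job-name=nvcc-tests
#SBATCH --partition=ngpu
#SBATCH -G 1
#SBATCH --time=00:10:00

ml CUDA
./deviceQuery
The script is then submitted with sbatch from the example directory.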
The expected output of the deviceQuery example executed on a node with an Nvidia A100-SXM4-40GB GPU is:
CUDA output
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-40GB"
CUDA Driver Version / Runtime Version 12.0 / 11.4
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40370 MBytes (42331013120 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 23 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
Code Example¶
The following is a CUDA-based code for the addition of two vectors. To test it, copy the code to cuda-vector-addition.cu and compile it with nvcc cuda-vector-addition.cu -o cuda-vector-addition. To execute the code, open an interactive Slurm session on a GPU-accelerated node, as described above.
CUDA vector addition code
#include <stdio.h>
#include <stdlib.h>

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// Add vectors on GPU
__global__ void add_gpu(int *x, int *y, int *z, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                      // guard against out-of-range threads
        z[index] = x[index] + y[index];
}

// Add vectors on CPU
void add_cpu(int *x, int *y, int *z, int n) {
    for (int j = 0; j < n; j++)
        z[j] = x[j] + y[j];
}

// Generate vectors
void random_ints(int *x, int n) {
    for (int j = 0; j < n; j++)
        x[j] = rand() % 10000;          // random number between 0 and 9999
}

// Compare vectors on CPU
int compare_ints(int *x, int *y, int n) {
    int pass = 0;
    for (int j = 0; j < n; j++) {
        if (x[j] != y[j]) {
            printf("Values differ at location %d, with values of %d and %d\n", j, x[j], y[j]);
            pass = 1;
        }
    }
    if (pass == 0)
        printf("Success\n");
    else
        printf("Fail\n");
    return pass;
}

int main(void) {
    int *x, *y, *z;                     // CPU copies of x, y, z
    int *dev_x, *dev_y, *dev_z;         // GPU copies of x, y, z
    int size = N * sizeof(int);         // space for N integers

    // Allocate GPU copies of dev_x, dev_y, dev_z
    cudaMalloc((void**)&dev_x, size);
    cudaMalloc((void**)&dev_y, size);
    cudaMalloc((void**)&dev_z, size);

    // Allocate CPU copies of x, y, z
    x = (int*)malloc(size);
    y = (int*)malloc(size);
    z = (int*)malloc(size);

    // Fill input vectors with random integer numbers
    random_ints(x, N);
    random_ints(y, N);

    // Copy inputs to GPU
    cudaMemcpy(dev_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, size, cudaMemcpyHostToDevice);

    // Launch add_gpu() kernel; round the block count up so all N elements are covered
    add_gpu<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_x, dev_y, dev_z, N);

    // Copy GPU result back to CPU copy of z
    cudaMemcpy(z, dev_z, size, cudaMemcpyDeviceToHost);

    // Compare with CPU results
    int *z_h = (int*)malloc(size);
    add_cpu(x, y, z_h, N);
    compare_ints(z, z_h, N);

    // Clean CPU memory
    free(x);
    free(y);
    free(z);
    free(z_h);

    // Clean GPU memory
    cudaFree(dev_x);
    cudaFree(dev_y);
    cudaFree(dev_z);

    return 0;
}
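Compiled and executed as described above, the program compares the GPU result with the CPU reference and prints Success when they match. A possible session (the job name is illustrative):
login01:~$ ml CUDA
login01:~$ nvcc cuda-vector-addition.cu -o cuda-vector-addition
login01:~$ srun --job-name=cuda-vec-add --partition=ngpu -G 1 ./cuda-vector-addition
Success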