NVIDIA GPUDirect RDMA

GPUDirect is a technology that enables direct RDMA to and from GPU memory. This means that multiple GPUs can directly read and write CUDA host and device memory without intermediate copies through host memory or involvement of the CPU, which significantly improves data transfer performance.

We will show here that Sarus is able to leverage the GPUDirect technology.

Test case

This sample C++ code performs an MPI_Allgather operation using CUDA device memory and GPUDirect. If the operation is carried out successfully, the program prints a success message to standard output.
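To make the expected result concrete, the following is a minimal host-only sketch of what an all-gather produces and how a success check could look. It is not the test program itself and uses neither MPI nor CUDA: `allgather_model` and `gathered_ok` are hypothetical names, and the assumption that rank r contributes copies of the value r is made up for illustration. In the real program the send and receive buffers would be CUDA device pointers passed to MPI_Allgather.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of an all-gather (hypothetical helper, no MPI involved):
// every rank contributes a block of equal length, and every rank receives
// all blocks concatenated in rank order.
std::vector<std::vector<int>> allgather_model(
    const std::vector<std::vector<int>>& send) {
    assert(!send.empty());
    const std::size_t count = send[0].size();
    std::vector<int> gathered;
    for (const auto& block : send) {
        assert(block.size() == count);  // all ranks send the same count
        gathered.insert(gathered.end(), block.begin(), block.end());
    }
    // Each rank ends up with an identical copy of the gathered data.
    return std::vector<std::vector<int>>(send.size(), gathered);
}

// Hypothetical success check, assuming rank r contributed `count` copies
// of the value r: element i must then equal i / count.
bool gathered_ok(const std::vector<int>& recv, std::size_t ranks,
                 std::size_t count) {
    if (recv.size() != ranks * count) return false;
    for (std::size_t i = 0; i < recv.size(); ++i) {
        if (recv[i] != static_cast<int>(i / count)) return false;
    }
    return true;
}
```

With four ranks each sending two elements, every rank would hold eight elements after the operation, and a program following this pattern would print its success message only if the check passes on every rank.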

Running the container

Before running this code with Sarus, two environment variables must be set: MPICH_RDMA_ENABLED_CUDA and LD_PRELOAD.

MPICH_RDMA_ENABLED_CUDA: allows the MPI application to pass GPU
pointers directly to point-to-point and collective communication
functions.

LD_PRELOAD: loads the specified CUDA library from the compute node
before all other libraries.

Both variables can be set by passing a command string to bash:

srun -C gpu -N4 -t2 sarus run --mpi \
    ethcscs/mpich:ub1804_cuda92_mpi314_gpudirect-all_gather \
    bash -c 'MPICH_RDMA_ENABLED_CUDA=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so ./all_gather'

A successful output looks like:

Success!
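If the run fails instead, one thing worth ruling out is that the two variables did not reach the process inside the container. The following is a minimal sketch of a startup check, assuming a hypothetical helper named `gpudirect_env_problems`; it is not part of the test program and only inspects the environment with std::getenv. The LD_PRELOAD check applies to the container run only.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Returns the value of an environment variable, or an empty string if unset.
std::string env_or_empty(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: lists what looks wrong with the GPUDirect-related
// environment described above. An empty result means both checks passed.
std::vector<std::string> gpudirect_env_problems() {
    std::vector<std::string> problems;
    if (env_or_empty("MPICH_RDMA_ENABLED_CUDA") != "1") {
        problems.push_back("MPICH_RDMA_ENABLED_CUDA is not set to 1");
    }
    if (env_or_empty("LD_PRELOAD").find("libcuda.so") == std::string::npos) {
        problems.push_back("LD_PRELOAD does not mention libcuda.so");
    }
    return problems;
}
```

A program could call this helper before initializing MPI and print each returned message to standard error, which makes a misconfigured launch easier to tell apart from a genuine communication failure.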

Running the native application

export MPICH_RDMA_ENABLED_CUDA=1
srun -C gpu -N4 -t2 ./all_gather

A successful output looks like:

Success!

Container image and Dockerfile

The container image used for this test case (based on cuda/9.2 and mpich/3.1.4) can be pulled from the CSCS DockerHub or rebuilt with this Dockerfile:

# docker build -f Dockerfile -t \
#   ethcscs/mpich:ub1804_cuda92_mpi314_gpudirect-all_gather .
FROM ethcscs/mpich:ub1804_cuda92_mpi314

COPY all_gather.cpp /opt/mpi_gpudirect/all_gather.cpp
WORKDIR /opt/mpi_gpudirect
RUN mpicxx -g all_gather.cpp -o all_gather -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

Required OCI hooks

  • NVIDIA Container Runtime hook
  • Native MPI hook (MPICH-based)