GROMACS
GROMACS is a molecular dynamics package with an extensive array of modeling, simulation and analysis capabilities. While primarily developed for the simulation of biochemical molecules, its broad adoption includes research fields such as non-biological chemistry, metadynamics and mesoscale physics. One of the key aspects characterizing GROMACS is its strong focus on high performance and resource efficiency, making use of state-of-the-art algorithms and optimized low-level programming techniques for CPUs and GPUs.
Test case
As a test case, we select the 3M-atom system from the HECBioSim benchmark suite for Molecular Dynamics:
A pair of hEGFR tetramers of 1IVO and 1NQL:
* Total number of atoms = 2,997,924
* Protein atoms = 86,996
* Lipid atoms = 867,784
* Water atoms = 2,041,230
* Ions = 1,914
The simulation is carried out using single precision, 1 MPI process per node and 12 OpenMP threads per MPI process. We measure runtimes for 4, 8, 16, 32 and 64 compute nodes. The input file to download for the test case is 3000k-atoms/benchmark.tpr.
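In terms of batch scheduler settings, this configuration maps to Slurm roughly as sketched below; the gpu constraint matches the srun commands used later on this page, while the remaining options are standard Slurm/OpenMP bookkeeping given here only as an illustration:

# Sketch: resource mapping for 1 MPI rank per node and 12 OpenMP threads per rank
#SBATCH --constraint=gpu
#SBATCH --nodes=16                # repeated for 4, 8, 32 and 64 nodes
#SBATCH --ntasks-per-node=1       # 1 MPI process per node
#SBATCH --cpus-per-task=12        # 12 OpenMP threads per MPI process

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}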
Running the container
Assuming that the benchmark.tpr input data is present in a directory which Sarus is configured to automatically mount inside the container (referred to here by the arbitrary variable $INPUT), we can run the container on 16 nodes as follows:
srun -C gpu -N16 sarus run --mpi \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12
A typical output will look like:
:-) GROMACS - mdrun_mpi, 2018.3 (-:
...
Using 4 MPI processes
Using 12 OpenMP threads per MPI process
On host nid00001 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node: PP:0
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Her1-Her1'
10000 steps, 20.0 ps.
Core t (s) Wall t (s) (%)
Time: 20878.970 434.979 4800.0
(ns/day) (hour/ns)
Performance: 3.973 6.041
GROMACS reminds you: "Shake Yourself" (YES)
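The reported figures are internally consistent: 10000 steps over 20.0 ps correspond to 0.020 ns of simulated time, and the core-time-to-wall-time ratio (20878.970 / 434.979 ≈ 48) matches the 4 ranks × 12 threads of this log. The throughput line can be reproduced directly, for example with:

# Convert 0.020 ns simulated in 434.979 s of wall time into ns/day
awk 'BEGIN { printf "%.3f ns/day\n", 0.020 / 434.979 * 86400 }'
# prints 3.973 ns/day, matching the Performance line above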
If the system administrator did not configure Sarus to mount the input data location during container setup, we can use the --mount option:
srun -C gpu -N16 sarus run --mpi \
--mount=type=bind,src=<path-to-input-directory>,dst=/gromacs-data \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s /gromacs-data/benchmark.tpr -ntomp 12
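To collect the full scaling series (4 to 64 nodes), the same command can simply be repeated over the node counts. The loop below is a sketch that assumes the bind-mount variant and uses a hypothetical -deffnm prefix to keep the outputs of each run separate:

# Sketch: repeat the container run over the node counts of the test case;
# the md_${nodes} output prefix is illustrative, not part of the original setup
for nodes in 4 8 16 32 64; do
    srun -C gpu -N${nodes} sarus run --mpi \
        --mount=type=bind,src=<path-to-input-directory>,dst=/gromacs-data \
        ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
        /usr/local/gromacs/bin/mdrun_mpi -s /gromacs-data/benchmark.tpr \
        -ntomp 12 -deffnm md_${nodes}
done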
Running the native application
CSCS provides and supports GROMACS on Piz Daint; the corresponding documentation page gives more details on how to run GROMACS as a native application. For this test case, the GROMACS/2018.3-CrayGNU-18.08-cuda-9.1 modulefile was loaded.
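For reference, a native run of the same test case follows the usual module-based workflow. The sketch below is an illustration only: the daint-gpu prerequisite module and the mdrun_mpi binary name are assumptions not taken from this page, so adapt them to what the modulefile actually provides (e.g. gmx_mpi mdrun):

# Sketch of the native run; daint-gpu and the mdrun_mpi binary name are assumptions
module load daint-gpu
module load GROMACS/2018.3-CrayGNU-18.08-cuda-9.1
export OMP_NUM_THREADS=12

srun -C gpu -N16 --ntasks-per-node=1 --cpus-per-task=12 \
    mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12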
Container image and Dockerfile
The container image ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 (based on cuda/9.2 and mpich/3.1.4) used for this test case can be pulled from CSCS DockerHub or be rebuilt with this Dockerfile:
FROM ethcscs/mpich:ub1804_cuda92_mpi314

## Uncomment the following lines if you want to build MPI yourself:
## RUN apt-get update \
##     && apt-get install -y --no-install-recommends \
##         wget \
##         gfortran \
##         zlib1g-dev \
##         libopenblas-dev \
##     && rm -rf /var/lib/apt/lists/*
##
## # Install MPICH
## RUN wget -q http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz \
##     && tar xf mpich-3.1.4.tar.gz \
##     && cd mpich-3.1.4 \
##     && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \
##     && make -j$(nproc) \
##     && make install \
##     && ldconfig

# Install CMake (apt installs cmake/3.10.2, we want a more recent version)
RUN mkdir /usr/local/cmake \
    && cd /usr/local/cmake \
    && wget -q https://cmake.org/files/v3.12/cmake-3.12.4-Linux-x86_64.tar.gz \
    && tar -xzf cmake-3.12.4-Linux-x86_64.tar.gz \
    && mv cmake-3.12.4-Linux-x86_64 3.12.4 \
    && rm cmake-3.12.4-Linux-x86_64.tar.gz \
    && cd /

ENV PATH=/usr/local/cmake/3.12.4/bin/:${PATH}

# Install GROMACS (apt installs gromacs/2018.1, we want a more recent version)
RUN wget -q http://ftp.gromacs.org/pub/gromacs/gromacs-2018.3.tar.gz \
    && tar xf gromacs-2018.3.tar.gz \
    && cd gromacs-2018.3 \
    && mkdir build && cd build \
    && cmake -DCMAKE_BUILD_TYPE=Release \
        -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON \
        -DGMX_MPI=on -DGMX_GPU=on -DGMX_SIMD=AVX2_256 \
        -DGMX_BUILD_MDRUN_ONLY=on \
        .. \
    && make -j6 \
    && make check \
    && make install \
    && cd ../.. \
    && rm -fr gromacs-2018.3*
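To rebuild the image and make it available on the system, a typical workflow is to build and push with Docker and then pull with Sarus. The commands below are a sketch: pushing to the ethcscs repository requires the corresponding permissions, so substitute your own registry and tag when rebuilding:

# Build locally from the Dockerfile above, push to a registry, pull with Sarus
docker build -t ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 .
docker push ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4

# On the HPC system
sarus pull ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4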
Required OCI hooks
* NVIDIA Container Runtime hook
* Native MPI hook (MPICH-based)
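The MPI hook is activated by the --mpi option already shown in the run commands, while the NVIDIA hook is typically applied automatically on GPU nodes. A quick way to check that the GPU is exposed inside the container before launching the benchmark is sketched below (nvidia-smi is expected to be injected by the NVIDIA hook):

# Verify GPU visibility inside the container on a single node
srun -C gpu -N1 sarus run ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 nvidia-smi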
Results
We measure wall clock time (in seconds) and performance (in ns/day) as reported by the application logs. The speedup values are computed using the wall clock time averages for each data point, taking the native execution time at 4 nodes as baseline. The results of our experiments are illustrated in the following figure:
Comparison of wall clock execution time, performance, and speedup between native and Sarus-deployed container versions of GROMACS on Piz Daint.
We observe that the containerized application is up to 6% faster than the native implementation, showing a small but consistent performance advantage and comparable standard deviations across the different node counts.