GROMACS
GROMACS is a molecular dynamics package with an extensive array of modeling, simulation and analysis capabilities. While primarily developed for the simulation of biochemical molecules, its broad adoption includes research fields such as non-biological chemistry, metadynamics and mesoscale physics. One of the key aspects characterizing GROMACS is its strong focus on high performance and resource efficiency, making use of state-of-the-art algorithms and optimized low-level programming techniques for CPUs and GPUs.
Test case
As a test case, we select the 3M-atom system from the HECBioSim benchmark suite for Molecular Dynamics:
A pair of hEGFR tetramers of 1IVO and 1NQL:
* Total number of atoms = 2,997,924
* Protein atoms = 86,996
* Lipid atoms = 867,784
* Water atoms = 2,041,230
* Ions = 1,914
The simulation is carried out using single precision, 1 MPI process per node and 12 OpenMP threads per MPI process. We measure runtimes for 4, 8, 16, 32 and 64 compute nodes. The input file to download for the test case is 3000k-atoms/benchmark.tpr.
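In terms of batch scheduler settings, this configuration maps to Slurm roughly as sketched below; the gpu constraint matches the srun commands used later on this page, while the remaining options are standard Slurm/OpenMP bookkeeping given here only as an illustration:

# Sketch: resource mapping for 1 MPI rank per node and 12 OpenMP threads per rank
#SBATCH --constraint=gpu
#SBATCH --nodes=16                # repeated for 4, 8, 32 and 64 nodes
#SBATCH --ntasks-per-node=1       # 1 MPI process per node
#SBATCH --cpus-per-task=12        # 12 OpenMP threads per MPI process

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}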
Running the container
Assuming that the benchmark.tpr input data is present in a directory which Sarus is configured to automatically mount inside the container (referred to here by the arbitrary variable $INPUT), we can run the container on 16 nodes as follows:
srun -C gpu -N16 sarus run --mpi \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12
A typical output will look like:
:-) GROMACS - mdrun_mpi, 2018.3 (-:
...
Using 4 MPI processes
Using 12 OpenMP threads per MPI process
On host nid00001 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node: PP:0
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Her1-Her1'
10000 steps, 20.0 ps.
Core t (s) Wall t (s) (%)
Time: 20878.970 434.979 4800.0
(ns/day) (hour/ns)
Performance: 3.973 6.041
GROMACS reminds you: "Shake Yourself" (YES)
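The reported figures are internally consistent: 10000 steps over 20.0 ps correspond to 0.020 ns of simulated time, and the core-time-to-wall-time ratio (20878.970 / 434.979 ≈ 48) matches the 4 ranks × 12 threads of this log. The throughput line can be reproduced directly, for example with:

# Convert 0.020 ns simulated in 434.979 s of wall time into ns/day
awk 'BEGIN { printf "%.3f ns/day\n", 0.020 / 434.979 * 86400 }'
# prints 3.973 ns/day, matching the Performance line above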
If the system administrator did not configure Sarus to mount the input data location during container setup, we can use the --mount option:
srun -C gpu -N16 sarus run --mpi \
--mount=type=bind,src=<path-to-input-directory>,dst=/gromacs-data \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s /gromacs-data/benchmark.tpr -ntomp 12
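To collect the full scaling series (4 to 64 nodes), the same command can simply be repeated over the node counts. The loop below is a sketch that assumes the bind-mount variant and uses a hypothetical -deffnm prefix to keep the outputs of each run separate:

# Sketch: repeat the container run over the node counts of the test case;
# the md_${nodes} output prefix is illustrative, not part of the original setup
for nodes in 4 8 16 32 64; do
    srun -C gpu -N${nodes} sarus run --mpi \
        --mount=type=bind,src=<path-to-input-directory>,dst=/gromacs-data \
        ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
        /usr/local/gromacs/bin/mdrun_mpi -s /gromacs-data/benchmark.tpr \
        -ntomp 12 -deffnm md_${nodes}
done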
Running the native application
CSCS provides and supports GROMACS on Piz Daint; the corresponding documentation page gives more details on how to run GROMACS as a native application. For this test case, the GROMACS/2018.3-CrayGNU-18.08-cuda-9.1 modulefile was loaded.
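For reference, a native run of the same test case follows the usual module-based workflow. The sketch below is an illustration only: the daint-gpu prerequisite module and the mdrun_mpi binary name are assumptions not taken from this page, so adapt them to what the modulefile actually provides (e.g. gmx_mpi mdrun):

# Sketch of the native run; daint-gpu and the mdrun_mpi binary name are assumptions
module load daint-gpu
module load GROMACS/2018.3-CrayGNU-18.08-cuda-9.1
export OMP_NUM_THREADS=12

srun -C gpu -N16 --ntasks-per-node=1 --cpus-per-task=12 \
    mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12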
Container image and Dockerfile
The container image ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 (based on cuda/9.2 and mpich/3.1.4) used for this test case can be pulled from CSCS DockerHub or be rebuilt with this Dockerfile:
FROM ethcscs/mpich:ub1804_cuda92_mpi314

## Uncomment the following lines if you want to build MPI yourself:
## RUN apt-get update \
##     && apt-get install -y --no-install-recommends \
##         wget \
##         gfortran \
##         zlib1g-dev \
##         libopenblas-dev \
##     && rm -rf /var/lib/apt/lists/*
##
## # Install MPICH
## RUN wget -q http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz \
##     && tar xf mpich-3.1.4.tar.gz \
##     && cd mpich-3.1.4 \
##     && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \
##     && make -j$(nproc) \
##     && make install \
##     && ldconfig

# Install CMake (apt installs cmake/3.10.2, we want a more recent version)
RUN mkdir /usr/local/cmake \
    && cd /usr/local/cmake \
    && wget -q https://cmake.org/files/v3.12/cmake-3.12.4-Linux-x86_64.tar.gz \
    && tar -xzf cmake-3.12.4-Linux-x86_64.tar.gz \
    && mv cmake-3.12.4-Linux-x86_64 3.12.4 \
    && rm cmake-3.12.4-Linux-x86_64.tar.gz \
    && cd /

ENV PATH=/usr/local/cmake/3.12.4/bin/:${PATH}

# Install GROMACS (apt installs gromacs/2018.1, we want a more recent version)
RUN wget -q http://ftp.gromacs.org/pub/gromacs/gromacs-2018.3.tar.gz \
    && tar xf gromacs-2018.3.tar.gz \
    && cd gromacs-2018.3 \
    && mkdir build && cd build \
    && cmake -DCMAKE_BUILD_TYPE=Release \
        -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON \
        -DGMX_MPI=on -DGMX_GPU=on -DGMX_SIMD=AVX2_256 \
        -DGMX_BUILD_MDRUN_ONLY=on \
        .. \
    && make -j6 \
    && make check \
    && make install \
    && cd ../.. \
    && rm -fr gromacs-2018.3*
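To rebuild the image and make it available on the system, a typical workflow is to build and push with Docker and then pull with Sarus. The commands below are a sketch: pushing to the ethcscs repository requires the corresponding permissions, so substitute your own registry and tag when rebuilding:

# Build locally from the Dockerfile above, push to a registry, pull with Sarus
docker build -t ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 .
docker push ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4

# On the HPC system
sarus pull ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4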
Required OCI hooks
* NVIDIA Container Runtime hook
* Native MPI hook (MPICH-based)
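The MPI hook is activated by the --mpi option already shown in the run commands, while the NVIDIA hook is typically applied automatically on GPU nodes. A quick way to check that the GPU is exposed inside the container before launching the benchmark is sketched below (nvidia-smi is expected to be injected by the NVIDIA hook):

# Verify GPU visibility inside the container on a single node
srun -C gpu -N1 sarus run ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 nvidia-smi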
Results
We measure wall clock time (in seconds) and performance (in ns/day) as reported by the application logs. The speedup values are computed using the wall clock time averages for each data point, taking the native execution time at 4 nodes as baseline. The results of our experiments are illustrated in the following figure:
Comparison of wall clock execution time, performance, and speedup between native and Sarus-deployed container versions of GROMACS on Piz Daint.
We observe that the containerized application is up to 6% faster than the native implementation, showing a small but consistent performance advantage and comparable standard deviations across the different node counts.