Basic end-to-end user workflow
Steps to do on your workstation:
Build the container image using the Docker tool and Dockerfiles.
Push the Docker image into Docker Hub registry (https://hub.docker.com).
Steps to do on the HPC system:
Pull the Docker image from the Docker Hub registry using Sarus.
Run the image at scale with Sarus.
An explanation of the different steps required to deploy a Docker image using Sarus follows.
1. Develop the Docker image
First, the user has to build a container image. This boils down to writing a Dockerfile that describes the container, executing the Docker command line tool to build the image, and then running the container to test it. Below you can find an overview of what the development process looks like. We provide a brief introduction to using Docker, however, for a detailed explanation please refer to the Docker Get Started guide.
Let’s start with a simple example. Consider the following Dockerfile where we install Python on a Debian Jessie base image:
FROM debian:jessie RUN apt-get -y update && apt-get install -y python
Once that the user has written the Dockerfile, the container image can be built:
$ docker build -t hello-python .
The previous step will take the content of the Dockerfile and build our
container image. The first entry,
FROM debian:jessie, will specify a base
Linux distribution image as a starting point for our container (in this case,
Debian Jessie), the image of which Docker will try to fetch from its registry if
it is not already locally available. The second entry
RUN apt-get -y update &&
apt-get install -y python will execute the command that follows the
instruction, updating the container with the changes done to its software
environment: in this case, we are installing Python using the apt-get package
Once the build step has finished, we can list the available container images using:
$ docker images REPOSITORY TAG IMAGE ID CREATED SIZE hello-python latest 2e57316387c6 3 minutes ago 224 MB
We can now spawn a container from the image we just built (tagged as
hello-python), and run Python inside the container
python --version), so as to verify the version of the installed
$ docker run --rm hello-python python --version Python 2.7.9
One of the conveniences of building containers with Docker is that this process can be carried out solely on the user’s workstation/laptop, enabling quick iterations of building, modifying, and testing of containerized applications that can then be easily deployed on a variety of systems, greatly improving user productivity.
Reducing the size of the container image, besides saving disk space, also speeds up the process of importing it into Sarus later on. The easiest ways to limit image size are cleaning the package manager cache after installations and deleting source codes and other intermediate build artifacts when building software manually. For practical examples and general good advice on writing Dockerfiles, please refer to the official Best practices for writing Dockerfiles.
A user can also access the container interactively through a shell, enabling quick testing of commands. This can be a useful step for evaluating commands before adding them to the Dockerfile:
$ docker run --rm -it hello-python bash root@c5fc1954b19d:/# python --version Python 2.7.9 root@c5fc1954b19d:/#
2. Push the Docker image to the Docker Hub
Once the image has been built and tested, you can log in to DockerHub (requires an account) and push it, so that it becomes available from the cloud:
$ docker login $ docker push <user name>/<repo name>:<image tag>
Note that in order for the push to succeed, the image has to be
correctly tagged with the same
<repository name>/<image name>:<image tag>
identifier you intend to push to. Images can be tagged at build-time, supplying
-t option to docker build, or afterwards by using
docker tag. In the case of our example:
$ docker tag hello-python <repo name>/hello-python:1.0
The image tag (the last part of the identifier after the colon) is optional.
If absent, Docker will set the tag to
latest by default.
3. Pull the Docker image from Docker Hub
Now the image is stored in the Docker Hub and you can pull it into the HPC system using the sarus pull command followed by the image identifier:
$ sarus pull <repo name>/hello-python:1.0
While performing the pull does not require specific privileges, it is generally advisable to run sarus pull on the system’s compute nodes through the workload manager: compute nodes often have better hardware and, in some cases like Cray XC systems, large RAM filesystems, which will greatly reduce the pull process time and will allow to pull larger images.
Should you run into problems because the pulled image doesn’t fit in the default
filesystem, you can specify an alternative temporary directory with the
You can use sarus images to list the images available on the system:
$ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER <repo name>/hello-python 1.0 6bc9d2cd1831 2018-01-19T09:43:04 40.16MB index.docker.io
4. Run the image at scale with Sarus
Once the image is available to Sarus we can run it at scale using the workload manager. For example, if using SLURM:
$ srun -N 1 sarus run <repo name>/hello-python:1.0 python --version Python 2.7.9
As with Docker, containers can also be used through a terminal, enabling quick testing of commands:
$ srun -N 1 --pty sarus run -t debian bash $ cat /etc/os-release PRETTY_NAME="Debian GNU/Linux 9 (stretch)" NAME="Debian GNU/Linux" VERSION_ID="9" VERSION="9 (stretch)" ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/" $ exit
--pty option to
srun and the
-t/--tty option to
are needed to properly setup the pseudo-terminals in order to achieve a familiar
You can tell the previous example was run inside a container by querying the specifications of your host system. For example, the OS of Cray XC compute nodes is based on SLES and not Debian:
$ srun -N 1 cat /etc/os-release NAME="SLES" VERSION="12" VERSION_ID="12" PRETTY_NAME="SUSE Linux Enterprise Server 12" ID="sles" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:suse:sles:12"
Pulling images from private or 3rd party registries
By default, Sarus tries to pull images from DockerHub
public repositories. To perform a pull from a private repository, use the
--login option to the sarus pull command and enter your
$ srun -N 1 sarus pull --login <privateRepo>/<image>:<tag> username :user password : ...
To pull an image from a registry different from Docker Hub, enter the server address as part of the image identifier. For example, to access the NVIDIA GPU Cloud:
$ srun -N 1 sarus pull --login nvcr.io/nvidia/k8s/cuda-sample:nbody username :$oauthtoken password : ...
To work with images not pulled from Docker Hub (including the removal detailed in a later section), you need to enter the image descriptor (repository[:tag]) as displayed by the sarus images command in the first two columns:
$ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER nvcr.io/nvidia/k8s/cuda-sample nbody 29e2298d9f71 2019-01-14T12:22:25 91.88MB nvcr.io $ srun -N1 sarus run nvcr.io/nvidia/k8s/cuda-sample:nbody cat /usr/local/cuda/version.txt CUDA Version 9.0.176
Sarus has been designed to be compatible with Docker Hub, including its specific authentication scheme. To work with registries which adopt different authentication procedures under the hood (like the Google Container Registry), it is advised to complement Sarus with software provided by the container community like Skopeo (https://github.com/containers/skopeo). Skopeo is a highly versatile tool to work with container registries.
As an example, to download a GCR image requiring authentication credentials
into a tar file using the
skopeo copy command
$ skopeo copy –src-creds=<user>:<password> docker://gcr.io/<image>:<tag> docker-archive:<path>.tar
Skopeo should be able to authenticate either with username/password or with
access token (in this case enter
oauth2accesstoken as the username and
the token as the password in the
--src-creds option). More details are
available in the Skopeo documentation.
Once Skopeo has downloaded the image as a tar file, the image can be loaded into Sarus with the sarus load command.
Loading images from tar archives
If you do not have access to a remote Docker registry, or you’re uncomfortable with uploading your images to the cloud in order to pull them, Sarus offers the possibility to load images from tar archives generated by docker save.
First, save an image to a tar archive using Docker on your workstation:
$ docker save --output debian.tar debian:jessie $ ls -sh total 124M 124M debian.tar
Then, transfer the archive to the HPC system and use the sarus load command, followed by the archive filename and the descriptor you want to give to the Sarus image:
$ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER $ srun sarus load ./debian.tar my_debian > expand image layers ... > extracting : /tmp/debian.tar/7e5c6402903b327fc62d1144f247c91c8e85c6f7b64903b8be289828285d502e/layer.tar > make squashfs ... > create metadata ... # created: <user home>/.sarus/images/load/library/my_debian/latest.squashfs # created: <user home>/.sarus/images/load/library/my_debian/latest.meta $ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER load/library/my_debian latest 2fe79f06fa6d 2018-01-31T15:08:56 47.04MB load
The image is now ready to use. Notice that the origin server for the image has
load to indicate this image has been loaded from an archive.
Similarly to sarus pull, we recommend to load tar archives from
compute nodes. Should you run out of space while expanding the image,
sarus load also accepts the
--temp-dir option to specify an
alternative expansion directory.
As with images from 3rd party registries, to use or remove loaded images you need to enter the image descriptor (repository[:tag]) as displayed by the sarus images command in the first two columns.
To remove an image from Sarus’s local repository, use the sarus rmi command:
$ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER library/debian latest 6bc9d2cd1831 2018-01-31T14:11:27 40.17MB index.docker.io $ sarus rmi debian:latest removed index.docker.io/library/debian/latest $ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER
To remove images pulled from 3rd party registries or images loaded from local tar archives you need to enter the image descriptor (repository[:tag]) as displayed by the sarus images command:
$ sarus images REPOSITORY TAG DIGEST CREATED SIZE SERVER load/library/my_debian latest 2fe79f06fa6d 2018-01-31T15:08:56 47.04MB load $ sarus rmi load/library/my_debian removed load/library/my_debian/latest
Environment variables within containers are set by combining several sources, in the following order (later entries override earlier entries):
Host environment of the process calling Sarus
Environment variables defined in the container image, e.g., Docker ENV-defined variables
Modification of variables related to the NVIDIA Container Toolkit
Modifications (set/prepend/append/unset) specified by the system administrator in the Sarus configuration file. See here for details.
Environment variables defined using the
-e/--envoption of sarus run. The option can be passed multiple times, defining one variable per option. The first occurring
=(equals sign) character in the option value is treated as the separator between the variable name and its value:
$ srun sarus run -e SARUS_CONTAINER=true debian bash -c 'echo $SARUS_CONTAINER' SARUS_CONTAINER=true $ srun sarus run --env=CLI_VAR=cli_value debian bash -c 'echo $CLI_VAR' CLI_VAR=cli_value $ srun sarus run --env NESTED=innerName=innerValue debian bash -c 'echo $NESTED' NESTED=innerName=innerValue
=is not provided in the option value, Sarus considers the string as the variable name, and takes the value from the corresponding variable in the host environment. This can be used to override a variable set in the image with the value from the host. If no
=is provided and a matching variable is not found in the host environment, the option is ignored and the variable is not set in the container.
Accessing host directories from the container
System administrators can configure Sarus to automatically mount facilities like parallel filesystems into every container. Refer to your site documentation or system administrator to know which resources have been enabled on a specific system.
Mounting custom files and directories into the container
By default, Sarus creates the container filesystem environment from the image
and host system as specified by the system administrator.
It is possible to request additional paths from the host environment to be
mapped to some other path within the container using the
--mount option of
the sarus run command:
$ srun -N 1 --pty sarus run --mount=type=bind,source=/path/to/my/data,destination=/data -t debian bash
The previous command would cause
/path/to/my/data on the host to be mounted
/data within the container. This mount option can be specified multiple
times, one for each mount to be performed.
--mount accepts a comma-separated list of
<key>=<value> pairs as its
argument, much alike the Docker option with the same name (for reference, see
the official Docker documentation on bind mounts). As with Docker,
the order of the keys is not significant. A detailed breakdown of the possible
flags follows in the next subsections.
type: represents the type of the mount. Currently, only
bind(for bind-mounts) is supported.
source(required): Absolute path accessible from the user on the host that will be mounted in the container. Can alternatively be specified as
destination: Absolute path to where the filesystem will be made available inside the container. If the directory does not exist, it will be created. It is possible to overwrite other bind mounts already present in the container, however, the system administrator retains the power to disallow user-requested mounts to any location at their discretion. May alternatively be specified as
In addition to the mandatory flags, regular bind mounts can optionally add the following flag:
readonly(optional): Causes the filesystem to be mounted as read-only. This flag takes no value.
The following example demonstrates the use of a custom read-only bind mount.
$ ls -l /input_data drwxr-xr-x. 2 root root 57 Feb 7 10:49 ./ drwxr-xr-x. 23 root root 4096 Feb 7 10:49 ../ -rw-r--r--. 1 root root 1048576 Feb 7 10:49 data1.csv -rw-r--r--. 1 root root 1048576 Feb 7 10:49 data2.csv -rw-r--r--. 1 root root 1048576 Feb 7 10:49 data3.csv $ echo "1,2,3,4,5" > data4.csv $ srun -N 1 --pty sarus run --mount=type=bind,source=/input_data,destination=/input,readonly -t debian bash $ ls -l /input -rw-r--r--. 1 root 0 1048576 Feb 7 10:49 data1.csv -rw-r--r--. 1 root 0 1048576 Feb 7 10:49 data2.csv -rw-r--r--. 1 root 0 1048576 Feb 7 10:49 data3.csv -rw-r--r--. 1 root 0 10 Feb 7 10:52 data4.csv $ cat /input/data4.csv 1,2,3,4,5 $ touch /input/data5.csv touch: cannot touch '/input/data5.csv': Read-only file system $ exit
Bind-mounting FUSE filesystems into Sarus containers
By default, all FUSE filesystems are accessible only by the user who mounted them; this restriction is enforced by the kernel itself. Sarus is a privileged application setting up the container as the root user in order to perform some specific actions.
To allow Sarus to access a FUSE mount point on the host, in order to bind mount it into a container,
use the FUSE option
allow_root when creating the mount point.
For example, when creating an EncFS filesystem:
$ encfs -o allow_root --nocache $PWD/encfs.enc/ /tmp/encfs.dec/
$ sarus run -t --mount=type=bind,src=/tmp/encfs.dec,dst=/var/tmp/encfs ubuntu ls -l /var/tmp
It is possible to pass
allow_root if the option
user_allow_other is defined in
/etc/fuse.conf, as stated in the FUSE manpage.
Mounting custom devices into the container
Devices can be made available inside containers through the
of sarus run. The option can be entered multiple times, specifying
one device per option. By default, device files will be mounted into the
container at the same path they have on the host. A different destination path
can be entered using a colon (
:) as separator from the host path.
All paths used in the option value must be absolute:
$ srun sarus run --device=/dev/fuse debian ls -l /dev/fuse crw-rw-rw-. 1 root root 10, 229 Aug 17 17:54 /dev/fuse $ srun sarus run --device=/dev/fuse:/dev/container_fuse debian ls -l /dev/container_fuse crw-rw-rw-. 1 root root 10, 229 Aug 17 17:54 /dev/container_fuse
When working with device files, the
--mount option should not be used since
access to custom devices is disabled by default through the container’s device
--device option, on the other hand, will also whitelist the
requested devices in the cgroup, making them accessible within the container.
By default, the option will grant read, write and mknod permissions to devices;
this behavior can be controlled by adding a set of flags at the end of an option
value, still using a colon as separator. The flags must be a combination of the
rwm, standing for read, write and mknod access respectively;
the characters may come in any order, but must not be repeated. The access flags
can be entered independently from the presence of a custom destination path.
The full syntax of the option is thus
The following example shows how to mount a device with read-only access:
$ srun --pty sarus run -t --device=/dev/example:r debian bash $ echo "hello" > /dev/example bash: /dev/example: Operation not permitted $ exit
Sarus and the
--device option cannot grant more permissions to a device
than those which have been allowed on the host.
For example, if in the host a device is set for read-only access, then Sarus
cannot enable write or mknod access.
This is enforced by the implementation of device cgroups in the Linux kernel. For more details, please refer to the kernel documentation.
Image entrypoint and default arguments
Sarus fully supports image entrypoints and default arguments as defined by the OCI Image Specification.
The entrypoint of an image is a list of arguments that will be used as the command to execute when the container starts; it is meant to create an image that will behave like an executable file.
The image default arguments will be passed to the entrypoint if no argument is provided on the commmand line when launching a container. If the entrypoint is not present, the first default argument will be treated as the executable to run.
When creating container images with Docker, the entrypoint and default arguments are set using the ENTRYPOINT and CMD instructions respectively in the Dockerfile. For example, this file will generate an image printing arguments to the terminal by default:
FROM debian:stretch ENTRYPOINT ["/bin/echo"] CMD ["Hello world"]
After building such image (we’ll arbitrarily call it
echo) and importing it
into Sarus, we can run it without passing any argument:
$ srun sarus run <image repo>/echo Hello world
Entering a command line argument will override the default arguments passed to the entrypoint:
$ srun sarus run <image repo>/echo Foobar Foobar
The image entrypoint can be changed by providing a value to the
option of sarus run. It is important to note that when changing the
entrypoint the default arguments get discarded as well:
$ srun sarus run --entrypoint=cat <image repo>/echo /etc/os-release PRETTY_NAME="Debian GNU/Linux 9 (stretch)" NAME="Debian GNU/Linux" VERSION_ID="9" VERSION="9 (stretch)" ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
The entrypoint can be removed by passing an empty value to
is useful, for example, for inspecting and debugging containers:
$ srun --pty sarus run --entrypoint "" -t <image repo>/echo bash $ env | grep ^PATH PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin $ exit
Using the “adjacent value” style to remove an image entrypoint
--entrypoint="") is not supported. Please pass the empty string
value separated by a whitespace.
The working directory inside the container can be controlled using the
-w/--workdir option of the sarus run command:
$ srun -N 1 --pty sarus run --workdir=/path/to/workdir -t debian bash
If the path does not exist, it is created inside the container.
-w/--workdir option is not specified but the image defines
a working directory, the container process will start
there. Otherwise, the process will start in the container’s root directory (
Using image-defined working directories can be useful, for example, for
simplifying the command line when launching containers.
When creating images with Docker, the working directory is set using the WORKDIR instruction in the Dockerfile.
The PID namespace for the container can be controlled through the
option of sarus run. Currently, the supported values for the option
host: use the host’s PID namespace for the container. This allows to transparently support MPI implementations relying on the ranks having different PIDs when running on the same physical host, or using shared memory technologies like Cross Memory Attach (CMA); (default)
private: create a new PID namespace for the container. Having a private PID namespace can be also referred as “using PID namespace isolation” or simply “using PID isolation”.
$ sarus run --pid=private alpine:3.14 ps -o pid,comm PID COMMAND 1 ps
Consider using an init process when running with a private PID namespace if you need to handle signals or run many processes into the container.
Adding an init process to the container
By default, Sarus only executes the user-specified application within the container. When using a private PID namespace, the container process is assigned PID 1 in the new namespace. The PID 1 process has unique features in Linux: most notably, the process will ignore signals by default and zombie processes will not be reaped inside the container (see  ,  for further reference).
If you need to handle signals or reap zombie processes (which can be useful when
executing several different processes in long-running containers), you can use the
--init option to run an init process inside the container:
$ srun -N 1 sarus run --pid=private --init alpine:3.14 ps -o pid,comm PID COMMAND 1 init 7 ps
Sarus uses tini as its default init process.
Some HPC applications may be subject to performance losses when run with an init process. Our internal benchmarking tests with tini showed overheads of up to 2%.
Verbosity levels and help messages
To run a command in verbose mode, enter the
--verbose global option before
$ srun sarus --verbose run debian:latest cat /etc/os-release
To run a command printing extensive details about internal workings, enter the
option before the command:
$ srun sarus --debug run debian:latest cat /etc/os-release
To print a general help message about Sarus, use
To print information about a command (e.g. command-specific options), use
sarus help <command>:
$ sarus help run Usage: sarus run [OPTIONS] REPOSITORY[:TAG] [COMMAND] [ARG...] Run a command in a new container Note: REPOSITORY[:TAG] has to be specified as displayed by the "sarus images" command. Options: --centralized-repository Use centralized repository instead of the local one -t [ --tty ] Allocate a pseudo-TTY in the container --entrypoint arg Overwrite the default ENTRYPOINT of the image --mount arg Mount custom directories into the container -m [ --mpi ] Enable MPI support --ssh Enable SSH in the container
Support for container customization through hooks
Sarus allows containers to be customized by other programs or scripts leveraging the interface defined by the Open Container Initiative Runtime Specification for POSIX-platform hooks (OCI hooks for short). These customizations are especially amenable to HPC use cases, where the dedicated hardware and highly-tuned software adopted by high-performance systems are in contrast with the infrastructure-agnostic nature of software containers. OCI hooks provide a solution to open access to these resources inside containers.
The hooks that will be enabled on a given Sarus installation are configured by the system administrators. Please refer to your site documentation or your system administrator to know which hooks are available in a specific system and how to activate them.
Here we will illustrate a few cases of general interest for HPC from an end-user perspective.
Native MPI support (MPICH-based)
Sarus comes with a hook able to import native MPICH-based MPI implementations inside the container. This is useful in case the host system features a vendor-specific or high-performance MPI stack based on MPICH (e.g. Intel MPI, Cray MPT, MVAPICH) which is required to fully leverage a high-speed interconnect.
To take advantage of this feature, the MPI installed in the container (and dynamically linked to your application) needs to be ABI-compatible with the MPI on the host system. Taking as an example the Piz Daint Cray XC50 supercomputer at CSCS, to best meet the required ABI compatibility we recommend that the container application uses one of the following MPI implementations:
The following is an example Dockerfile to create a Debian image with MPICH 3.1.4:
FROM debian:jessie RUN apt-get update && apt-get install -y \ build-essential \ wget \ --no-install-recommends \ && rm -rf /var/lib/apt/lists/* RUN wget -q http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz \ && tar xf mpich-3.1.4.tar.gz \ && cd mpich-3.1.4 \ && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \ && make -j$(nproc) \ && make install \ && ldconfig \ && cd .. \ && rm -rf mpich-3.1.4 \ && rm mpich-3.1.4.tar.gz
Applications that are statically linked to MPI libraries will not work with the native MPI support provided by the hook.
Once the system administrator has configured the hook, containers with native
MPI support can be launched by passing the
--mpi option to the
sarus run command, e.g.:
$ srun -N 16 -n 16 sarus run --mpi <repo name>/<image name> <mpi_application>
NVIDIA GPU support
NVIDIA provides access to GPU devices and their driver stacks inside OCI containers through the NVIDIA Container Toolkit hook .
When Sarus is configured to use this hook, the GPU devices to be made
available inside the container can be selected by setting the
CUDA_VISIBLE_DEVICES environment variable in the host system. Such action is
often performed automatically by the workload manager or other site-specific
software (e.g. the SLURM workload manager sets
GPUs are requested via the Generic Resource Scheduling plugin). Be sure to check the setup provided by
your computing site.
The container image needs to include a CUDA runtime that is suitable for both the target container applications and the available GPUs in the host system. One way to achieve this is by basing your image on one of the official Docker images provided by NVIDIA, i.e. the Dockerfile should start with a line like this:
To check all the available CUDA images provided by NVIDIA, visit https://hub.docker.com/r/nvidia/cuda/
When developing GPU-accelerated images on your workstation, we recommend using nvidia-docker to run and test containers using an NVIDIA GPU.
SSH connection within containers
Sarus also comes with a hook which enables support for SSH connections within containers.
When Sarus is configured to use this hook, you must first run the command
sarus ssh-keygen to generate the SSH keys that will be used by the SSH
daemons and the SSH clients in the containers. It is sufficient to generate
the keys just once, as they are persistent between sessions.
It is then possible to execute a container passing the
--ssh option to
sarus run, e.g.
sarus run --ssh <image> <command>. Using the
previously generated the SSH keys, the hook will instantiate an sshd and setup a
ssh binary inside each container created with the same command.
Within a container spawned with the
--ssh option, it is possible to connect
into other containers by simply issuing the
ssh command available in the
ssh <hostname of other node>
ssh binary will take care of using the proper keys and
non-standard port in order to connect to the remote container.
ssh program is called without a command argument, it will open
a login shell into the remote container. In this situation, the SSH hook attempts
to reproduce the environment variables which were defined upon the launch
of the remote container. The aim is to replicate the experience of actually
accessing a shell in the container as it was created.
The SSH hook currently does not implement a poststop functionality and
requires the use of a private PID namespace to cleanup the Dropbear daemon.
Thus, the hook currently requires the use of a private PID namespace
for the container. Thus, the
--ssh option of sarus run implies
--pid=private, and is incompatible with the use of
OpenMPI communication through SSH
The MPICH-based MPI hook described above does not support OpenMPI libraries. As an alternative, OpenMPI programs can communicate through SSH connections created by the SSH hook.
To run an OpenMPI program using the SSH hook, we need to manually provide a list
of hosts and explicitly launch
mpirun only on one node of the allocation.
We can do so with the following commands:
salloc -C gpu -N4 -t5 srun hostname > $SCRATCH/hostfile srun sarus run --ssh \ --mount=src=/users,dst=/users,type=bind \ --mount=src=$SCRATCH,dst=$SCRATCH,type=bind \ ethcscs/openmpi:3.1.3 \ bash -c 'if [ $SLURM_PROCID -eq 0 ]; then mpirun --hostfile $SCRATCH/hostfile -npernode 1 /openmpi-3.1.3/examples/hello_c; else sleep 10; fi'
Upon establishing a remote connection, the SSH hook provides a
$PATH covering the
most used default locations:
If your OpenMPI installation uses a custom location, consider using an absolute
mpirun and the
--prefix option as advised in the
official OpenMPI FAQ.
Sarus’s source code includes a hook able to inject glibc libraries from the host inside the container, overriding the glibc of the container.
This is useful when injecting some host resources (e.g. MPI libraries) into the container and said resources depend on a newer glibc than the container’s one.
The host glibc stack to be injected is configured by the system administrator.
If Sarus is configured to use this hook, the glibc replacement can be activated
by passing the
--glibc option to sarus run. Since native MPI
support is the most common occurrence of host resources injection, the hook is
also implicitly activated when using the
Even when the hook is configured and activated, the glibc libraries in the container will only be replaced if the following conditions apply:
the container’s libraries are older than the host’s libraries;
host and container glibc libraries have the same soname and are ABI compatible.
Running MPI applications without the native MPI hook
The MPI replacement mechanism controlled by the
option is not mandatory to run distributed applications in Sarus containers. It
is possible to run containers using the MPI implementation embedded in the image,
foregoing the performance of custom high-performance hardware.
This can be useful in a number of scenarios:
the software stack in the container should not be altered in any way
impossibility to satisfy ABI compatibility for native hardware acceleration
Sarus can be launched as a normal MPI program, and the execution context will be propagated to the container application:
mpiexec -n <number of ranks> sarus run <image> <MPI application>
The important aspect to consider is that the process management system from the host must be able to communicate with the MPI libraries in the container.
MPICH-based MPI implementations by default use the Hydra process manager and the PMI-2 interface to communicate between processes. OpenMPI by default uses the OpenRTE (ORTE) framework and the PMIx interface, but can be configured to support PMI-2.
Additional information about the support provided by PMIx for containers and cross-version use cases can be found here: https://openpmix.github.io/support/faq/how-does-pmix-work-with-containers
As a general rule of thumb, corresponding MPI implementations (e.g. using an
mpiexec on the host to launch Sarus containers featuring
MPICH libraries) should work fine together. As mentioned previously, if an
OpenMPI library in the container has been configured to support PMI, the
container should also be able to communicate with an MPICH-compiled
from the host.
The following is a minimal Dockerfile example of building OpenMPI 4.0.2 with PMI-2 support on Ubuntu 18.04:
FROM ubuntu:18.04 RUN apt-get update && apt-get install -y \ build-essential \ ca-certificates \ automake \ autoconf \ libpmi2-0-dev \ wget \ --no-install-recommends \ && rm -rf /var/lib/apt/lists/* RUN wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz \ && tar xf openmpi-4.0.2.tar.gz \ && cd openmpi-4.0.2 \ && ./configure --prefix=/usr --with-pmi=/usr/include/slurm-wlm --with-pmi-libdir=/usr/lib/x86_64-linux-gnu CFLAGS=-I/usr/include/slurm-wlm \ && make -j$(nproc) \ && make install \ && ldconfig \ && cd .. \ && rm -rf openmpi-4.0.2.tar.gz openmpi-4.0.2
When running under the Slurm workload manager, the process management interface
can be selected with the
--mpi option to
srun. The following example
shows how to run the OSU point-to-point latency test
from the Sarus cookbook on CSCS’ Piz Daint Cray XC50 system without native
$ srun -C gpu -N2 -t2 --mpi=pmi2 sarus run ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency ###MPI-3.0 # OSU MPI Latency Test v5.6.1 # Size Latency (us) 0 6.82 1 6.80 2 6.80 4 6.75 8 6.79 16 6.86 32 6.82 64 6.82 128 6.85 256 6.87 512 6.92 1024 9.77 2048 10.75 4096 11.32 8192 12.17 16384 14.08 32768 17.20 65536 29.05 131072 57.25 262144 83.84 524288 139.52 1048576 249.09 2097152 467.83 4194304 881.02
Notice that an
--mpi=pmi2 option was passed to
srun but not to