Mount hook

The source code of Sarus includes a hook able to perform an arbitrary sequence of bind mounts and device mounts (including device whitelisting in the related cgroup) into a container.

When activated, the hook enters the mount namespace of the container and performs the mounts it receives as CLI arguments. The formats of such arguments are the same as those of the --mount and --device options of sarus run. This design choice has several advantages:

  • Reuses established formats adopted by popular container tools.

  • Explicitly states mount details, providing more clarity than lists separated by colons or semicolons (for example, the path lists used by other hooks).

  • Reduces the effort to go from experimentation to consolidation: the mounts for a feature can be explored and prototyped on the Sarus command line, then transferred directly into a Mount hook configuration file.

In effect, the hook produces the same outcome as entering its --mount and --device option arguments on the command line of Sarus (or of other engines with a similar CLI, like Docker and Podman).
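
For instance, here is a sketch of an equivalent command line, assuming a hypothetical host library /usr/lib/libexample.so.1 and a hypothetical host device /dev/example (the same examples used in the configuration file below):

$ sarus run \
      --mount=type=bind,src=/usr/lib/libexample.so.1,dst=/usr/local/lib/libexample.so.1 \
      --device=/dev/example:rw \
      debian:11 bash

A Mount hook configured with the same arguments performs these mounts automatically, without the user having to type them.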

However, the hook provides a way to group together the definition and execution of the mounts related to a specific feature. By doing so, the complexity of the feature is abstracted away from the user, and feature activation becomes either convenient (e.g. via a single CLI option) or completely transparent (e.g. if the hook is always active, or if it relies on an OCI annotation from the container image). Some example use cases are described in this section.

Note

Compared to the MPI or Glibc hooks, the Mount hook does not check ABI or version compatibility of mounted resources, and it does not deduce on its own the mount destination paths within the container, since its purpose is not strictly tied to replacing library stacks.

Hook installation

The hook is written in C++ and is compiled when building Sarus, without the need for additional dependencies. Sarus' installation scripts also automatically install the hook in the $CMAKE_INSTALL_PREFIX/bin directory. In short, no specific action is required to install the Mount hook.

Hook configuration

The program is meant to be run as a prestart hook and accepts option arguments in the same formats as the --mount and --device options of sarus run.

The hook also supports the following environment variables:

  • LDCONFIG_PATH (optional): Absolute path to a trusted ldconfig program on the host. If set, the program at the path is used to update the container's dynamic linker cache after performing the mounts.
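
    Conceptually, this update is equivalent to running the trusted ldconfig against the container's root filesystem, e.g. (a conceptual sketch, not necessarily the hook's exact invocation; the rootfs path is hypothetical):

    $ /sbin/ldconfig -r /path/to/container/rootfs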

The following is an example of an OCI hook JSON configuration file enabling the Mount hook:

{
    "version": "1.0.0",
    "hook": {
        "path": "/opt/sarus/bin/mount_hook",
        "args": ["mount_hook",
            "--mount=type=bind,src=/usr/lib/libexample.so.1,dst=/usr/local/lib/libexample.so.1",
            "--mount=type=bind,src=/etc/configs,dst=/etc/host_configs,readonly",
            "--device=/dev/example:rw"
        ],
        "env": [
            "LDCONFIG_PATH=/sbin/ldconfig"
        ]
    },
    "when": {
        "always": true
    },
    "stages": ["prestart"]
}
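
Since the "when" condition above is set to "always", the hook runs for every container launched through Sarus. As a quick sketch of the effect, the bind-mounted resources from the configuration above (hypothetical paths) should appear at their destination paths inside any container:

$ sarus run debian:11 ls /usr/local/lib/libexample.so.1 /etc/host_configs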

Example use cases

Libfabric provider injection

Libfabric is a communication framework which can be used as middleware to abstract network hardware from an MPI implementation. Access to different fabrics is enabled through dedicated software components called libfabric providers.

Fabric provider injection [1] consists of bind mounting a dynamically-linked provider and its dependencies into a container, so that containerized applications can access a high-performance fabric which is not supported by the original container image. For a formal introduction, evaluation, and discussion of the advantages of this approach, please refer to the reference publication.

To facilitate the implementation of fabric provider injection, the Mount hook supports the <FI_PROVIDER_PATH> wildcard (angle brackets included) in --mount arguments. FI_PROVIDER_PATH is an environment variable recognized by libfabric itself, which can be used to control the path where libfabric searches for external, dynamically-linked providers. The wildcard is recognized by the hook while acquiring its CLI arguments, and is substituted with a path determined through the following conditions:

  • If the FI_PROVIDER_PATH environment variable exists within the container, its value is taken.

  • If FI_PROVIDER_PATH is unset or empty in the container's environment, and the LDCONFIG_PATH variable is configured for the hook, then the hook searches for a libfabric library in the container's dynamic linker cache and obtains its installation path. The wildcard value is then set to "libfabric library install path"/libfabric, which is the default search path used by libfabric. For example, if libfabric is located at /usr/lib64/libfabric.so.1, the wildcard value will be /usr/lib64/libfabric (a sketch of this lookup follows the list).

  • If a value cannot be determined with the previous methods, the wildcard value defaults to /usr/lib.
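
As an illustration of the second condition, the lookup performed by the hook corresponds roughly to the following command run against the container's dynamic linker cache (the libfabric location shown is hypothetical):

$ ldconfig -p | grep libfabric.so
        libfabric.so.1 (libc6,x86-64) => /usr/lib64/libfabric.so.1

Here the library resides in /usr/lib64, so the wildcard would resolve to /usr/lib64/libfabric.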

The following is an example of a hook configuration file using the wildcard to inject the GNI provider, enabling access to the Cray Aries high-speed interconnect on a Cray XC50 supercomputer:

{
    "version": "1.0.0",
    "hook": {
        "path": "/opt/sarus/default/bin/mount_hook",
        "args": ["mount_hook",
            "--mount=type=bind,src=/usr/local/libfabric/1.18.0/lib/libfabric/libgnix-fi.so,dst=<FI_PROVIDER_PATH>/libgnix-fi.so",
            "--mount=type=bind,src=/opt/cray/xpmem/default/lib64/libxpmem.so.0,dst=/usr/lib/libxpmem.so.0",
            "--mount=type=bind,src=/opt/cray/ugni/default/lib64/libugni.so.0,dst=/usr/lib64/libugni.so.0",
            "--mount=type=bind,src=/opt/cray/udreg/default/lib64/libudreg.so.0,dst=/usr/lib64/libudreg.so.0",
            "--mount=type=bind,src=/opt/cray/alps/default/lib64/libalpsutil.so.0,dst=/usr/lib64/libalpsutil.so.0",
            "--mount=type=bind,src=/opt/cray/alps/default/lib64/libalpslli.so.0,dst=/usr/lib64/libalpslli.so.0",
            "--mount=type=bind,src=/opt/cray/wlm_detect/default/lib64/libwlm_detect.so.0,dst=/usr/lib64/libwlm_detect.so.0",
            "--mount=type=bind,src=/var/opt/cray/alps,dst=/var/opt/cray/alps",
            "--mount=type=bind,src=/etc/opt/cray/wlm_detect,dst=/etc/opt/cray/wlm_detect",
            "--mount=type=bind,src=/opt/gcc/10.3.0/snos/lib64/libatomic.so.1,dst=/usr/lib/libatomic.so.1",
            "--device=/dev/kgni0",
            "--device=/dev/kdreg",
            "--device=/dev/xpmem"
        ],
        "env": [
            "LDCONFIG_PATH=/sbin/ldconfig"
        ]
    },
    "when": {
        "annotations": {
            "^com.hooks.mpi.enabled$": "^true$",
            "^com.hooks.mpi.type$": "^libfabric$"
        }
    },
    "stages": ["prestart"]
}
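
With the "when" conditions above, the hook is activated only for containers carrying matching annotations, for example:

$ sarus run --annotation=com.hooks.mpi.enabled=true \
      --annotation=com.hooks.mpi.type=libfabric \
      <image> <application>

Note that the entries under "annotations" are matched as regular expressions, hence the anchoring with ^ and $ in the configuration file above.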

Accessing a host Slurm WLM from inside a container

The Slurm workload manager of the host system can be exposed within containers through a set of bind mounts. Doing so enables containers to submit new allocations and jobs to the cluster, opening up the possibility of more elaborate workflows.

The key components to bind mount are the binaries for the Slurm commands, the host Slurm configuration, the MUNGE socket, and any related dependencies. The following is an example of a hook configuration file enabling access to the host Slurm WLM on a Cray XC50 system at CSCS:

{
    "version": "1.0.0",
    "hook": {
        "path": "/opt/sarus/default/bin/mount_hook",
        "args": ["mount_hook",
            "--mount=type=bind,src=/usr/bin/salloc,dst=/usr/bin/salloc",
            "--mount=type=bind,src=/usr/bin/sbatch,dst=/usr/bin/sbatch",
            "--mount=type=bind,src=/usr/bin/sinfo,dst=/usr/bin/sinfo",
            "--mount=type=bind,src=/usr/bin/squeue,dst=/usr/bin/squeue",
            "--mount=type=bind,src=/usr/bin/srun,dst=/usr/bin/srun",
            "--mount=type=bind,src=/etc/slurm,dst=/etc/slurm",
            "--mount=type=bind,src=/usr/lib64/slurm,dst=/usr/lib64/slurm",
            "--mount=type=bind,src=/var/run/munge,destination=/run/munge",
            "--mount=type=bind,src=/usr/lib64/libmunge.so.2,dst=/usr/lib64/libmunge.so.2",
            "--mount=type=bind,src=/opt/cray/alpscomm/default/lib64/libalpscomm_sn.so.0,dst=/usr/lib64/libalpscomm_sn.so.0",
            "--mount=type=bind,src=/opt/cray/alpscomm/default/lib64/libalpscomm_cn.so.0,dst=/usr/lib64/libalpscomm_cn.so.0",
            "--mount=type=bind,src=/opt/cray/swrap/default/lib64/libswrap.so.0,dst=/usr/lib64/libswrap.so.0",
            "--mount=type=bind,src=/opt/cray/socketauth/default/lib64/libsocketauth.so.0,dst=/usr/lib64/libsocketauth.so.0",
            "--mount=type=bind,src=/opt/cray/comm_msg/default/lib64/libcomm_msg.so.0,dst=/usr/lib64/libcomm_msg.so.0",
            "--mount=type=bind,src=/opt/cray/sysadm/default/lib64/libsysadm.so.0,dst=/usr/lib64/libsysadm.so.0",
            "--mount=type=bind,src=/opt/cray/codbc/default/lib64/libcodbc.so.0,dst=/usr/lib64/libcodbc.so.0",
            "--mount=type=bind,src=/opt/cray/nodeservices/default/lib64/libnodeservices.so.0,dst=/usr/lib64/libnodeservices.so.0",
            "--mount=type=bind,src=/opt/cray/sysutils/default/lib64/libsysutils.so.0,dst=/usr/lib64/libsysutils.so.0",
            "--mount=type=bind,src=/opt/cray/pe/atp/libAtpDispatch.so,dst=/opt/cray/pe/atp/libAtpDispatch.so",
            "--mount=type=bind,src=/opt/cray/pe/atp/3.14.5/slurm/libAtpSLaunch.so,dst=/opt/cray/pe/atp/3.14.5/slurm/libAtpSLaunch.so",
            "--mount=type=bind,src=/usr/lib64/libxmlrpc-epi.so.0,dst=/usr/lib64/libxmlrpc-epi.so.0",
            "--mount=type=bind,src=/usr/lib64/libodbc.so.2,dst=/usr/lib64/libodbc.so.2",
            "--mount=type=bind,src=/usr/lib64/libexpat.so.1,dst=/usr/lib64/libexpat.so.1",
            "--mount=type=bind,src=/usr/lib64/libltdl.so.7,dst=/usr/lib64/libltdl.so.7",
            "--mount=type=bind,src=/opt/cray/job/default/lib64/libjob.so.0,dst=/usr/lib64/libjob.so.0",
            "--mount=type=bind,src=/opt/cray/job/default/lib64/libjobctl.so.0,dst=/usr/lib64/libjobctl.so.0",
            "--mount=type=bind,src=/opt/cray/ugni/default/lib64/libugni.so.0,dst=/usr/lib64/libugni.so.0",
            "--mount=type=bind,src=/usr/lib64/libjansson.so.4,dst=/usr/lib64/libjansson.so.4",
            "--mount=type=bind,src=/opt/cscs/jobreport/jobreport.so,dst=/opt/cscs/jobreport/jobreport.so",
            "--mount=type=bind,src=/opt/cscs/nohome/nohome.so,dst=/opt/cscs/nohome/nohome.so",
            "--mount=type=bind,src=/usr/lib64/libslurm.so.36,dst=/usr/lib64/libslurm.so.36",
            "--mount=type=bind,src=/usr/lib64/libcurl.so.4,dst=/usr/lib64/libcurl.so.4",
            "--mount=type=bind,src=/usr/lib64/libnghttp2.so.14,dst=/usr/lib64/libnghttp2.so.14",
            "--mount=type=bind,src=/usr/lib64/libssh.so.4,dst=/usr/lib64/libssh.so.4",
            "--mount=type=bind,src=/usr/lib64/libpsl.so.5,dst=/usr/lib64/libpsl.so.5",
            "--mount=type=bind,src=/usr/lib64/libssl.so.1.1,dst=/usr/lib64/libssl.so.1.1",
            "--mount=type=bind,src=/usr/lib64/libcrypto.so.1.1,dst=/usr/lib64/libcrypto.so.1.1",
            "--mount=type=bind,src=/usr/lib64/libldap_r-2.4.so.2,dst=/usr/lib64/libldap_r-2.4.so.2",
            "--mount=type=bind,src=/usr/lib64/liblber-2.4.so.2,dst=/usr/lib64/liblber-2.4.so.2",
            "--mount=type=bind,src=/usr/lib64/libsasl2.so.3,dst=/usr/lib64/libsasl2.so.3",
            "--mount=type=bind,src=/usr/lib64/libyaml-0.so.2,dst=/usr/lib64/libyaml-0.so.2"
        ],
        "env": [
            "LDCONFIG_PATH=/sbin/ldconfig"
        ]
    },
    "when": {
        "annotations": {
            "^com.hooks.slurm.activate$": "^true$"
        }
    },
    "stages": ["prestart"]
}

The following is an example usage of the hook as configured above:

$ srun --pty sarus run --annotation=com.hooks.slurm.activate=true -t debian:11 bash

nid00040:/$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

nid00040:/$ srun --version
slurm 20.11.8

nid00040:/$ squeue -u <username>
   JOBID        USER    ACCOUNT   NAME  ST REASON  START_TIME  TIME  TIME_LEFT  NODES  CPUS
  714067  <username>  <account>  sarus   R None      12:48:41  0:40      59:20      1    24

nid00040:/$ salloc -N4 /bin/bash
salloc: Waiting for resource configuration
salloc: Nodes nid0000[0-3] are ready for job

nid00040:/$ srun -n4 hostname
nid00002
nid00003
nid00000
nid00001

# exit from inner Slurm allocation
nid00040:/$ exit
# exit from the container
nid00040:/$ exit

References