Frequently asked questions
What is PMIx?
Check the Slurm MPI Users Guide. With PMIx, Slurm is responsible for launching the tasks, and mpirun is not needed.
How do I configure pyxis for multi-node workloads through PMIx?
Make sure you configure enroot with the extra PMIx hook, as described in the enroot configuration page. If it doesn't work, check the slurmd configuration.
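As a quick sanity check, you can list the MPI plugin types known to Slurm and verify that the PMIx hook is enabled. This is a minimal sketch: the hook path below assumes enroot's bundled extra hooks were installed under /usr/share/enroot/hooks.d, which may differ on your system.
# Check that the pmix plugin is listed:
$ srun --mpi=list
# Enable enroot's extra PMIx hook (run on the compute nodes):
$ sudo cp /usr/share/enroot/hooks.d/50-slurm-pmix.sh /etc/enroot/hooks.d/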
Why is MPI_Init sometimes failing under PMIx?
Under a PMIx allocation, i.e. srun --mpi=pmix, you can only do a single MPI_Init. In other words, you can't have srun execute a script that launches multiple MPI applications in sequence.
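For example, something like the following would typically fail at the second MPI_Init (mpiapp1 and mpiapp2 are hypothetical MPI applications):
$ srun --mpi=pmix --container-image=tensorflow bash -c 'mpiapp1 && mpiapp2'  # second MPI_Init fails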
Instead, you can save the container state with --container-name and then do multiple invocations of srun, one for each MPI application:
# From the login-node:
$ salloc -N2
$ srun --container-name=tf --container-image=tensorflow bash -c 'apt-get update && apt-get install -y ...'
$ srun --mpi=pmix --container-name=tf mpiapp1 ....
$ srun --mpi=pmix --container-name=tf mpiapp2 ....
Are there any known limitations when using PMIx under Slurm?
Under a PMIx allocation, you can only do a single MPI_Init (see above). In addition, MPI_Comm_spawn is known not to be available with PMIx and Slurm.
Why am I not seeing the pyxis output when using srun?
This is a known issue in older versions of Slurm when using srun --pty. We recommend using at least Slurm 20.02.5 and pyxis 0.8.1 to solve this problem.
Can I use pyxis arguments with sbatch?
You can do sbatch --container-image with pyxis 0.12. Note that the sbatch script itself will then run inside the container, so you will not be able to use srun from within the containerized sbatch script.
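A minimal sketch of a containerized batch script (the image and command are illustrative; spank options such as --container-image can also be set as #SBATCH directives, assuming a pyxis version with sbatch support):
#!/bin/bash
#SBATCH -N1
#SBATCH --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04
# The script body runs inside the container, so regular commands work,
# but srun cannot be used from here.
nvidia-smi -L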
Is there an equivalent to -p/--publish from Docker?
Enroot does not create a network namespace for the container, so you don't need to "publish" ports like with Docker. It's no different than running outside the container, or than running Docker with --network=host. However, as an unprivileged user you won't be able to listen on privileged ports (ports 1 to 1023 by default).
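For instance, a server started inside the container is reachable directly on the compute node's address. A sketch, assuming an image that ships python3 (the port and hostname are arbitrary placeholders):
$ srun --container-image=python:3 python3 -m http.server 8080
# From another machine, connect straight to the compute node:
$ curl http://<compute-node>:8080/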
Why am I getting errors when using the --export argument of Slurm?
For example, with ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot set in enroot.conf:
$ srun --export NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
slurmstepd: error: pyxis: child 1692947 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: XDG_RUNTIME_DIR: unbound variable
slurmstepd: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: child 1692966 failed with error code: 1
In this case, the issue is that --export will unset all other environment variables from the user environment and only set NVIDIA_VISIBLE_DEVICES=0. It is recommended to add the ALL option when using --export:
$ srun --export ALL,NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
pyxis: imported docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-acce903c-39ee-787e-3dbc-f1d82df43fe7)
This behavior can be surprising for users familiar with Docker, as the --export argument of Slurm does not behave like the --env argument of Docker Engine.
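For comparison, a rough sketch of the Docker side: --env only adds or overrides the given variable and leaves the rest of the container environment intact:
$ docker run --rm --env NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:12.4.1-base-ubuntu22.04 env
# PATH, HOME, etc. are still set; only NVIDIA_VISIBLE_DEVICES was added.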