Mamba Server - norlab-ulaval/Norlab_wiki GitHub Wiki
## Running jobs
As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.
First, on your host machine, write a Dockerfile to run your project inside a container.
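As a starting point, a minimal Dockerfile sketch might look like this (the base image, file names, and dependencies are illustrative assumptions, not a prescription; adapt them to your project):

```dockerfile
# Hypothetical example; pick a base image matching your CUDA/framework needs
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN apt-get update \
    && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install dependencies first so that `buildah build --layers`
# can cache this layer between code changes
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy the project code last
COPY . .
```

Installing dependencies before copying the code keeps the dependency layer cached across rebuilds, which pairs well with the `--layers` flag used below.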
Then, build and test that everything works on your machine before testing it on the server.
We recommend putting your data in a directory and symlinking it into the `data` folder of your project. Below, we describe how to add volumes to avoid copying the data into the container.
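The recommended data layout can be sketched as follows (all paths here are hypothetical examples; adapt them to your setup):

```shell
# Hypothetical paths: datasets live in a shared directory outside the
# project, and the project's ./data folder is a symlink pointing to it
mkdir -p "$HOME/datasets/coco"
mkdir -p "$HOME/myproject"
cd "$HOME/myproject"

# -s: symbolic link, -f: replace an existing link, -n: treat an existing
# symlink to a directory as a file rather than descending into it
ln -sfn "$HOME/datasets" data
```

With this layout, mounting `./data/coco/` as a volume (as shown in the `podman run` command below) exposes the real dataset inside the container without copying it.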
```bash
# Build the image
buildah build --layers -t myproject .

# Test variables
export CUDA_VISIBLE_DEVICES=0 # or `0,1` for specific GPUs; set automatically by SLURM on the server

# Run the image
podman run --gpus=all \
    -e SLURM_JOB_ID=$SLURM_JOB_ID \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -v .:/app/ \
    -v /app/data \
    -v ./data/coco/:/app/data/coco \
    myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"
```
After verifying that everything works on your machine, copy the code to the server and write a Slurm job script.
```bash
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=job_name
#SBATCH --output=%x-%j.out

cd ~/myproject || exit

# Build the image
buildah build --layers -t myproject .

# Run the image
podman run --gpus=all \
    -e SLURM_JOB_ID=$SLURM_JOB_ID \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -v .:/app/ \
    -v /app/data \
    -v ./data/coco/:/app/data/coco \
    myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"
```
You can then submit the job using:

```bash
sbatch job.sh
```

And see the running jobs using:

```bash
squeue
```
### sjm

For an easier experience, you can use willGuimont/sjm. It lets you easily create temporary job scripts and run them on a remote SLURM server.
## Remote connection

### SSH and X11 forwarding with GLX support
First, make sure X11 forwarding is enabled server-side. Check these two lines in `/etc/ssh/sshd_config`:

```
X11Forwarding yes
X11DisplayOffset 10
```

If they were not set, enable them and restart sshd:

```bash
sudo systemctl restart sshd
```
Make sure `xauth` is available on the server, or install it:

```bash
sudo dnf install xorg-x11-xauth
```
Next, install basic utilities to test GLX and Vulkan capabilities on the server. We'll need them to benchmark the remote connection's performance.

```bash
sudo dnf install glx-utils vulkan-tools
```
If you encounter problems, make sure that on the client side the server is allowed to display, and that the IP of the server is valid. Use `+` to add a host to the trusted list and `-` to remove one:

```bash
xhost + 132.203.26.231
```
Connect to the server from your client using SSH. Use the `-X` or `-Y` option to redirect X11 through the SSH tunnel. The redirection works even though the server is headless, but Xorg must be installed. The `-X` option will automatically set the `DISPLAY` environment variable. Note: the IP of the server is subject to change; make sure you have the latest one.

```bash
ssh -X [email protected]
```
Test that X redirection is working by executing a simple X graphical application:

```bash
xterm
```
Test GLX support with `glxinfo`:

```bash
glxinfo
```
Check which GLX implementation is used by default:

```bash
glxinfo | grep -i vendor
```

Example output:

```
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa (0xffffffff)
OpenGL vendor string: Mesa
```
Check that both the NVidia and Mesa implementations work for GLX passthrough:

```bash
__GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
__GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
```
Choose the best implementation between NVidia and Mesa. On NVidia GPUs, NVidia's implementation gives the best results:

```bash
export __GLX_VENDOR_LIBRARY_NAME=nvidia
glxgears
```
For Vulkan applications, the process is similar:

```bash
vulkaninfo
VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
```