Mamba Server
Running jobs
As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.
First, on your host machine, write a Dockerfile to run your project inside a container.
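As a starting point, a minimal Dockerfile could look like the sketch below; the CUDA base image, the requirements.txt file and the train.py entry point are assumptions, adapt them to your project.
# Minimal example Dockerfile (adapt the base image and dependencies to your project)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python3", "train.py"]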
Then, build and test that everything works on your machine before testing it on the server.
We recommend putting your data in a separate directory and symlinking it to the data folder of your project.
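For example, assuming your datasets live under ~/datasets on the host (the path is hypothetical), the symlink can be created from the project root:
# Link the host dataset directory into the project's data folder
mkdir -p data
ln -s ~/datasets/coco data/coco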
The -v options below mount volumes so that the data is not copied into the container: the anonymous volume at /app/data masks the project's data folder (whose host symlinks would not resolve inside the container), while the actual dataset is bind-mounted directly at /app/data/coco.
# Build the image
buildah build --layers -t myproject .
# Test variables
export CUDA_VISIBLE_DEVICES=0 # or `0,1` to select specific GPUs; SLURM sets this automatically on the server
# Run the image
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-v .:/app/ \
-v /app/data \
-v ./data/coco/:/app/data/coco \
myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"
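Before launching a full training run, you can also check that the GPUs are visible from inside the container; this assumes the NVIDIA container toolkit is configured and that nvidia-smi is available in the image.
# Quick sanity check of GPU passthrough
podman run --rm --gpus=all myproject nvidia-smi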
After you have verified that everything works on your machine, copy the code to the server and write a Slurm job script.
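One convenient way to copy the code is rsync; the mamba host alias and the exclusion of the data folder are assumptions, adjust them to your setup.
# Copy the project to the server, excluding the (symlinked) data folder
rsync -av --exclude data/ ./myproject/ mamba:~/myproject/
The job script could then look like this: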
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=job_name
#SBATCH --output=%x-%j.out
cd ~/myproject || exit
buildah build --layers -t myproject .
# Run the image
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-v .:/app/ \
-v /app/data \
-v ./data/coco/:/app/data/coco \
myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"
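If your job needs different resources, adjust the #SBATCH directives accordingly; the values below are only examples.
#SBATCH --gres=gpu:2          # request two GPUs
#SBATCH --cpus-per-task=32    # request more CPU cores
#SBATCH --mem=64G             # request 64 GB of RAM
#SBATCH --time=2-00:00        # limit the job to 2 days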
You can then run the job using:
sbatch job.sh
And see the running jobs using:
squeue
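A few other commands are useful for monitoring and managing jobs; the job ID 12345 below is only an example.
squeue -u $USER              # show only your jobs
scancel 12345                # cancel a job by ID
tail -f job_name-12345.out   # follow the output of a running job (see --output above)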
sjm
For an easier experience, you can use willGuimont/sjm. It makes it easy to create temporary job scripts and run them on a remote SLURM server.
Remote connection
SSH and X11 forwarding with GLX support
First, make sure X11 forwarding is enabled on the server side. Check that these two lines are present in /etc/ssh/sshd_config:
X11Forwarding yes
X11DisplayOffset 10
If they were not, enable them and restart sshd:
sudo systemctl restart sshd
Make sure xauth is available on the server, or install it:
sudo dnf install xorg-x11-xauth
Next, install basic utilities to test GLX and Vulkan capabilities on the server. We will need them to benchmark the remote connection's performance.
sudo dnf install glx-utils vulkan-tools
If you encounter problems, make sure that the server is allowed to display on the client side and that the IP of the server is valid. Use + to add to the trusted list and - to remove from it.
xhost + 132.203.26.231
Connect to the server from your client using ssh, with the -X or -Y option to redirect X11 through the ssh tunnel. The redirection works even though the server is headless, but Xorg must be installed. The -X option will automatically update the DISPLAY environment variable. Note: the IP of the server is subject to change, make sure you have the latest one.
ssh -X [email protected]
Test that X redirection is working by executing a simple X graphical application:
xterm
Test GLX support with glxinfo:
glxinfo
Test which GLX implementation is used by default:
glxinfo | grep -i vendor
# server glx vendor string: SGI
# client glx vendor string: Mesa Project and SGI
# Vendor: Mesa (0xffffffff)
# OpenGL vendor string: Mesa
Check that both the NVidia and Mesa implementations work for GLX passthrough:
__GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
__GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
Choose the best implementation between NVidia and Mesa. On NVidia GPUs, NVidia's implementation gives the best results.
export __GLX_VENDOR_LIBRARY_NAME=nvidia
glxgears
For Vulkan applications the process is similar:
vulkaninfo
VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
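To avoid passing -X on every connection, X11 forwarding can also be enabled per host in the client's ~/.ssh/config; the mamba alias and user name below are placeholders, and the IP should be the server's current one.
Host mamba
    HostName 132.203.26.231
    User your_username
    ForwardX11 yes
    ForwardX11Trusted yes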