Mamba Server

Running jobs

As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.

First, on your host machine, write a Dockerfile to run your project inside a container. Then, build the image and test that everything works locally before moving to the server.
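
Here is a minimal sketch of such a Dockerfile, assuming a Python project whose dependencies are listed in a requirements.txt; the CUDA base image tag is only an example, adapt it to your project:

# Dockerfile (minimal sketch; adapt the base image and dependencies)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies only; the code itself is mounted at /app at run time
COPY requirements.txt .
RUN pip3 install -r requirements.txt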

We recommend putting your data in a shared directory and symlinking it into your project's data folder. The commands below show how to mount volumes so the data is not copied into the container.
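
For example, assuming your datasets live in a shared directory such as /data (a hypothetical path):

mkdir -p ~/myproject/data
ln -s /data/coco ~/myproject/data/coco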

# Build the image
buildah build --layers -t myproject .

# Test variables
export CUDA_VISIBLE_DEVICES=0 # or `0,1` for specific GPUs; SLURM sets this automatically on the server

# Run the image:
#   -v .:/app/                      mounts the project directory into the container
#   -v /app/data                    anonymous volume masking the host symlinks in ./data
#   -v ./data/coco/:/app/data/coco  mounts the actual dataset (the host path resolves the symlink)
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"

After you have verified that everything works on your machine, copy the code to the server and write a Slurm job script.
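
For example, with rsync (the paths are placeholders; excluding data/ avoids copying datasets that should already be on the server):

rsync -av --exclude data/ ~/myproject/ [email protected]:~/myproject/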

#!/bin/bash
# Request 1 GPU, 16 CPU cores, and a 10-day time limit (days-hours:minutes)
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=job_name
# %x is the job name, %j is the job ID
#SBATCH --output=%x-%j.out

cd ~/myproject || exit
buildah build --layers -t myproject .

# Run the image
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"

You can then run the job using:

sbatch job.sh

And see the running jobs using:

squeue
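
squeue shows every job in the queue; to list only your own jobs, or to cancel a job by its ID:

squeue -u $USER
scancel <job_id>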

sjm

For an easier experience, you can use willGuimont/sjm. It lets you easily create temporary job scripts and run them on a remote SLURM server.

Remote connection

  1. SSH and X11 forwarding with GLX support

    First, make sure X11 forwarding is enabled server-side. Check that these two lines are set in /etc/ssh/sshd_config:

    X11Forwarding yes
    X11DisplayOffset 10
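
    You can check both values at once with grep:

    grep -E 'X11Forwarding|X11DisplayOffset' /etc/ssh/sshd_config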
    

    If they were not, enable them and restart sshd

    sudo systemctl restart sshd
    

    Make sure xauth is available on the server or install it

    sudo dnf install xorg-x11-xauth
    

    Next, install basic utilities to test GLX and Vulkan capabilities on the server. We will need them to benchmark the remote connection's performance.

    sudo dnf install glx-utils vulkan-tools
    

    If you encounter problems, make sure that, on the client side, the server is allowed to use the display.
    Make sure the IP address of the server is valid. Use + to add a host to the trusted list and - to remove it.

    xhost +132.203.26.231
    

    Connect to the server from your client using ssh.
    Use the -X or -Y option to forward X11 through the ssh tunnel. The forwarding works even though the server is headless, but Xorg must be installed.
    The -X option will automatically set the DISPLAY environment variable. Note: the IP address of the server is subject to change, so make sure you have the latest one.

    ssh -X [email protected]
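
    Once connected, check that DISPLAY is set; with the display offset configured above, it should look something like localhost:10.0.

    echo $DISPLAY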
    

    Test that X redirection is working by executing a simple X graphical application.

    $ xterm
    

    Test GLX support with glxinfo

    glxinfo
    

    Check which GLX implementation is used by default:

    $ glxinfo | grep -i vendor
    server glx vendor string: SGI
    client glx vendor string: Mesa Project and SGI
        Vendor: Mesa (0xffffffff)
    OpenGL vendor string: Mesa
    

    Check that both the NVIDIA and Mesa implementations work for GLX passthrough.

    __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
    __GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
    

    Choose the best implementation between NVIDIA and Mesa.
    On NVIDIA GPUs, NVIDIA's implementation gives the best results.

    export __GLX_VENDOR_LIBRARY_NAME=nvidia
    glxgears
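
    glxgears prints its frame rate every few seconds, which makes it easy to compare the two implementations:

    __GLX_VENDOR_LIBRARY_NAME=mesa glxgears
    __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears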
    

    For Vulkan applications, the process is similar:

    vulkaninfo
    VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
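
    vulkaninfo is very verbose; to quickly check which device is exposed, you can filter its output:

    vulkaninfo | grep -i deviceName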