Mamba Server

Running jobs

As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.

First, on your host machine, write a Dockerfile to run your project inside a container. Then, build the image and test that everything works locally before moving to the server.
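
Here is a minimal sketch of such a Dockerfile, assuming a Python project whose dependencies are listed in a requirements.txt; the CUDA base image tag is only an example, adapt it to your project:

# Dockerfile (minimal sketch; adapt the base image and dependencies)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies only; the code itself is mounted at /app at run time
COPY requirements.txt .
RUN pip3 install -r requirements.txt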

We recommend putting your data in a shared directory and symlinking it into your project's data folder. The commands below show how to mount volumes so the data is not copied into the container.
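
For example, assuming your datasets live in a shared directory such as /data (a hypothetical path):

mkdir -p ~/myproject/data
ln -s /data/coco ~/myproject/data/coco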

# Build the image
buildah build --layers -t myproject .

# Test variables
export CUDA_VISIBLE_DEVICES=0 # or `0,1` for specific GPUs; SLURM sets this automatically on the server

# Run the image:
#   -v .:/app/                      mounts the project directory into the container
#   -v /app/data                    anonymous volume masking the host symlinks in ./data
#   -v ./data/coco/:/app/data/coco  mounts the actual dataset (the host path resolves the symlink)
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"

After you have verified that everything works on your machine, copy the code to the server and write a Slurm job script.
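
For example, with rsync (the paths are placeholders; excluding data/ avoids copying datasets that should already be on the server):

rsync -av --exclude data/ ~/myproject/ [email protected]:~/myproject/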

#!/bin/bash
# Request 1 GPU, 16 CPU cores, and a 10-day time limit (days-hours:minutes)
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=job_name
# %x is the job name, %j is the job ID
#SBATCH --output=%x-%j.out

cd ~/myproject || exit
buildah build --layers -t myproject .

# Run the image
podman run --gpus=all -e SLURM_JOB_ID=$SLURM_JOB_ID -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  myproject bash -c "python3 train.py --gpu $CUDA_VISIBLE_DEVICES"

You can then run the job using:

sbatch job.sh

And see the running jobs using:

squeue
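
squeue shows every job in the queue; to list only your own jobs, or to cancel a job by its ID:

squeue -u $USER
scancel <job_id>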

sjm

For an easier experience, you can use willGuimont/sjm. It lets you easily create temporary job scripts and run them on a remote SLURM server.

Remote connection

  1. SSH and X11 forwarding with GLX support

    First, make sure X11 forwarding is enabled server-side. Check that these two lines are set in /etc/ssh/sshd_config:

    X11Forwarding yes
    X11DisplayOffset 10
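
    You can check both values at once with grep:

    grep -E 'X11Forwarding|X11DisplayOffset' /etc/ssh/sshd_config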
    

    If they were not, enable them and restart sshd

    sudo systemctl restart sshd
    

    Make sure xauth is available on the server or install it

    sudo dnf install xorg-x11-xauth
    

    Next, install basic utilities to test GLX and Vulkan capabilities on the server. We will need them to benchmark the remote connection's performance.

    sudo dnf install glx-utils vulkan-tools
    

    If you encounter problems, make sure that, on the client side, the server is allowed to use the display.
    Make sure the IP address of the server is valid. Use + to add a host to the trusted list and - to remove it.

    xhost +132.203.26.231
    

    Connect to the server from your client using ssh.
    Use the -X or -Y option to forward X11 through the ssh tunnel. The forwarding works even though the server is headless, but Xorg must be installed.
    The -X option will automatically set the DISPLAY environment variable. Note: the IP address of the server is subject to change, so make sure you have the latest one.

    ssh -X [email protected]
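
    Once connected, check that DISPLAY is set; with the display offset configured above, it should look something like localhost:10.0.

    echo $DISPLAY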
    

    Test that X redirection is working by executing a simple X graphical application.

    $ xterm
    

    Test GLX support with glxinfo

    glxinfo
    

    Check which GLX implementation is used by default:

    $ glxinfo | grep -i vendor
    server glx vendor string: SGI
    client glx vendor string: Mesa Project and SGI
        Vendor: Mesa (0xffffffff)
    OpenGL vendor string: Mesa
    

    Check that both the NVIDIA and Mesa implementations work for GLX passthrough.

    __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
    __GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
    

    Choose the best implementation between NVIDIA and Mesa.
    On NVIDIA GPUs, NVIDIA's implementation gives the best results.

    export __GLX_VENDOR_LIBRARY_NAME=nvidia
    glxgears
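
    glxgears prints its frame rate every few seconds, which makes it easy to compare the two implementations:

    __GLX_VENDOR_LIBRARY_NAME=mesa glxgears
    __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears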
    

    For Vulkan applications, the process is similar:

    vulkaninfo
    VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
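
    vulkaninfo is very verbose; to quickly check which device is exposed, you can filter its output:

    vulkaninfo | grep -i deviceName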