CINECA DGXA100 SLURM Quick Start Guide - gfiameni/nvdoc-italy GitHub Wiki

DGX Guide to run experiments

Log in to the CINECA DGX server using your credentials

login.dgx.cineca.it

Clone the project repository from GitHub

git clone {REPO_URL}

Copy the dataset to the server

Use an SSH-based tool such as scp or sftp, or a GUI client such as WinSCP
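As a command-line alternative to a GUI client, the copy could be sketched as below. This is only an illustration: `copy_dataset`, the `DRY_RUN` switch, and the example username and paths are hypothetical, and it assumes `rsync` is installed on both your machine and the login node (`scp -r` would be the fallback).

```shell
# Hypothetical helper: copy a local dataset directory to the DGX login node.
# With DRY_RUN=1 (the default in this sketch) it only prints the command,
# so you can check it before transferring anything.
copy_dataset() {
    local src="$1" user="$2" dest="$3"
    # Build the rsync command as the function's positional parameters.
    set -- rsync -av --partial --progress "$src" "${user}@login.dgx.cineca.it:${dest}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `copy_dataset ./dataset your_username data/` prints the command, and `DRY_RUN=0 copy_dataset ./dataset your_username data/` runs it.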

Start an interactive Slurm session

Use the srun command to run parallel jobs on a cluster managed by Slurm (Reference)

Arguments:

  • -p Partition requested for the resource allocation
  • -n Number of tasks
  • --gres Generic consumable resource; specify the type (e.g. gpu) and the number of resources needed
  • -t Time limit on the job allocation
  • -A Account to which the resources are allocated
  • --nodelist Requests a specific list of hosts
  • --cpus-per-task Number of CPUs allocated to each task. Keep the number of DataLoader workers at or below this value; requesting more workers than allocated CPUs can cause errors.
  • --pty Runs task zero in pseudo-terminal mode, which gives you an interactive shell

srun -p dgx_usr_prod -n1 --gres=gpu:1 -t 02:00:00 -A IscrC_LSMAP-AI --nodelist dgx01 --cpus-per-task 16 --pty /bin/bash
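One way to keep the DataLoader worker count within the `--cpus-per-task` limit is to read it from the job environment. The sketch below assumes it runs in the job shell, before the container is started; Slurm sets `SLURM_CPUS_PER_TASK` only when `--cpus-per-task` was given, and `NUM_WORKERS` is a hypothetical variable name chosen for this example.

```shell
# Inside the allocation, Slurm exposes the --cpus-per-task value.
# Fall back to 1 when the variable is unset (for example outside a job).
NUM_WORKERS="${SLURM_CPUS_PER_TASK:-1}"
export NUM_WORKERS

echo "DataLoader workers should not exceed: ${NUM_WORKERS}"
```

Your training script could then read `NUM_WORKERS` from the environment (passing `--env NUM_WORKERS` to `enroot start` should export it inside the container), or take it through a command-line option of your own.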

Set some environment variables

export NVIDIA_DRIVER_CAPABILITIES=all

Path to user image/credentials cache

export ENROOT_CACHE_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-cache/group-$(id -g)

Path to user container storage

export ENROOT_DATA_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-data/user-$(id -u)

Path to the runtime working directory

export ENROOT_RUNTIME_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-runtime/user-$(id -u)
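The three exports above can be bundled with a directory check, as in the following sketch. It assumes the directories may not exist yet on the node's local scratch; `setup_enroot_dirs` is a hypothetical helper, not part of enroot.

```shell
# Hypothetical helper: export the three enroot paths under a common base
# directory and make sure the directories exist.
setup_enroot_dirs() {
    local base="$1"
    export ENROOT_CACHE_PATH="${base}/enroot-cache/group-$(id -g)"
    export ENROOT_DATA_PATH="${base}/enroot-data/user-$(id -u)"
    export ENROOT_RUNTIME_PATH="${base}/enroot-runtime/user-$(id -u)"
    # mkdir -p is a no-op for directories that already exist.
    mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH" "$ENROOT_RUNTIME_PATH"
}
```

On the DGX node this would be called as `setup_enroot_dirs "/raid/scratch_local/$USER/enroot/tmp"`.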

Navigate to the project folder

cd {PROJECT_NAME}

Import and create a PyTorch container

enroot import -o pytorch_2110.sqsh 'docker://@nvcr.io#nvidia/pytorch:21.10-py3'

enroot create pytorch_2110.sqsh
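By default, `enroot create` names the container after the image file with the `.sqsh` extension removed, so the name to pass to `enroot start` later is `pytorch_2110`. The sketch below derives that name and, when enroot is available, lists the existing containers to confirm it; `IMAGE_FILE` and `CONTAINER_NAME` are hypothetical variable names.

```shell
IMAGE_FILE="pytorch_2110.sqsh"

# Strip the .sqsh suffix to get the default container name.
CONTAINER_NAME="$(basename "$IMAGE_FILE" .sqsh)"
echo "Container name: ${CONTAINER_NAME}"

# Only query enroot where it is installed (for example on the DGX node).
if command -v enroot >/dev/null 2>&1; then
    enroot list
fi
```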

Start the container with the defined settings

Use enroot to turn the container image into an unprivileged sandbox (Reference)

Arguments:

  • --mount Mounts a host path inside the container (here, the current directory)
  • --root Remaps the current user to root inside the container
  • --env Exports an environment variable inside the container
  • --rw Makes the container root filesystem writable

enroot start --mount $PWD:/{PROJECT_NAME} --root --env NVIDIA_DRIVER_CAPABILITIES --rw pytorch_2110

Navigate to the project folder inside the container

cd /{PROJECT_NAME}

Launch your experiments

python main.py