CINECA DGXA100 SLURM Quick Start Guide - gfiameni/nvdoc-italy GitHub Wiki
DGX Guide to run experiments
Log in to the CINECA DGX server using your credentials
login.dgx.cineca.it
Clone the project repository from GitHub
git clone {REPO_URL}
Copy the dataset to the server
Use an SSH-based tool such as scp or rsync, or a GUI client such as WinSCP
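For a command-line copy, a minimal sketch using rsync over SSH could look like the following. The function name upload_dataset, the DRY_RUN switch, and the example paths are hypothetical; only the host name comes from this guide. By default the function prints the command so it can be checked before any data is sent.

```shell
# Hypothetical helper: copy a local dataset folder to the DGX login node.
# $1: local dataset directory, $2: your CINECA username, $3: destination directory on the server
upload_dataset() {
  local src
  local dest
  src="${1%/}/"                          # ensure exactly one trailing slash: copy the folder contents
  dest="$2@login.dgx.cineca.it:$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): only show what would be executed
    echo rsync -avP "$src" "$dest"
  else
    # -a keeps permissions and timestamps, -v is verbose, -P shows progress and resumes partial files
    rsync -avP "$src" "$dest"
  fi
}
```

Calling upload_dataset ./dataset your_username /path/on/server prints the rsync command; setting DRY_RUN=0 runs it.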
Start an interactive Slurm job
Use the srun command to run parallel jobs on a cluster managed by Slurm (Reference)
Arguments:
-p
Partition requested for the resource allocation
-n
Number of tasks
--gres
Generic consumable resources; specify the type (e.g. gpu) and the number of resources needed
-t
Time limit on the job allocation
-A
Account to which the resources are charged
--nodelist
List of hosts to request
--cpus-per-task
Number of CPUs allocated to each task. Note that if the DataLoader requests more workers than cpus-per-task, the job will raise an error.
--pty
Runs task zero in pseudo-terminal mode, which is what makes the interactive shell usable
srun -p dgx_usr_prod -n1 --gres=gpu:1 -t 02:00:00 -A IscrC_LSMAP-AI --nodelist dgx01 --cpus-per-task 16 --pty /bin/bash
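To stay within the cpus-per-task limit mentioned above, the worker count can be capped from inside the srun shell. The sketch below assumes Slurm has set SLURM_CPUS_PER_TASK, which it does when --cpus-per-task is given; the function name safe_num_workers is hypothetical.

```shell
# Hypothetical helper: print the number of DataLoader workers that fits the allocation.
# $1: number of workers you would like
# $2: CPUs available (optional; defaults to SLURM_CPUS_PER_TASK, or 1 if that is unset)
safe_num_workers() {
  local wanted
  local cpus
  wanted="$1"
  cpus="${2:-${SLURM_CPUS_PER_TASK:-1}}"
  if [ "$wanted" -gt "$cpus" ]; then
    echo "$cpus"      # cap at the number of allocated CPUs
  else
    echo "$wanted"
  fi
}
```

For example, NUM_WORKERS=$(safe_num_workers 32) yields 16 under the srun command above. How that value reaches your training script depends on the script itself.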
Set some environment variables
export NVIDIA_DRIVER_CAPABILITIES=all
Path to user image/credentials cache
export ENROOT_CACHE_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-cache/group-$(id -g)
Path to user container storage
export ENROOT_DATA_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-data/user-$(id -u)
Path to the runtime working directory
export ENROOT_RUNTIME_PATH=/raid/scratch_local/$USER/enroot/tmp/enroot-runtime/user-$(id -u)
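The three paths can also be set in one step. The sketch below wraps the exports in a hypothetical function, setup_enroot_dirs, and additionally creates the directories with mkdir -p. Whether enroot needs them to exist beforehand is an assumption; creating them is harmless either way.

```shell
# Hypothetical helper: export the enroot paths and make sure the directories exist.
# $1: base directory (optional; defaults to the scratch location used in this guide)
setup_enroot_dirs() {
  local base
  base="${1:-/raid/scratch_local/${USER:-$(id -un)}/enroot/tmp}"
  export ENROOT_CACHE_PATH="$base/enroot-cache/group-$(id -g)"      # image/credentials cache
  export ENROOT_DATA_PATH="$base/enroot-data/user-$(id -u)"         # container storage
  export ENROOT_RUNTIME_PATH="$base/enroot-runtime/user-$(id -u)"   # runtime working directory
  mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH" "$ENROOT_RUNTIME_PATH"
}
```

Calling setup_enroot_dirs with no argument reproduces the paths listed above.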
Navigate to the project folder
cd {PROJECT_NAME}
Import and create a PyTorch container
enroot import -o pytorch_2110.sqsh 'docker://@nvcr.io#nvidia/pytorch:21.10-py3'
enroot create pytorch_2110.sqsh
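When the guide is followed more than once, the import and create steps can be skipped if their results already exist. The sketch below is one possible way to do that. It names the container explicitly with --name so that the name used later by enroot start is unambiguous; the variable names are hypothetical.

```shell
IMAGE_FILE="pytorch_2110.sqsh"
# Container name derived from the image file name (pytorch_2110)
CONTAINER_NAME="$(basename "$IMAGE_FILE" .sqsh)"

if command -v enroot >/dev/null 2>&1; then
  # Download the image only if the .sqsh file is not there yet
  [ -f "$IMAGE_FILE" ] || enroot import -o "$IMAGE_FILE" 'docker://@nvcr.io#nvidia/pytorch:21.10-py3'
  # Create the container only if enroot does not list it already
  enroot list | grep -qx "$CONTAINER_NAME" || enroot create --name "$CONTAINER_NAME" "$IMAGE_FILE"
else
  echo "enroot not found: run this on the DGX compute node" >&2
fi
```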
Start the container with the defined settings
Use enroot to turn the container image into an unprivileged sandbox (Reference)
Arguments:
--mount
Mounts the current directory inside the container
--root
Remaps the current user to root inside the container
--env
Exports an environment variable into the container
--rw
Makes the container root filesystem writable
enroot start --mount $PWD:/{PROJECT_NAME} --root --env NVIDIA_DRIVER_CAPABILITIES --rw pytorch_2110
Navigate to the project folder inside the container
cd /{PROJECT_NAME}
Launch your experiments
python main.py
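The start, cd, and launch steps can also be combined into a single non-interactive command, since enroot start accepts a command after the container name. The sketch below assumes the container is named pytorch_2110 (the name of the image created earlier) and that the entry point is main.py; the function name run_experiment and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helper: start the container and run the experiment in one step.
# $1: project name, used as the mount point inside the container
# $2: container name (optional; defaults to pytorch_2110)
run_experiment() {
  local project
  local container
  project="$1"
  container="${2:-pytorch_2110}"
  # Build the argument list; the last argument is a single shell command string
  set -- enroot start --mount "$PWD:/$project" --root \
    --env NVIDIA_DRIVER_CAPABILITIES --rw "$container" \
    sh -c "cd /$project && python main.py"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): show the command; quotes around the sh -c string are not printed
    echo "$@"
  else
    "$@"
  fi
}
```

Run it from the project folder, for example DRY_RUN=0 run_experiment my_project, where my_project stands in for {PROJECT_NAME}.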