Guide to GPUDrive setup on NYU HPC

🧱 Installation (first time only)

Step 1: Clone the repository

Clone the gpudrive repository into your /home/$USER directory (info on HPC directories and data management).

git clone --recursive https://github.com/Emerge-Lab/gpudrive.git

Step 2: Navigate to repository

Move into the cloned repository folder:

cd gpudrive

Step 3: Set up overlay image

  • Create a directory for overlay files in the scratch directory:
mkdir -p /scratch/$USER/images/gpudrive
cd /scratch/$USER/images/gpudrive
  • Copy and decompress the overlay image:
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz

This may take a couple of minutes.

  • Verify the decompressed overlay image exists:
ls /scratch/$USER/images/gpudrive
What if I want to use a different overlay image?

To explore all available overlay images:

ls -l /scratch/work/public/overlay-fs-ext3/
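
For example, to use a smaller overlay, copy and decompress a different image instead (the file name below is illustrative; pick whichever size you need from the listing above, and adjust the --overlay paths in the Singularity commands below to match):

cp /scratch/work/public/overlay-fs-ext3/overlay-25GB-500K.ext3.gz /scratch/$USER/images/gpudrive/
cd /scratch/$USER/images/gpudrive && gunzip overlay-25GB-500K.ext3.gz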

Step 4: Request a GPU

srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<ASK> --pty /bin/bash

Output:

>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources

You will see something like:

[08:33:52 Wed Dec 25 2024] <netid>@<compute-node> ~/gpudrive

Ask Eugene for your account code if you don't have one yet.
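
Once the shell opens on the compute node, you can optionally sanity-check the allocation with standard Slurm commands:

hostname          # should print a compute node, not a login node
squeue -u $USER   # your interactive job should show as running (R)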

Step 5: Launch Singularity container

Navigate back to main repo:

cd /home/$USER/gpudrive

Run the following to start the container with GPU support (--nv) and the overlay image mounted read-write (:rw), so you can install packages into the overlay:

singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash

You should see:

Singularity> 

Details on Singularity and overlay images on NYU HPC are available here.
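
Optionally, confirm that the GPU is visible inside the container (the --nv flag passes the host NVIDIA drivers through); this should list the GPU allocated in Step 4:

Singularity> nvidia-smi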


Step 6: Set up Python environment

Inside the Singularity container, set up the Python environment:

  • One-off step: create the conda environment with Python 3.11:

conda env create -f environment.yml

See the docs for how to set up a conda environment on Greene.

Why use conda? Currently, conda is an easy way to use a Python version > 3.8.6 on the NYU HPC without Docker.

  • Activate the conda environment:
conda activate gpudrive

Now you should see:

(/scratch/username/.conda/gpudrive) Singularity>
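
As an optional sanity check, confirm that the interpreter now comes from the gpudrive environment:

Singularity> which python     # should point into the gpudrive conda environment
Singularity> python --version # should report Python 3.11.x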

Step 7: Set up GPUDrive

We use the manual install option to set up GPUDrive; see the README for details.
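
As a rough sketch only: a manual install of this kind builds the C++ simulator with CMake and then installs the Python package. Treat the exact commands and flags below as assumptions and defer to the README:

Singularity> mkdir build && cd build
Singularity> cmake .. -DCMAKE_BUILD_TYPE=Release   # exact flags: see the README
Singularity> make -j                               # builds the simulator and tests
Singularity> cd ..                                 # then install the Python package as described in the README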

If successful, you'll see:

[100%] Linking CXX executable my_tests
[100%] Built target my_tests

Step 8: Verify installation

Launch Python:

python3

Then run:

import gpudrive

If there are no errors, the installation was successful!
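
Assuming PyTorch was installed as part of the environment, you can also confirm that the GPU is reachable from Python (this should print True on a GPU node):

Singularity> python3 -c "import gpudrive; import torch; print(torch.cuda.is_available())"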

Using wandb, PufferLib, and running experiments

Set up Weights and Biases
  1. Create a wandb account and authorize it (see the login command after this list)

  2. Set trusted certificates

export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
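
For the authorization in step 1, you can log in from the command line with the standard wandb CLI; it will prompt for the API key from your wandb account page:

Singularity> wandb login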
Set up PufferLib

Install PufferLib with SSL certificate fixes:

  1. Update the certifi package
    Ensure the certifi package (which provides root certificates) is up to date:
pip install --upgrade certifi

Why?
Keeps SSL certificates current to avoid issues with secure connections.

  2. Set trusted certificates manually (if needed)
    Explicitly set the certificate bundle:
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
  3. Install PufferLib
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
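
A quick, optional import check to confirm the install succeeded:

Singularity> python -c "import pufferlib"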
Run Self-Play PPO

Use the --help flag to see the configurable CLI arguments:

Singularity> python baselines/ppo/ppo_pufferlib.py --help

⚡️ Usage | Interactive node


What is an interactive job and when should I use it?


In short, use interactive nodes for code development and testing.

Steps:

  1. Request an interactive compute node, e.g.:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<account_number> --pty /bin/bash

Replace <account_number> with your project number.

  2. Navigate to the repository:
cd /home/$USER/gpudrive
  3. Launch the Singularity image:
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
  4. Activate the conda environment:
conda activate gpudrive 
  5. Run experiments!

To run the PufferLib PPO implementation, install PufferLib first (it is not included in the conda environment):

pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
python baselines/ippo/ippo_pufferlib.py

🚀 Usage | sbatch


What is sbatch and when should I use it?


In short, use sbatch for large runs, such as hyperparameter sweeps.

Steps:

  1. [Optional] Define the run configurations and hyperparameters to sweep over in generate_sbatch.py. Running it writes an sbatch script.
python examples/experiments/scripts/generate_sbatch.py
  2. Submit the sbatch jobs using:
sbatch <your_sbatch_script>.sh

Job arrays launch all of the specified runs at once.
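
If you have not used job arrays before, the generated script looks roughly like the sketch below. Everything in it is illustrative: the job name, array size, resources, account, and how the swept hyperparameters reach the training command are assumptions; use the script that generate_sbatch.py actually writes for you.

#!/bin/bash
# Illustrative sketch of a job-array sbatch script. The resources, array size,
# account, and training entry point are assumptions; defer to generate_sbatch.py.
#SBATCH --job-name=gpudrive_sweep
#SBATCH --array=0-3
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10GB
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --account=<account_number>

# Each array task handles one run configuration; the generated script wires the
# swept hyperparameters into the training command's CLI arguments.
echo "Running array task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID"

cd /home/$USER/gpudrive

# Depending on how conda was set up inside the overlay, you may need to source
# its init script before 'conda activate' works in a non-interactive shell.
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
  /scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
  /bin/bash -c "conda activate gpudrive && python baselines/ippo/ippo_pufferlib.py"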

Help


If you run into issues with any of the steps outlined above, please reach out in the Emerge Lab #code-help channel!

