How to run on WatGPU - Blood-Glucose-Control/nocturnal-hypo-gly-prob-forecast GitHub Wiki

Setup Guide for Running the Benchmark on WatGPU

Prerequisites

  • Access to the WatGPU cluster (UWaterloo CS department)

Installation Guide

1. Connect to the WatGPU Server

ssh <your_username>@watgpu.cs.uwaterloo.ca

Note: Replace <your_username> with your UWaterloo username.
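
Optionally, you can save the hostname as an SSH alias on your own machine so that a shorter ssh command is enough. This is a minimal sketch assuming OpenSSH; the alias watgpu and the function add_watgpu_alias are hypothetical names chosen for illustration. Run it locally, not on the WatGPU login server.

```shell
# Hypothetical helper: append a "watgpu" host alias to an OpenSSH config file.
# Run on your local machine, not on the WatGPU login server.
add_watgpu_alias() {
  local user="$1"                         # your UWaterloo username
  local config="${2:-$HOME/.ssh/config}"  # defaults to the per-user SSH config
  mkdir -p "$(dirname "$config")"
  # Do nothing if the alias is already defined in this file
  if [ -f "$config" ] && grep -q '^Host watgpu$' "$config"; then
    echo "Host watgpu already present in $config"
    return 0
  fi
  cat >> "$config" <<EOF

Host watgpu
    HostName watgpu.cs.uwaterloo.ca
    User $user
EOF
  chmod 600 "$config"  # keep it private; ssh refuses configs writable by others
}
```

After running add_watgpu_alias <your_username> once, ssh watgpu connects to the same server as the full command above.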

2. Clone the Repository

# Go to your home directory (REQUIRED: the rest of this guide assumes the repo lives at ~/nocturnal-hypo-gly-prob-forecast)
cd ~

# Clone the repository
git clone https://github.com/Blood-Glucose-Control/nocturnal-hypo-gly-prob-forecast.git

3. Set Up Python Environment

# Enter project directory
cd nocturnal-hypo-gly-prob-forecast

# Create virtual environment with Python 3.11
python3.11 -m venv .noctprob-venv

# Activate the virtual environment
source .noctprob-venv/bin/activate

Note: The server comes with a base conda environment that includes Python 3.11.

4. Install Dependencies

# Install required packages
pip install -r requirements.txt

# Install the project package in editable (development) mode
pip install -e .

Job Submission Guidelines

⚠️ IMPORTANT: Server Usage Policy

  • NEVER RUN SCRIPTS DIRECTLY ON THE WATGPU LOGIN SERVER
  • The login server is for job submission only
  • All scripts must run as batch jobs submitted with sbatch
  • Reference: How to submit a job

Project Structure

All WatGPU scripts are located in ~/nocturnal-hypo-gly-prob-forecast/scripts/watgpu/.

Key files:

  • job.sh: Selects which YAML config files to run and the resources (cores, memory, time limit) for each
  • run_model.py: Entry point for the benchmark

Job Submission Process

1. Configure job.sh

Resource and YAML Configuration:

declare -A job_specs=(
    ["0_naive_05min.yaml"]="1 4 02:00:00"
    ["0_naive_15min.yaml"]="1 3 02:00:00"
)

Format: [yaml_file]="cores memory(GB) time(HH:MM:SS)"
Note: The queue's maximum time limit is 7 days (168:00:00 in this format).
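
How job.sh turns each entry into a submission is defined in the script itself. As a hedged illustration of the format only, the sketch below splits each "cores memory(GB) time" value, rejects time limits above the 7-day maximum, and maps the rest onto standard sbatch options (--cpus-per-task, --mem, --time). The mapping and the spec_to_sbatch_args helper are assumptions, not job.sh's actual code.

```shell
# Hypothetical helper: turn one job_specs value ("cores mem_gb HH:MM:SS")
# into sbatch resource options. Mirrors the documented format only;
# it is not the logic inside job.sh.
spec_to_sbatch_args() {
  local cores mem_gb walltime h m s
  read -r cores mem_gb walltime <<< "$1"
  IFS=: read -r h m s <<< "$walltime"
  # 7 days = 168 hours; 10# forces base 10 so values like "08" parse correctly
  if (( 10#$h * 3600 + 10#$m * 60 + 10#$s > 168 * 3600 )); then
    echo "time limit $walltime exceeds 7 days" >&2
    return 1
  fi
  printf '%s\n' "--cpus-per-task=$cores --mem=${mem_gb}G --time=$walltime"
}

declare -A job_specs=(
    ["0_naive_05min.yaml"]="1 4 02:00:00"
    ["0_naive_15min.yaml"]="1 3 02:00:00"
)

# Print the options each entry would map to (bash does not order associative arrays)
for yaml in "${!job_specs[@]}"; do
  echo "$yaml -> $(spec_to_sbatch_args "${job_specs[$yaml]}")"
done
```

Checking the limit before submitting catches a typo like 1680:00:00 locally instead of waiting for SLURM to reject or hold the job.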

Email Notification:
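
The email settings in job.sh are not reproduced on this page. For reference, SLURM's built-in mechanism is the --mail-type and --mail-user sbatch options; whether and how job.sh exposes them is not shown here, so treat this sketch (including the email placeholder and the mail_args helper) as a hypothetical illustration.

```shell
# Hypothetical: build SLURM email-notification options for sbatch.
# Replace the placeholder address with your own.
email="<your_username>@uwaterloo.ca"

mail_args() {
  # END and FAIL are standard --mail-type values; ALL also covers BEGIN, among others
  printf '%s\n' "--mail-type=END,FAIL --mail-user=$1"
}

# Example (not run here; left unquoted so the options split into separate arguments):
#   sbatch $(mail_args "$email") your_job_script.sh
```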

Run Description:

description="This run evaluates the impact of removing exogenous variables (IOB and COB)
to determine if there is any performance degradation compared to baseline."

Add a clear explanation of:

  • The purpose of this run
  • Why you're running this experiment
  • Key changes from previous runs

2. Submit the Job

cd ~/nocturnal-hypo-gly-prob-forecast/scripts/watgpu/
bash job.sh

After submission you'll see a job ID (e.g., Submitted batch job 12345).
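
If job.sh submits more than one job (for example, one per YAML file; this is an assumption about the script), you can collect every job ID from its output. This sketch relies only on sbatch's standard "Submitted batch job <id>" message; extract_job_ids is a hypothetical helper name.

```shell
# Print the job ID from every "Submitted batch job <id>" line read on stdin.
extract_job_ids() {
  awk '/^Submitted batch job [0-9]+/ { print $4 }'
}

# Example (run from scripts/watgpu/):
#   bash job.sh | tee submit.log | extract_job_ids > job_ids.txt
```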

Results Location

Log Files:

  • Located in scripts/watgpu/
  • JOB<jobid>.out: Standard output
  • JOB<jobid>.err: Error messages
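
To follow a job while it runs, you can tail its log files. A sketch assuming the JOB<jobid>.out / JOB<jobid>.err naming above; job_log_paths is a hypothetical helper, and its optional second argument is the log directory (default: the current directory, i.e. run it from scripts/watgpu/).

```shell
# Print the stdout and stderr log paths for a job ID, following the
# JOB<jobid>.out / JOB<jobid>.err naming used in scripts/watgpu/.
job_log_paths() {
  local jobid="$1" dir="${2:-.}"
  printf '%s\n' "$dir/JOB${jobid}.out" "$dir/JOB${jobid}.err"
}

# Example: follow both logs of job 12345 as they grow (Ctrl-C to stop)
#   tail -f $(job_log_paths 12345)
```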

Results Directory:
Check results/processed/ for a folder whose name includes the run timestamp. It contains:

  • Configuration details
  • Performance metrics from different scorers
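
To jump to the most recent run, you can pick the newest folder in results/processed/. This minimal sketch orders folders by modification time, so it does not depend on the exact timestamp format in the folder names; latest_results_dir is a hypothetical helper.

```shell
# Print the most recently modified subdirectory of a results directory.
latest_results_dir() {
  local root="${1:-results/processed}"
  # -t sorts newest first; -d lists the directories themselves, not their contents
  ls -td -- "$root"/*/ 2>/dev/null | head -n 1
}

# Example (from the repository root):
#   cd "$(latest_results_dir)"
```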

SLURM Reference Guide

Resource Monitoring

CPU Status:

sinfo -o "%C"

Output shows: CPUS(A/I/O/T)

  • A: Allocated (in use)
  • I: Idle (available)
  • O: Other (down/maintenance)
  • T: Total CPUs
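
To read the idle count at a glance, you can drop the header with sinfo's -h (--noheader) flag and summarize the numbers. A sketch; cpu_summary is a hypothetical helper that prints one summary per input line.

```shell
# Summarize "A/I/O/T" CPU counts read on stdin, one summary per input line.
cpu_summary() {
  awk -F/ 'NF == 4 && $4 > 0 {
    printf "idle %d of %d CPUs (%.0f%%)\n", $2, $4, 100 * $2 / $4
  }'
}

# Example:
#   sinfo -h -o "%C" | cpu_summary
```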

GPU Status:

sinfo -o "%n %G"

Shows the GPUs (generic resources, GRES) configured on each node; this is each node's total, not the number currently free

Memory Status:

sinfo -o "%n %m"

Shows each node's total memory in MB (configured memory, not currently free memory)
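
Since job_specs requests memory in GB while sinfo reports MB, converting helps you compare the two. A sketch; mem_gb_per_node is a hypothetical helper, it divides by 1024 because SLURM treats 1G as 1024 MB, and the results are node totals, not free memory.

```shell
# Convert `sinfo -h -o "%n %m"` output (node name, memory in MB) to GB.
mem_gb_per_node() {
  awk '{
    mb = $2
    gsub(/[^0-9]/, "", mb)   # drop any non-digit suffix such as "+"
    printf "%s %.1f GB\n", $1, mb / 1024
  }'
}

# Example:
#   sinfo -h -o "%n %m" | mem_gb_per_node
```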

Job Management

View Your Jobs:

# Basic job status
squeue -u $USER

# Detailed job information
squeue -u "$USER" -o "%.18i %.9P %.15j %.8u %.2t %.10M %.6D %C %.6m"

Shows: JobID, Partition, JobName, User, State, Elapsed Time, Nodes, CPUs, Requested Memory
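
When many jobs are queued, counting them by state is quicker than reading the full list. A sketch using squeue's compact state code (%t, e.g. R = running, PD = pending); job_state_counts is a hypothetical helper.

```shell
# Count state codes read on stdin (one per line) and print "STATE COUNT".
job_state_counts() {
  sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Example (-h drops the header line):
#   squeue -h -u "$USER" -o "%t" | job_state_counts
```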

Control Jobs:

# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER
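
To cancel only the jobs still waiting in the queue (for example, after spotting a mistake in job.sh) while leaving running jobs alone, scancel's --state option can restrict the cancellation. A sketch with a hypothetical cancel_pending_jobs helper and a DRY_RUN switch so you can preview the command first.

```shell
# Cancel only your pending jobs. Set DRY_RUN=1 to print the command instead.
cancel_pending_jobs() {
  local cmd=(scancel -u "$USER" --state=PENDING)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example: preview, then run
#   DRY_RUN=1 cancel_pending_jobs
#   cancel_pending_jobs
```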