# How to run on WatGPU

(Blood-Glucose-Control/nocturnal-hypo-gly-prob-forecast wiki)
Prerequisites:

- Access to the WATGPU cluster (UWaterloo CS department)

Connect to the login node:

```bash
ssh <your_username>@watgpu.cs.uwaterloo.ca
```

Note: Replace `<your_username>` with your UWaterloo username.
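Optionally, you can add an SSH config entry so you don't have to type the full hostname each time. This is a convenience sketch, not a project requirement; the `watgpu` alias is just a suggested name, and `your_username` is a placeholder.

```shell
# Optional convenience: add an SSH alias for the WatGPU login node.
# The "watgpu" alias name and "your_username" are placeholders.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host watgpu
    HostName watgpu.cs.uwaterloo.ca
    User your_username
EOF
```

After this, `ssh watgpu` connects you to the cluster.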
```bash
# Navigate to home directory (REQUIRED - must be in HOME directory)
cd ~

# Clone the repository
git clone https://github.com/Blood-Glucose-Control/nocturnal-hypo-gly-prob-forecast.git

# Enter project directory
cd nocturnal-hypo-gly-prob-forecast

# Create virtual environment with Python 3.11
python3.11 -m venv .noctprob-venv

# Activate the virtual environment
source .noctprob-venv/bin/activate
```

Note: The server comes with a base conda environment that includes Python 3.11.

```bash
# Install required packages
pip install -r requirements.txt

# Install the project package in development mode
pip install -e .
```
- NEVER RUN SCRIPTS DIRECTLY ON THE WATGPU LOGIN SERVER
- The login server is for job submission only
- All script execution must go through `sbatch`
- Reference: How to submit a job
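To make the `sbatch` rule concrete, here is a minimal, illustrative batch script. The resource values, the `example_job.sh` filename, and the bare `python run_model.py` invocation are placeholders for this sketch, not the project's actual settings (those live in `job.sh`):

```shell
# Write a minimal SLURM batch script (all values here are illustrative).
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
python run_model.py
EOF

# On WatGPU you would then submit it from the login node with:
#   sbatch example_job.sh
```

The point is that the login node only runs `sbatch`; the script body itself executes on a compute node.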
All scripts are located at `~/nocturnal-hypo-gly-prob-forecast/scripts/watgpu/`.

Key files:

- `job.sh`: Configures YAML files and run resources
- `run_model.py`: Entry point for the benchmark
Resource and YAML Configuration:

```bash
declare -A job_specs=(
    ["0_naive_05min.yaml"]="1 4 02:00:00"
    ["0_naive_15min.yaml"]="1 3 02:00:00"
)
```

Format: `[yaml_file]="cores memory(GB) time(HH:MM:SS)"`

Note: The queue time limit is 7 days maximum.
Email Notification:

```bash
email="[email protected]"
```

Set `email` to your own address to receive job status notifications.
Run Description:

```bash
description="This run evaluates the impact of removing exogenous variables (IOB and COB)
to determine if there is any performance degradation compared to baseline."
```

Add a clear explanation of:

- The purpose of this run
- Why you're running this experiment
- Key changes from previous runs
Submit the job:

```bash
cd ~/nocturnal-hypo-gly-prob-forecast/scripts/watgpu/
bash job.sh
```

You'll receive a job ID after submission (e.g., `Submitted batch job 12345`).
Log Files:

- Located in `scripts/watgpu/`
- `JOB<jobid>.out`: Standard output
- `JOB<jobid>.err`: Error messages
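A quick way to check the `.err` logs after a run is a small helper like the one below. This function is hypothetical, not part of the repository; the failure patterns it greps for are just common Python markers:

```shell
# Hypothetical helper (not in the repo): scan JOB<jobid>.err files in the
# current directory for common failure markers.
scan_job_logs() {
    local errfile
    for errfile in JOB*.err; do
        [ -e "$errfile" ] || continue   # no logs yet
        if grep -qiE 'error|traceback' "$errfile"; then
            echo "check $errfile"
        fi
    done
    return 0
}

# Run it from scripts/watgpu/ after a job finishes:
scan_job_logs
```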
Results Directory:

Check `results/processed/` for a timestamped folder containing:

- Configuration details
- Performance metrics from different scorers
- Folder name includes the run timestamp
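Since folders are named by run timestamp, the most recent run can be found by sorting on modification time. A small sketch (assuming the `results/processed/` layout described above):

```shell
# List the most recent results folder by modification time.
# Assumes the results/processed/ layout described above.
latest=$(ls -td results/processed/*/ 2>/dev/null | head -n 1)
echo "latest run: $latest"
```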
CPU Status:

```bash
sinfo -o "%C"
```

Output shows `CPUS(A/I/O/T)`:

- A: Allocated (in use)
- I: Idle (available)
- O: Other (down/maintenance)
- T: Total CPUs
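To make the `A/I/O/T` field concrete, here is a sketch that splits a sample value into its four parts. The numbers are invented for illustration, not live cluster output:

```shell
# Split a sample CPUS(A/I/O/T) value into its four fields.
# "10/54/0/64" is an invented example, not real WatGPU output.
cpus="10/54/0/64"
IFS=/ read -r alloc idle other total <<< "$cpus"
echo "allocated=$alloc idle=$idle other=$other total=$total"
```

A value like this would mean 10 CPUs in use, 54 idle, none down, 64 total.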
GPU Status:

```bash
sinfo -o "%n %G"
```

Shows available GPUs per node.

Memory Status:

```bash
sinfo -o "%n %m"
```

Shows memory (MB) per node.
View Your Jobs:

```bash
# Basic job status
squeue -u $USER

# Detailed job information
squeue -o "%.18i %.9P %.15j %.8u %.2t %.10M %.6D %C %.6m" | grep $USER
```

Shows: JobID, Partition, JobName, User, State, Time, Nodes, CPUs, Memory

Control Jobs:

```bash
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER
```