How to use the Sbatch File for ASR Training

Overview of the sbatch File

This sbatch file runs a distributed training job with torchrun on the NPL partition of the cluster. It uses a single node with 8 GPUs; a sketch of the corresponding header lines appears after the feature list below.

Key Features:

  • Allocates computational resources (GPUs, CPUs, and runtime).
  • Sets up a virtual environment for Python dependencies.
  • Configures distributed training parameters (e.g., world size and master address).
  • Uses torchrun to coordinate distributed training.
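
For orientation, the resource-allocation portion of the header typically looks like the sketch below, assuming standard SLURM directives. The node, task, and GPU counts reflect the setup described above; the CPU count and wall-clock limit are placeholder values, not the script's actual settings.

    #!/bin/bash
    #SBATCH --nodes=1                # single node
    #SBATCH --ntasks-per-node=8      # one task per GPU
    #SBATCH --gres=gpu:8             # 8 GPUs per node
    #SBATCH --cpus-per-task=4        # placeholder; match your cluster's limits
    #SBATCH --time=06:00:00          # placeholder wall-clock limit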

Steps to Use

1. Submit the Job

  1. Ensure the sbatch file is saved in your project directory (e.g., the parent of scratch-shared/).
  2. Submit the job using the following command (a check to confirm it was queued is shown below):
    sbatch srun_multinode_npl.sh
    

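After submitting, you can confirm the job was queued. The command below lists your pending and running jobs; the exact columns shown vary by site.

    squeue -u $USER    # shows job ID, partition, job name, and state for your jobs
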
2. Modify Required Sections

To customize the script for your specific setup, adjust the following sections:

Job Metadata

  • job-name=large-npl: Replace large-npl with a name for your job to make it identifiable in the queue.
  • partition=npl-2024: Verify the partition name and, if necessary, replace it with one available on your cluster.
  • mail-user=[email protected]: Replace with your email address to receive job notifications. The corresponding header lines are sketched after this list.
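
Assuming standard SLURM syntax, these options correspond to header lines like the following; the --mail-type line is an assumption about how notifications are enabled and may already be set differently in the script.

    #SBATCH --job-name=large-npl            # replace with a name identifying your job
    #SBATCH --partition=npl-2024            # replace with a partition available to you
    #SBATCH [email protected]    # replace with your email address
    #SBATCH --mail-type=ALL                 # assumed; selects which events trigger mail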

Environment Setup

  • source activate asr: Ensure the environment name is correct; if your virtual environment has a different name, replace asr with it.
  • cd scratch-shared/partial-asr: Update the path to match your project directory structure. A sketch of these lines follows the list.
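
A minimal sketch of the corresponding setup lines, assuming the script activates the environment and then changes into the project directory (any module loads your cluster requires would come before this):

    source activate asr              # replace asr with your environment's name if it differs
    cd scratch-shared/partial-asr    # adjust to your project directory structure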

Torchrun Settings

  • nproc_per_node=$SLURM_NTASKS_PER_NODE: Adjust if using fewer or more GPUs per node.
  • rdzv_id=456: Modify this rendezvous ID if you run multiple jobs simultaneously, to avoid conflicts. A sketch of the full launch command follows.
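
Putting these settings together, the launch command is roughly of the following form. The training script name train.py, port 29500, the c10d rendezvous backend, and the way the master address is derived here are illustrative assumptions, not necessarily what the script actually uses.

    # Use the first allocated node as the rendezvous host (one common approach)
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

    torchrun \
        --nnodes=$SLURM_NNODES \
        --nproc_per_node=$SLURM_NTASKS_PER_NODE \
        --rdzv_id=456 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR:29500 \
        train.py

On a single node this is sufficient; with more nodes, the script would typically launch one torchrun per node (for example via srun) so that every node joins the same rendezvous.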