# How to use the Sbatch File for ASR Training
## Overview of the `sbatch` File

This `sbatch` file is designed to run a distributed training process using `torchrun` on the NPL partition of the cluster. It uses 8 GPUs per node and runs on at most one node.
Key Features:
- Allocates computational resources (GPUs, CPUs, and runtime).
- Sets up a virtual environment for Python dependencies.
- Configures distributed training parameters (e.g., world size and master address).
- Uses `torchrun` to coordinate distributed training.
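
For orientation, here is a minimal sketch of what such a script can look like. The job name, partition, mail address, environment name, project path, and rendezvous ID match the defaults discussed in the steps below; the CPU/time values, `--mail-type` setting, rendezvous port, and the `train.py` entry point are illustrative assumptions, not the repository's actual values.

```bash
#!/bin/bash
#SBATCH --job-name=large-npl
#SBATCH --partition=npl-2024
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8          # one task per GPU
#SBATCH --gres=gpu:8                 # 8 GPUs per node, as described above
#SBATCH --cpus-per-task=4            # illustrative CPU count
#SBATCH --time=06:00:00              # illustrative runtime limit
#SBATCH --mail-type=ALL              # illustrative notification setting
#SBATCH --mail-user=[email protected]

# Environment setup (see "Environment Setup" below)
source activate asr
cd scratch-shared/partial-asr

# Derive the rendezvous endpoint from the Slurm allocation
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Launch one process per GPU via torchrun (see "Torchrun Settings" below)
torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node="$SLURM_NTASKS_PER_NODE" \
  --rdzv_id=456 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${head_node}:29500" \
  train.py   # illustrative entry point; use the repo's actual training script
```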
## Steps to Use

### 1. Submit the Job
- Ensure the `sbatch` file is saved in your project directory (e.g., the parent of `scratch-shared/`).
- Submit the job using the following command:
```bash
sbatch srun_multinode_npl.sh
```
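
After submitting, you can confirm the job is queued or running with standard Slurm commands:

```bash
# List your jobs currently in the queue
squeue -u "$USER"

# Show details for a specific job (replace <jobid> with the ID printed by sbatch)
scontrol show job <jobid>
```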
### 2. Modify Required Sections
To customize the script for your specific setup, adjust the following sections:
#### Job Metadata
- `--job-name=large-npl`: Replace `large-npl` with a name for your job to make it identifiable in the queue.
- `--partition=npl-2024`: Verify the partition name and replace it with one available on your cluster.
- `--mail-user=[email protected]`: Replace with your email address to receive job notifications.
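
Assuming the directives use the standard `#SBATCH` form, the customized metadata block might look like this (the job name and the `--mail-type` value are illustrative):

```bash
#SBATCH --job-name=my-asr-run          # a recognizable name for the queue
#SBATCH --partition=npl-2024           # list available partitions with: sinfo -s
#SBATCH --mail-user=[email protected]   # your address for job notifications
#SBATCH --mail-type=ALL                # assumption: notify on begin, end, and failure
```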
#### Environment Setup
- `source activate asr`: Ensure the virtual environment is named correctly. If your virtual environment has a different name, replace `asr` with it.
- `cd scratch-shared/partial-asr`: Update the path to match your project directory structure.
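
For example, the environment section might look like this (the `which python` check is an optional addition, not part of the original script):

```bash
# Activate the Python environment; replace asr with your environment's name
source activate asr

# Move into the project directory; adjust to your own layout
cd scratch-shared/partial-asr

# Optional sanity check: confirm the environment's interpreter is the one in use
which python
```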
#### Torchrun Settings
- `--nproc_per_node=$SLURM_NTASKS_PER_NODE`: Adjust if using fewer or more GPUs per node.
- `--rdzv_id=456`: Modify this rendezvous ID if running multiple jobs simultaneously to avoid conflicts.
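
One way to avoid rendezvous conflicts automatically, sketched here as an assumption rather than the script's default, is to derive the ID from `$SLURM_JOB_ID`, which is unique per submission (`train.py` and port 29500 are placeholders):

```bash
# Resolve the first node in the allocation as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node="$SLURM_NTASKS_PER_NODE" \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${head_node}:29500" \
  train.py   # placeholder: use your actual training entry point
```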