How to use the Sbatch File for ASR Training

Overview of the sbatch File

This sbatch file runs a distributed training job with torchrun on the NPL partition of the cluster. It uses a single node with 8 GPUs; a sketch of the corresponding header lines appears after the feature list below.

Key Features:

  • Allocates computational resources (GPUs, CPUs, and runtime).
  • Sets up a virtual environment for Python dependencies.
  • Configures distributed training parameters (e.g., world size and master address).
  • Uses torchrun to coordinate distributed training.
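
For orientation, the resource-allocation portion of the header typically looks like the sketch below, assuming standard SLURM directives. The node, task, and GPU counts reflect the setup described above; the CPU count and wall-clock limit are placeholder values, not the script's actual settings.

    #!/bin/bash
    #SBATCH --nodes=1                # single node
    #SBATCH --ntasks-per-node=8      # one task per GPU
    #SBATCH --gres=gpu:8             # 8 GPUs per node
    #SBATCH --cpus-per-task=4        # placeholder; match your cluster's limits
    #SBATCH --time=06:00:00          # placeholder wall-clock limit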

Steps to Use

1. Submit the Job

  1. Ensure the sbatch file is saved in your project directory (e.g., the parent of scratch-shared/).
  2. Submit the job using the following command (a check to confirm it was queued is shown below):
    sbatch srun_multinode_npl.sh
    

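After submitting, you can confirm the job was queued. The command below lists your pending and running jobs; the exact columns shown vary by site.

    squeue -u $USER    # shows job ID, partition, job name, and state for your jobs
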
2. Modify Required Sections

To customize the script for your specific setup, adjust the following sections:

Job Metadata

  • job-name=large-npl: Replace large-npl with a name for your job to make it identifiable in the queue.
  • partition=npl-2024: Verify the partition name and, if necessary, replace it with one available on your cluster.
  • mail-user=[email protected]: Replace with your email address to receive job notifications. The corresponding header lines are sketched after this list.
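
Assuming standard SLURM syntax, these options correspond to header lines like the following; the --mail-type line is an assumption about how notifications are enabled and may already be set differently in the script.

    #SBATCH --job-name=large-npl            # replace with a name identifying your job
    #SBATCH --partition=npl-2024            # replace with a partition available to you
    #SBATCH [email protected]    # replace with your email address
    #SBATCH --mail-type=ALL                 # assumed; selects which events trigger mail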

Environment Setup

  • source activate asr: Ensure the environment name is correct; if your virtual environment has a different name, replace asr with it.
  • cd scratch-shared/partial-asr: Update the path to match your project directory structure. A sketch of these lines follows the list.
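
A minimal sketch of the corresponding setup lines, assuming the script activates the environment and then changes into the project directory (any module loads your cluster requires would come before this):

    source activate asr              # replace asr with your environment's name if it differs
    cd scratch-shared/partial-asr    # adjust to your project directory structure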

Torchrun Settings

  • nproc_per_node=$SLURM_NTASKS_PER_NODE: Adjust if using fewer or more GPUs per node.
  • rdzv_id=456: Modify this rendezvous ID if you run multiple jobs simultaneously, to avoid conflicts. A sketch of the full launch command follows.
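
Putting these settings together, the launch command is roughly of the following form. The training script name train.py, port 29500, the c10d rendezvous backend, and the way the master address is derived here are illustrative assumptions, not necessarily what the script actually uses.

    # Use the first allocated node as the rendezvous host (one common approach)
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

    torchrun \
        --nnodes=$SLURM_NNODES \
        --nproc_per_node=$SLURM_NTASKS_PER_NODE \
        --rdzv_id=456 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR:29500 \
        train.py

On a single node this is sufficient; with more nodes, the script would typically launch one torchrun per node (for example via srun) so that every node joins the same rendezvous.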