Training and Testing Data on the CCI NPL Cluster

This guide provides detailed instructions on how to train and test your models using the CCI NPL cluster at RPI. Please follow the steps carefully to ensure proper setup and execution.

Step 1: SSH into the CCI Cluster

Connect to the CCI cluster’s head node (npl01) using SSH.

```bash
ssh npl01
```

•	Note: You will be prompted for your CCI (NLUG) password.
•	MFA Prompt: Enter your Duo Mobile code when prompted for multi-factor authentication.
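If you connect frequently, an alias in `~/.ssh/config` keeps the command down to `ssh npl01`. A minimal sketch; the HostName and User values below are placeholders for your actual CCI landing-pad address and username:

```
# ~/.ssh/config -- placeholder values; substitute your real CCI hostname and username
Host npl01
    HostName npl01.cci.example.edu
    User your-cci-username
```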

Step 2: Request Resources with salloc

Run the following command to allocate resources for your training job:

```bash
salloc -N 10 -p npl-2024 --gres=gpu:8 -t 360
```

Explanation of the salloc Command:

| Flag | Description | Example Value |
| --- | --- | --- |
| `-N` | Number of nodes to request. | `-N 10` |
| `-p` | Partition to use (e.g., npl-2024). | `-p npl-2024` |
| `--gres=gpu:` | Number of GPUs per node. | `--gres=gpu:8` |
| `-t` | Maximum runtime in minutes. | `-t 360` |

Example Output:

```
salloc: Pending job allocation 1092806
salloc: job 1092806 queued and waiting for resources
salloc: job 1092806 has been allocated resources
salloc: Granted job allocation 1092806
salloc: Waiting for resource configuration
salloc: Nodes npl21-30 are ready for job
```

•	Note: Your allocated nodes (e.g., npl21 to npl30) are now ready for use.
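If you would rather queue the same request non-interactively, it can be expressed as a batch script. This is a hedged sketch, not a site-specific template; the script body just reuses the launch command from Step 5:

```bash
#!/bin/bash
#SBATCH -N 10               # nodes, as in the salloc example above
#SBATCH -p npl-2024         # recommended partition (see Step 6)
#SBATCH --gres=gpu:8        # GPUs per node
#SBATCH -t 360              # time limit in minutes

accelerate launch --num_processes=8 generate.py
```

Submit it with `sbatch train.sh` (the file name is arbitrary) and monitor it with `squeue` as in Step 4.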

Step 3: SSH into Allocated Nodes (Optional)

In a new terminal window, SSH into each of the allocated nodes (if you need to check or run commands on them directly):

```bash
ssh npl21
ssh npl22
...
ssh npl30
```

•	Tip: You can use a loop or script to SSH into multiple nodes quickly.
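For example, a minimal loop over the node names from Step 2 (adjust the range to match your allocation):

```bash
# Run a quick command (here: hostname) on each allocated node.
for i in $(seq 21 30); do
    ssh "npl$i" hostname
done
```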

Step 4: Check Job Status with squeue

To check the status of your job and see if it’s running, use:

```bash
squeue
```

Example Output:

```
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
1092806  npl-2024 large-np NLUGvlsz  R  0:05    10 npl21,npl22,...,npl30
```

•	ST Column: The job state (R for running, PD for pending).
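On a shared cluster the full `squeue` listing can be long; these standard Slurm flags narrow it to your own jobs:

```bash
# Show only your jobs, with a long listing that includes the time limit.
squeue -u "$USER" -l
```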

Step 5: Launch the Training Job with accelerate

Run the generate.py script with the accelerate launcher, which distributes training across the allocated GPUs.

```bash
accelerate launch --num_processes=8 generate.py
```

Example Usage (a smaller 5-node allocation, matching the batch-size calculation below):

```bash
salloc -N 5 -p npl-2024 --gres=gpu:8 -t 360
```
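Note that `--num_processes=8` covers the GPUs of a single node. For a multi-node allocation like the one above, accelerate also needs the machine count, each node's rank, and the main-process address. A hedged sketch, assuming npl21 is the first allocated node and that you launch the command once per node:

```bash
# --num_processes is the TOTAL process count: 5 nodes x 8 GPUs = 40.
# Set --machine_rank to 0 on npl21, 1 on npl22, and so on.
accelerate launch \
    --num_machines=5 \
    --num_processes=40 \
    --machine_rank=0 \
    --main_process_ip=npl21 \
    --main_process_port=29500 \
    generate.py
```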

Calculating Batch Size:

If you train on 40,000 samples with a batch size of 36 per GPU, using 8 GPUs across 5 nodes (40 GPUs total), the effective batch size is 36 × 40 = 1,440 samples per step.

•	Total number of batches per epoch: ⌈40,000 / 1,440⌉ = 28.
•	Note: Adjust the batch size based on your dataset size and hardware capabilities.
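A quick shell sanity check of that arithmetic (the values match the example above; substitute your own):

```bash
# Batches per epoch = ceiling(samples / (per-GPU batch size x total GPUs)).
samples=40000; batch=36; gpus=40
echo $(( (samples + batch * gpus - 1) / (batch * gpus) ))   # prints 28
```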

Step 6: Training Tips

•	Use npl-2024 Partition: This is the most recent and recommended partition for your jobs.
•	Check GPU Usage: Use nvidia-smi on each node to monitor GPU utilization.

```bash
nvidia-smi
```
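To monitor utilization continuously instead of taking a single snapshot:

```bash
# Refresh the GPU report every 2 seconds; exit with Ctrl-C.
watch -n 2 nvidia-smi
```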

Troubleshooting

1.	Job Not Starting: If your job is stuck in the PD (pending) state, it may be due to insufficient resources. Try reducing the number of nodes or GPUs requested.
2.	SSH Timeout: If you lose connection, re-establish the SSH session and run squeue to check the job status. Your job will continue running even if you disconnect.
3.	Environment Issues: Ensure you activate the correct environment before running the job.

```bash
source activate asr
cd scratch-shared/partial-asr
source venv/bin/activate
```
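A quick way to confirm the environment is set up correctly, assuming PyTorch is installed in the venv (it is required for accelerate-based training):

```bash
# The interpreter should resolve inside the venv, and CUDA should be visible.
which python
python -c "import torch; print(torch.cuda.is_available())"
```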

Additional Notes

•	Resource Limits: Be mindful of the maximum time and resource limits imposed by the partition. Check the documentation or contact your admin if unsure.
•	Automating SSH into Nodes: Consider setting up SSH multiplexing so repeated connections reuse a single session (see the sketch below). Once it is configured, you can check the status of the shared connection with:

```bash
ssh -O check npl01
```
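The multiplexing itself is configured in `~/.ssh/config` using standard OpenSSH options; a minimal sketch (the socket path is just a common convention):

```
# ~/.ssh/config
Host npl*
    ControlMaster auto                    # share one connection per host
    ControlPath ~/.ssh/sockets/%r@%h-%p   # where the shared socket lives
    ControlPersist 10m                    # keep the master open 10 minutes after last use
```

Create the socket directory once with `mkdir -p ~/.ssh/sockets`; after the first login, further `ssh` commands to the same node reuse the open connection.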

This should streamline your workflow for training and testing on the CCI NPL cluster using VSCode and terminal commands.
