Distributed Training

We use torchrun + DeepSpeed to fine-tune HuggingFace models across 4 nodes.


⚙️ Environment Variables

See 20-train-ddp/env.example:

export MASTER_ADDR=10.0.0.10
export MASTER_PORT=29500
export NNODES=4
export NPROC_PER_NODE=1
export NODE_RANK=0

These variables define the master node's address and rendezvous port, the number of nodes, the number of processes per node, and the current node's rank.
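
For orientation, a launcher like launch_ds.sh typically forwards these variables to torchrun. The sketch below is only an illustration under that assumption; the entry-point script (train_sft.py) and the DeepSpeed config name (ds_config.json) are hypothetical, not taken from the repository.

#!/usr/bin/env bash
# Hypothetical launcher sketch: forward the env.example variables to torchrun.
set -euo pipefail

torchrun \
  --nnodes="${NNODES}" \
  --nproc_per_node="${NPROC_PER_NODE}" \
  --node_rank="${NODE_RANK}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  train_sft.py --deepspeed ds_config.json   # hypothetical script and config names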


🚀 Run Training

docker build -t qwen-train:local 20-train-ddp
docker run --rm --net=host --gpus all \
  -v /models:/models -v /data:/data -v /outputs:/outputs \
  -e HF_HOME=/models/hf_cache \
  qwen-train:local bash -lc "./launch_ds.sh"

These commands build the training image and launch distributed supervised fine-tuning with torchrun and DeepSpeed inside the container.
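
Every node runs the same image; only NODE_RANK differs (0 on the master, 1-3 on the workers). One way to supply the variables is to pass them into the container with -e; whether the repo does this or sources env.example inside launch_ds.sh is not shown here, so treat the example below as illustrative.

# Example for a worker node (NODE_RANK=1); values other than NODE_RANK match env.example
docker run --rm --net=host --gpus all \
  -v /models:/models -v /data:/data -v /outputs:/outputs \
  -e HF_HOME=/models/hf_cache \
  -e MASTER_ADDR=10.0.0.10 -e MASTER_PORT=29500 \
  -e NNODES=4 -e NPROC_PER_NODE=1 -e NODE_RANK=1 \
  qwen-train:local bash -lc "./launch_ds.sh"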


💾 Checkpoints

  • Written to /outputs
  • Periodically synced to NFS using storage/rsync/sync_outputs.sh; a sketch of such a sync step follows below
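
The sync step could look roughly like this. This is a minimal sketch, not the repository's sync_outputs.sh, and the NFS mount path is an assumption.

#!/usr/bin/env bash
# Hypothetical sketch of an /outputs -> NFS sync; /mnt/nfs/outputs is an assumed mount point.
set -euo pipefail

rsync -av --partial /outputs/ /mnt/nfs/outputs/
#   -a         archive mode (preserve permissions, timestamps, symlinks)
#   -v         verbose output
#   --partial  keep partially transferred checkpoint files so a retry can resume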

🖼️ Training Diagram