# Distributed Training

We use torchrun + DeepSpeed to fine-tune Hugging Face models across 4 nodes.
## ⚙️ Environment Variables

See `20-train-ddp/env.example`:
```bash
export MASTER_ADDR=10.0.0.10
export MASTER_PORT=29500
export NNODES=4
export NPROC_PER_NODE=1
export NODE_RANK=0
```
These variables define the master node's address and port, the number of nodes, the number of processes (GPUs) per node, and the rank of the current node.
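As a rough sketch of how these values typically map onto a torchrun launch (the actual `launch_ds.sh` in `20-train-ddp` may differ; `train_sft.py` and `ds_config.json` below are placeholder names):

```bash
# Sketch only: how the variables above typically feed torchrun.
# train_sft.py and ds_config.json are placeholder names, not necessarily the repo's files.
torchrun \
  --nnodes="${NNODES}" \
  --nproc_per_node="${NPROC_PER_NODE}" \
  --node_rank="${NODE_RANK}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  train_sft.py --deepspeed ds_config.json
```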
## 🚀 Run Training
```bash
docker build -t qwen-train:local 20-train-ddp

docker run --rm --net=host --gpus all \
  -v /models:/models -v /data:/data -v /outputs:/outputs \
  -e HF_HOME=/models/hf_cache \
  qwen-train:local bash -lc "./launch_ds.sh"
```
The first command builds the training image; the second launches distributed supervised fine-tuning with torchrun and DeepSpeed via `launch_ds.sh`.
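To scale out, the same `docker run` is typically executed on every node, with only `NODE_RANK` differing (0 on the master at `MASTER_ADDR`, 1-3 on the workers). The example below assumes the variables are injected from the host with `-e`; if `launch_ds.sh` sources `env.example` inside the container instead, adjust accordingly.

```bash
# Worker node example (assumption: env vars are passed in from the host).
# Only NODE_RANK changes per node; the master node (10.0.0.10) uses NODE_RANK=0.
docker run --rm --net=host --gpus all \
  -v /models:/models -v /data:/data -v /outputs:/outputs \
  -e HF_HOME=/models/hf_cache \
  -e MASTER_ADDR=10.0.0.10 -e MASTER_PORT=29500 \
  -e NNODES=4 -e NPROC_PER_NODE=1 -e NODE_RANK=1 \
  qwen-train:local bash -lc "./launch_ds.sh"
```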
## 💾 Checkpoints
- Written to `/outputs`
- Periodically synced to NFS using `storage/rsync/sync_outputs.sh` (sketched below)
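A minimal sketch of what such a periodic sync might look like; the actual `storage/rsync/sync_outputs.sh` may differ, and the NFS mount point `/mnt/nfs/outputs` is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: copy new checkpoints from /outputs to an NFS mount every 5 minutes.
# /mnt/nfs/outputs is a placeholder path, not the repo's configured destination.
while true; do
  rsync -avh --partial /outputs/ /mnt/nfs/outputs/
  sleep 300
done
```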