# FAQ & Troubleshooting

Common issues and fixes when running the distributed LLM lab (eunki-7/llm-rdma-mlops-lab).
## ❌ NCCL Timeout
- Check firewall rules (open port 29500)
- Verify the `MASTER_ADDR` and `MASTER_PORT` values
- Ensure all nodes can resolve each other's hostnames
- Run `rdma_verify.sh` to check RDMA connectivity
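The rendezvous checks above can be scripted as a quick pre-flight test. This is a minimal sketch, not part of the lab's tooling; the hostname `node01` is an example placeholder, and it assumes `nc` (netcat) and `getent` are available on the node.

```shell
#!/usr/bin/env bash
# Pre-flight sketch for the NCCL rendezvous point.
# "node01" is an example hostname, not a value from the lab.
MASTER_ADDR="${MASTER_ADDR:-node01}"
MASTER_PORT="${MASTER_PORT:-29500}"
echo "rendezvous target: ${MASTER_ADDR}:${MASTER_PORT}"

# Port reachability check (non-fatal so the remaining checks still run)
if command -v nc >/dev/null 2>&1; then
  nc -zv -w 2 "$MASTER_ADDR" "$MASTER_PORT" 2>/dev/null \
    || echo "WARN: port ${MASTER_PORT} on ${MASTER_ADDR} unreachable"
fi

# Hostname resolution check
getent hosts "$MASTER_ADDR" >/dev/null \
  || echo "WARN: cannot resolve ${MASTER_ADDR}"
```

Run it on every node before launching training; a `WARN` line points at the failing check (firewall vs. DNS).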
## ❌ GPU Not Detected
- Run `nvidia-smi` to confirm drivers are loaded
- Ensure the NVIDIA Container Toolkit is installed
- Check Docker run arguments: `--gpus all`
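A minimal sketch of the driver check, with the matching container-level check shown as a comment (the CUDA image tag is an example, not one the lab prescribes):

```shell
#!/usr/bin/env bash
# Sketch: confirm the driver stack is visible before blaming the container.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_STATUS="driver-ok"
  nvidia-smi --query-gpu=name --format=csv,noheader
else
  GPU_STATUS="driver-missing"
  echo "nvidia-smi not found: install the NVIDIA driver first"
fi
echo "status: ${GPU_STATUS}"

# Once the host sees GPUs, verify containers do too (example image tag):
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If the host check passes but the container check fails, the NVIDIA Container Toolkit (or the `--gpus all` flag) is the likely culprit.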
## 🐢 Model Loading Is Slow
- Use NVMe local caching instead of direct NFS reads
- Pre-sync models: `rsync -av /models/hf_cache /local_nvme/models/`
- Update `HF_HOME` to point to the local cache
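The three steps above can be combined into one staging snippet. A sketch using the paths from this section; adjust them to your mount points, and note the `WARN` fallbacks are only there so a dry run does not abort:

```shell
#!/usr/bin/env bash
# Sketch: stage the Hugging Face cache from NFS onto local NVMe.
SRC=/models/hf_cache
DST=/local_nvme/models/hf_cache

mkdir -p "$DST" 2>/dev/null || echo "WARN: cannot create $DST (check permissions)"
rsync -av "$SRC/" "$DST/" 2>/dev/null || echo "WARN: source cache not present on this host"

# Point Hugging Face libraries at the local copy for this shell
export HF_HOME="$DST"
echo "HF_HOME=${HF_HOME}"
```

Export `HF_HOME` in the training job's environment (not just interactively) so every worker reads from NVMe instead of NFS.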
## 💥 Out of Memory (OOM)
- Lower batch size or sequence length
- Use DeepSpeed ZeRO-2/3 optimization
- Enable gradient checkpointing in the training script
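As an illustration of the ZeRO-2 suggestion, the snippet below writes a minimal DeepSpeed config. The filename and all values are examples, not the lab's actual settings; pass the file to your launcher with `--deepspeed ds_zero2.json`:

```shell
#!/usr/bin/env bash
# Sketch: a minimal DeepSpeed ZeRO-2 config (example values only).
cat > ds_zero2.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
EOF
echo "wrote ds_zero2.json"
```

Shrinking `train_micro_batch_size_per_gpu` while raising `gradient_accumulation_steps` keeps the effective batch size constant while cutting peak activation memory; switch `"stage"` to 3 if ZeRO-2 still OOMs.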
## 🔌 Network Issues
- Verify RoCEv2 or InfiniBand NICs are active
- Use `ibv_devinfo` to list devices
- Check that `NCCL_SOCKET_IFNAME` matches the correct RDMA interface
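A sketch combining the device listing with the NCCL pinning. The interface name `ens1f0` and HCA name `mlx5_0` are assumptions for illustration; substitute whatever `ip link` and `ibv_devinfo` report on your nodes:

```shell
#!/usr/bin/env bash
# Sketch: list RDMA devices, then pin NCCL to the right interface.
if command -v ibv_devinfo >/dev/null 2>&1; then
  ibv_devinfo | grep -E 'hca_id|state'   # device names and PORT_ACTIVE status
else
  echo "ibv_devinfo not found: install rdma-core / verbs utilities"
fi

export NCCL_SOCKET_IFNAME=ens1f0   # example interface name, an assumption
export NCCL_IB_HCA=mlx5_0          # example HCA name, an assumption
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```

A mismatched `NCCL_SOCKET_IFNAME` (e.g. pointing at the management NIC) silently falls back to slow TCP paths or hangs at init, so verify it against the `ibv_devinfo` output.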