
FAQ & Troubleshooting

Common issues and fixes when running the distributed LLM lab.


❌ NCCL Timeout

  • Open TCP port 29500 (the default PyTorch rendezvous port) in the firewall
  • Verify MASTER_ADDR and MASTER_PORT are identical on every node
  • Ensure all nodes can resolve each other's hostnames (see the sketch below)
  • Run rdma_verify.sh to check RDMA connectivity
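
A minimal sketch of the first three checks, run from a worker node. The master address 10.0.0.1 is a placeholder; substitute your own:

```bash
export MASTER_ADDR=10.0.0.1   # placeholder: your master node's address
export MASTER_PORT=29500      # PyTorch distributed default

# Can this node resolve the master?
getent hosts "$MASTER_ADDR" || echo "resolution failed -- fix DNS or /etc/hosts"

# Is the rendezvous port reachable over TCP?
timeout 5 bash -c "cat < /dev/null > /dev/tcp/$MASTER_ADDR/$MASTER_PORT" \
  && echo "port $MASTER_PORT reachable" \
  || echo "port $MASTER_PORT blocked -- check firewall rules"
```

If the checks pass but the timeout persists, setting NCCL_DEBUG=INFO on all ranks makes NCCL log which transport and interface it selected, which usually pinpoints the failure.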

❌ GPU Not Detected

  • Run nvidia-smi on the host to confirm the driver is loaded
  • Ensure the NVIDIA Container Toolkit is installed
  • Pass --gpus all in the docker run arguments (see the check below)
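
A quick two-step check, assuming Docker with the NVIDIA Container Toolkit; the CUDA image tag is just an example, any CUDA base image works:

```bash
# 1. Host level: driver loaded and GPUs visible
nvidia-smi

# 2. Container level: toolkit installed and --gpus all honored
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If step 1 works but step 2 fails, the problem is the container runtime configuration, not the GPU driver.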

🐢 Model Loading is Slow

  • Cache models on local NVMe instead of reading directly from NFS
  • Pre-sync models: rsync -av /models/hf_cache /local_nvme/models/
  • Set HF_HOME to the local cache path (see the sketch below)
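
A sketch of the full workflow using the paths from the bullets above, assuming /models/hf_cache on NFS was itself used as an HF_HOME cache; adjust both paths to your layout:

```bash
# Copy the shared NFS cache onto local NVMe once per node
mkdir -p /local_nvme/models
rsync -av /models/hf_cache /local_nvme/models/

# Point Hugging Face libraries at the local copy for all later loads
export HF_HOME=/local_nvme/models/hf_cache
```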

💥 Out of Memory (OOM)

  • Lower the batch size or sequence length
  • Use DeepSpeed ZeRO-2/3 to shard optimizer state and gradients (ZeRO-3 also shards parameters)
  • Enable gradient checkpointing in the training script (see the example launch below)
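
For illustration, a hypothetical launch combining these mitigations. train.py, its flags (which follow Hugging Face Trainer conventions), and ds_zero3.json are placeholders, not files from this repo:

```bash
# A smaller micro-batch plus gradient accumulation preserves the effective
# batch size, while gradient checkpointing trades recompute for activation memory.
deepspeed --num_gpus 8 train.py \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --deepspeed ds_zero3.json
```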

🔌 Network Issues

  • Verify the RoCEv2 or InfiniBand NICs are up and active
  • Use ibv_devinfo to list RDMA devices and port states
  • Check that NCCL_SOCKET_IFNAME matches the correct RDMA interface (see the checks below)
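
The checks below assume Mellanox/NVIDIA NICs; the interface and HCA names (eth2, mlx5_0) are examples to replace with your own:

```bash
# List RDMA devices and confirm the port is up
ibv_devinfo | grep -E 'hca_id|state'   # expect "state: PORT_ACTIVE"

# Map the RDMA device to its network interface
ip -br link show

# Pin NCCL to the right interfaces
export NCCL_SOCKET_IFNAME=eth2   # example name: TCP bootstrap/control path
export NCCL_IB_HCA=mlx5_0        # example name: RDMA data path
```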