
FAQ & Troubleshooting

Common issues and fixes when running the distributed LLM lab.


❌ NCCL Timeout

  • Open TCP port 29500 (the default PyTorch rendezvous port) in the firewall
  • Verify MASTER_ADDR and MASTER_PORT are identical on every node
  • Ensure all nodes can resolve each other's hostnames (see the sketch below)
  • Run rdma_verify.sh to check RDMA connectivity
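
A minimal sketch of the first three checks, run from a worker node. The master address 10.0.0.1 is a placeholder; substitute your own:

```bash
export MASTER_ADDR=10.0.0.1   # placeholder: your master node's address
export MASTER_PORT=29500      # PyTorch distributed default

# Can this node resolve the master?
getent hosts "$MASTER_ADDR" || echo "resolution failed -- fix DNS or /etc/hosts"

# Is the rendezvous port reachable over TCP?
timeout 5 bash -c "cat < /dev/null > /dev/tcp/$MASTER_ADDR/$MASTER_PORT" \
  && echo "port $MASTER_PORT reachable" \
  || echo "port $MASTER_PORT blocked -- check firewall rules"
```

If the checks pass but the timeout persists, setting NCCL_DEBUG=INFO on all ranks makes NCCL log which transport and interface it selected, which usually pinpoints the failure.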

❌ GPU Not Detected

  • Run nvidia-smi on the host to confirm the driver is loaded
  • Ensure the NVIDIA Container Toolkit is installed
  • Pass --gpus all in the docker run arguments (see the check below)
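
A quick two-step check, assuming Docker with the NVIDIA Container Toolkit; the CUDA image tag is just an example, any CUDA base image works:

```bash
# 1. Host level: driver loaded and GPUs visible
nvidia-smi

# 2. Container level: toolkit installed and --gpus all honored
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If step 1 works but step 2 fails, the problem is the container runtime configuration, not the GPU driver.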

🐢 Model Loading is Slow

  • Cache models on local NVMe instead of reading directly from NFS
  • Pre-sync models: rsync -av /models/hf_cache /local_nvme/models/
  • Set HF_HOME to the local cache path (see the sketch below)
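
A sketch of the full workflow using the paths from the bullets above, assuming /models/hf_cache on NFS was itself used as an HF_HOME cache; adjust both paths to your layout:

```bash
# Copy the shared NFS cache onto local NVMe once per node
mkdir -p /local_nvme/models
rsync -av /models/hf_cache /local_nvme/models/

# Point Hugging Face libraries at the local copy for all later loads
export HF_HOME=/local_nvme/models/hf_cache
```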

💥 Out of Memory (OOM)

  • Lower the batch size or sequence length
  • Use DeepSpeed ZeRO-2/3 to shard optimizer state and gradients (ZeRO-3 also shards parameters)
  • Enable gradient checkpointing in the training script (see the example launch below)
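
For illustration, a hypothetical launch combining these mitigations. train.py, its flags (which follow Hugging Face Trainer conventions), and ds_zero3.json are placeholders, not files from this repo:

```bash
# A smaller micro-batch plus gradient accumulation preserves the effective
# batch size, while gradient checkpointing trades recompute for activation memory.
deepspeed --num_gpus 8 train.py \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --deepspeed ds_zero3.json
```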

🔌 Network Issues

  • Verify the RoCEv2 or InfiniBand NICs are up and active
  • Use ibv_devinfo to list RDMA devices and port states
  • Check that NCCL_SOCKET_IFNAME matches the correct RDMA interface (see the checks below)
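
The checks below assume Mellanox/NVIDIA NICs; the interface and HCA names (eth2, mlx5_0) are examples to replace with your own:

```bash
# List RDMA devices and confirm the port is up
ibv_devinfo | grep -E 'hca_id|state'   # expect "state: PORT_ACTIVE"

# Map the RDMA device to its network interface
ip -br link show

# Pin NCCL to the right interfaces
export NCCL_SOCKET_IFNAME=eth2   # example name: TCP bootstrap/control path
export NCCL_IB_HCA=mlx5_0        # example name: RDMA data path
```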