Prerequisites - eunki-7/llm-rdma-mlops-lab GitHub Wiki
Prerequisites
This page describes how to prepare your environment before running multi-node training.
Hardware
- 4 servers, each with:
  - 1× NVIDIA A100 GPU
  - RDMA-capable NIC (RoCEv2 or InfiniBand)
  - NVMe local SSD for caching
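
One way to confirm that a node matches the hardware list is to count its GPUs and look for RDMA devices. The following is a minimal Python sketch, not part of the repository. It assumes `nvidia-smi` is on the `PATH` (so the driver must already be installed) and that the kernel exposes RDMA devices under `/sys/class/infiniband`. The function names are illustrative.

```python
import os
import re
import shutil
import subprocess

# Assumption: standard Linux sysfs location for RDMA devices.
RDMA_SYSFS = "/sys/class/infiniband"


def parse_gpu_list(nvidia_smi_l_output):
    """Extract GPU model names from the output of `nvidia-smi -L`."""
    pattern = re.compile(r"^GPU \d+: (.+?) \(UUID: ", re.MULTILINE)
    return pattern.findall(nvidia_smi_l_output)


def list_gpus():
    """Return GPU names on this node, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_gpu_list(out)


def list_rdma_devices(sysfs_dir=RDMA_SYSFS):
    """Return RDMA device names (for example mlx5_0), or [] if none exist."""
    try:
        return sorted(os.listdir(sysfs_dir))
    except OSError:
        return []


if __name__ == "__main__":
    gpus = list_gpus()
    print(f"GPUs: {gpus or 'none found'}")
    print(f"A100 present: {any('A100' in name for name in gpus)}")
    print(f"RDMA devices: {list_rdma_devices() or 'none found'}")
```

Running the script on each of the four servers should report one A100 and at least one RDMA device per node. It does not check the NVMe SSD.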
Software
- Ubuntu 22.04
- NVIDIA Driver ≥ 550
- CUDA Toolkit ≥ 12.4
- Docker + NVIDIA Container Toolkit
- RDMA Core libraries (`rdma-core`, `ibverbs-utils`)
- NFS (server on node0, clients on node1 through node3)
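
The version requirements above can be checked with a short script. This is a hedged sketch rather than tooling from the repository: it assumes `nvidia-smi` and `nvcc` are on the `PATH`, and it treats the presence of the `docker`, `nvidia-ctk`, and `ibv_devinfo` binaries as a stand-in for Docker, the NVIDIA Container Toolkit, and `ibverbs-utils` respectively. All function names are hypothetical, and NFS is not checked.

```python
import re
import shutil
import subprocess

MIN_DRIVER = (550,)  # from the list above: NVIDIA Driver >= 550
MIN_CUDA = (12, 4)   # from the list above: CUDA Toolkit >= 12.4
# Assumption: one representative binary per remaining requirement.
REQUIRED_TOOLS = ["docker", "nvidia-ctk", "ibv_devinfo"]


def parse_version(text):
    """Turn a version string such as '550.54.14' into (550, 54, 14)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(found, minimum):
    """Compare only as many version components as the minimum specifies."""
    return parse_version(found)[:len(minimum)] >= minimum


def cuda_release(nvcc_output):
    """Pull the 'release X.Y' number out of `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def run(cmd):
    """Return stdout of cmd, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None


def check_software():
    """Map each requirement name to True (satisfied) or False."""
    results = {tool: shutil.which(tool) is not None for tool in REQUIRED_TOOLS}
    driver = (run(["nvidia-smi", "--query-gpu=driver_version",
                   "--format=csv,noheader"]) or "").strip()
    results["driver"] = bool(driver) and meets_minimum(
        driver.splitlines()[0], MIN_DRIVER)
    release = cuda_release(run(["nvcc", "--version"]) or "")
    results["cuda"] = release is not None and meets_minimum(release, MIN_CUDA)
    return results


if __name__ == "__main__":
    for name, ok in check_software().items():
        print(f"{name:12s} {'ok' if ok else 'missing or too old'}")
```

A `cuda` failure can simply mean that `nvcc` is installed but not on the `PATH`, so treat a negative result as a prompt to look closer rather than as proof that the toolkit is absent.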
Time Sync
- Ensure all nodes run a time-sync daemon (ntpd or chrony) so their clocks stay in sync
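
To verify synchronization on a node, one option is to query systemd and chrony directly. The sketch below assumes a systemd-based host with `timedatectl` available and, for the offset reading, that chrony is the daemon in use. The 10 ms tolerance (`MAX_OFFSET_S`) is an arbitrary illustrative value, not a requirement stated by this project.

```python
import re
import shutil
import subprocess

MAX_OFFSET_S = 0.01  # hypothetical tolerance, chosen only for illustration


def parse_chrony_offset(tracking_output):
    """Return the signed clock offset in seconds from `chronyc tracking`.

    Positive means the system clock is fast relative to NTP time.
    Returns None if the 'System time' line is not found.
    """
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output, re.MULTILINE)
    if match is None:
        return None
    offset = float(match.group(1))
    return offset if match.group(2) == "fast" else -offset


def run(cmd):
    """Return stdout of cmd, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None


def clock_status():
    """Collect the sync flag and chrony offset; None marks 'unknown'."""
    synced = run(["timedatectl", "show",
                  "--property=NTPSynchronized", "--value"])
    tracking = run(["chronyc", "tracking"])
    return {
        "synchronized": None if synced is None else synced.strip() == "yes",
        "offset_s": None if tracking is None else parse_chrony_offset(tracking),
    }


if __name__ == "__main__":
    status = clock_status()
    print(status)
    offset = status["offset_s"]
    if offset is not None and abs(offset) > MAX_OFFSET_S:
        print(f"warning: clock offset {offset:+.6f} s exceeds {MAX_OFFSET_S} s")
```

Run it on every node. A `synchronized` value of `False`, or an offset above the chosen tolerance, suggests the time-sync daemon needs attention before training starts.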