Prerequisites - eunki-7/llm-rdma-mlops-lab GitHub Wiki

Prerequisites

This page describes how to prepare your environment before running multi-node training.

Hardware

  • 4 servers, each with:
    • 1× NVIDIA A100 GPU
    • RDMA-capable NIC (RoCEv2 or InfiniBand)
    • NVMe local SSD for caching
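
One way to confirm that a node exposes the RDMA NIC and NVMe drive listed above is to look at the Linux sysfs tree. The sketch below assumes the usual kernel layout, where RDMA devices (both InfiniBand and RoCE) appear under /sys/class/infiniband and NVMe controllers under /sys/class/nvme. The helper names `list_sysfs_class` and `hardware_report` are illustrative and not part of this lab's code.

```python
import os


def list_sysfs_class(class_name, root="/sys/class"):
    """Return the sorted device names under <root>/<class_name>.

    Returns an empty list when the class directory does not exist,
    which usually means the driver is not loaded or no device is present.
    """
    path = os.path.join(root, class_name)
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))


def hardware_report(root="/sys/class"):
    """Collect the RDMA and NVMe devices visible on this node.

    The `root` parameter exists only so the function can be pointed
    at a fake directory tree for testing.
    """
    return {
        "rdma_devices": list_sysfs_class("infiniband", root),
        "nvme_controllers": list_sysfs_class("nvme", root),
    }
```

Calling `hardware_report()` on each of the four servers should return non-empty lists for both keys. An empty `rdma_devices` list suggests the NIC driver or the rdma-core packages are missing.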

Software

  • Ubuntu 22.04
  • NVIDIA Driver ≥ 550
  • CUDA Toolkit ≥ 12.4
  • Docker + NVIDIA Container Toolkit
  • RDMA Core libraries (rdma-core, ibverbs-utils)
  • NFS (server on node0, clients on node1 through node3)
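
The driver and CUDA minimums above can be checked with a small script. This is a minimal sketch that assumes `nvidia-smi` and `nvcc` are on the PATH and that `nvcc --version` prints a line containing `release <major>.<minor>`. The function names are hypothetical. The parsing is kept separate from the subprocess calls so it can be exercised without a GPU.

```python
import re
import subprocess

# Minimum versions taken from the software list above.
MIN_DRIVER_MAJOR = 550
MIN_CUDA = (12, 4)


def driver_major(version_text):
    """Extract the major version from a string such as '550.54.15'."""
    return int(version_text.strip().split(".")[0])


def cuda_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release string found")
    return (int(match.group(1)), int(match.group(2)))


def versions_ok(driver_text, nvcc_output):
    """Return True when both the driver and the toolkit meet the minimums."""
    return (
        driver_major(driver_text) >= MIN_DRIVER_MAJOR
        and cuda_release(nvcc_output) >= MIN_CUDA
    )


def check_local_node():
    """Query the local tools and compare against the minimums."""
    driver = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    nvcc = subprocess.run(
        ["nvcc", "--version"],
        capture_output=True, text=True, check=True,
    ).stdout
    # nvidia-smi prints one line per GPU; the first line is enough here.
    return versions_ok(driver.splitlines()[0], nvcc)
```

Running `check_local_node()` on every node before launching a job catches a mismatched driver or toolkit early. The check covers only the two version requirements, not Docker, rdma-core, or NFS.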

Time Sync

  • Keep the clocks on all nodes synchronized with an NTP client such as chrony
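
Where chrony is the chosen client, the clock offset on a node can be read from the `System time` line of `chronyc tracking`. The sketch below parses that line from captured output. The function names and the 10 ms tolerance are assumptions made for illustration, since this page does not state how tightly the clocks need to agree.

```python
import re


def chrony_offset_seconds(tracking_output):
    """Return the signed clock offset reported by `chronyc tracking`.

    Positive means the system clock is ahead of NTP time ("fast"),
    negative means it is behind ("slow").
    """
    match = re.search(
        r"System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
    )
    if match is None:
        raise ValueError("no 'System time' line found in chronyc output")
    offset = float(match.group(1))
    return offset if match.group(2) == "fast" else -offset


def within_tolerance(tracking_output, max_abs_offset=0.01):
    """Check the offset against a tolerance in seconds (hypothetical default)."""
    return abs(chrony_offset_seconds(tracking_output)) <= max_abs_offset
```

A typical use is to capture `chronyc tracking` on each node, for example over SSH, and pass the text to `within_tolerance`.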