LLM RDMA + NCCL A100 4-Node Lab

Welcome to the llm-rdma-mlops-lab Wiki 🎉
This Wiki documents how to set up, run, and monitor distributed LLM infrastructure built on PyTorch DDP, DeepSpeed, vLLM, NCCL, and RDMA.


📘 Contents


🎯 Purpose

A hands-on blueprint for enterprise-grade distributed LLM training and serving,
with best practices for performance, scalability, and observability.


📊 Architecture