Home - eunki-7/llm-rdma-mlops-lab GitHub Wiki
LLM RDMA + NCCL A100 4-Node Lab
Welcome to the llm-rdma-mlops-lab Wiki 🎉
This Wiki documents how to set up, run, and monitor a distributed LLM infrastructure built on PyTorch DDP, DeepSpeed, vLLM, NCCL, and RDMA.
📘 Contents
- Prerequisites
- NCCL Tests
- Distributed Training
- Model Serving
- Traffic & Monitoring
- Kubernetes (Optional)
- FAQ & Troubleshooting
🎯 Purpose
This lab is a hands-on blueprint for enterprise-grade distributed LLM training and serving,
with best practices for performance, scalability, and observability.