
Model Serving

We serve the model using vLLM on each node.


▶️ Start vLLM

cd 30-serve-vllm
docker build -t qwen-vllm:local .
docker run --rm --net=host --gpus all \
  -v /models:/models \
  -e HF_HOME=/models/hf_cache \
  qwen-vllm:local bash -lc "./start_vllm.sh"

This starts a vLLM OpenAI-compatible API server listening on port 8000.
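
Before sending requests you can check that the server is up; vLLM's OpenAI-compatible server exposes a /health endpoint and lists its served models under /v1/models:

curl -s http://<node>:8000/health      # returns HTTP 200 once the engine is ready
curl -s http://<node>:8000/v1/models   # lists the model IDs this server exposes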


📡 Test Inference

curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}'

The API returns an OpenAI-style JSON completion; the generated text is under choices[0].text.
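
To print only the generated text, the same request can be piped through jq (assuming jq is installed on the client):

curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}' \
  | jq -r '.choices[0].text'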


🔀 Load Balancing

  • Use HAProxy/NGINX to expose a single entrypoint in front of multiple vLLM servers (see the sketch below).
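
The following is a minimal NGINX sketch (HAProxy works just as well). It assumes two vLLM servers at the hypothetical addresses 10.0.0.1 and 10.0.0.2 and runs the official nginx image; adjust the addresses and ports to match your nodes.

cat > nginx-vllm.conf <<'EOF'
upstream vllm_backends {
    least_conn;                      # send each request to the least-busy backend
    server 10.0.0.1:8000;            # vLLM on node 1 (hypothetical address)
    server 10.0.0.2:8000;            # vLLM on node 2 (hypothetical address)
}
server {
    listen 8080;                     # single entrypoint for clients
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;     # long generations need a generous timeout
    }
}
EOF
docker run --rm --net=host \
  -v "$PWD/nginx-vllm.conf:/etc/nginx/conf.d/default.conf:ro" \
  nginx:stable

Clients then send their /v1/completions requests to port 8080 on the load balancer instead of an individual node.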

🖼️ Serving Diagram

(Diagram: client requests enter through the single HAProxy/NGINX entrypoint and are routed to the vLLM servers running on each node.)