
Model Serving

We serve the model using vLLM on each node.


▶️ Start vLLM

cd 30-serve-vllm
docker build -t qwen-vllm:local .
docker run --rm --net=host --gpus all \
  -v /models:/models \
  -e HF_HOME=/models/hf_cache \
  qwen-vllm:local bash -lc "./start_vllm.sh"

This starts a vLLM OpenAI-compatible API server listening on port 8000.
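
Before sending requests you can check that the server is up; vLLM's OpenAI-compatible server exposes a /health endpoint and lists its served models under /v1/models:

curl -s http://<node>:8000/health      # returns HTTP 200 once the engine is ready
curl -s http://<node>:8000/v1/models   # lists the model IDs this server exposes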


📡 Test Inference

curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}'

The API returns an OpenAI-style JSON completion; the generated text is under choices[0].text.
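
To print only the generated text, the same request can be piped through jq (assuming jq is installed on the client):

curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}' \
  | jq -r '.choices[0].text'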


🔀 Load Balancing

  • Use HAProxy/NGINX to expose a single entrypoint in front of multiple vLLM servers (see the sketch below).
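
The following is a minimal NGINX sketch (HAProxy works just as well). It assumes two vLLM servers at the hypothetical addresses 10.0.0.1 and 10.0.0.2 and runs the official nginx image; adjust the addresses and ports to match your nodes.

cat > nginx-vllm.conf <<'EOF'
upstream vllm_backends {
    least_conn;                      # send each request to the least-busy backend
    server 10.0.0.1:8000;            # vLLM on node 1 (hypothetical address)
    server 10.0.0.2:8000;            # vLLM on node 2 (hypothetical address)
}
server {
    listen 8080;                     # single entrypoint for clients
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;     # long generations need a generous timeout
    }
}
EOF
docker run --rm --net=host \
  -v "$PWD/nginx-vllm.conf:/etc/nginx/conf.d/default.conf:ro" \
  nginx:stable

Clients then send their /v1/completions requests to port 8080 on the load balancer instead of an individual node.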

🖼️ Serving Diagram

(Diagram: client requests enter through the single HAProxy/NGINX entrypoint and are routed to the vLLM servers running on each node.)