# Model Serving
We serve the model using vLLM on each node.
```bash
cd 30-serve-vllm
docker build -t qwen-vllm:local .

docker run --rm --net=host --gpus all \
  -v /models:/models \
  -e HF_HOME=/models/hf_cache \
  qwen-vllm:local bash -lc "./start_vllm.sh"
```
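The repository ships its own `start_vllm.sh`, which is not reproduced here. As a reference, this is a minimal sketch of an equivalent launch script, assuming vLLM's OpenAI-compatible server and the `Qwen/Qwen2-7B-Instruct` model used in the request example below; the actual script may differ.

```bash
#!/usr/bin/env bash
# Minimal sketch of a vLLM launch script (assumption: the real
# 30-serve-vllm/start_vllm.sh may set different options).
set -euo pipefail

MODEL="${MODEL:-Qwen/Qwen2-7B-Instruct}"
PORT="${PORT:-8000}"

# HF_HOME=/models/hf_cache is set by the docker run command above,
# so model weights are cached on the mounted /models volume.
exec python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" \
  --host 0.0.0.0 \
  --port "$PORT" \
  --gpu-memory-utilization 0.90
```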
This starts a vLLM OpenAI-compatible API server on port 8000. Test it with a completion request:
```bash
curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}'
```
The API returns an OpenAI-style JSON response; the generated text is in `choices[0].text`.
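For quick smoke tests, the generated text can be pulled out with `jq`, and `/v1/models` shows which models the server is exposing (assumes `jq` is installed on the client):

```bash
# List the models the server exposes (OpenAI-compatible endpoint).
curl -s http://<node>:8000/v1/models

# Extract only the generated text from a completion (requires jq).
curl -s http://<node>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2-7B-Instruct","prompt":"Hello","max_tokens":16}' \
  | jq -r '.choices[0].text'
```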
- Use HAProxy or NGINX to expose a single entrypoint in front of multiple vLLM servers (see the sketch below).
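The repo does not prescribe a load-balancer config; the following is a hypothetical NGINX sketch that balances requests across two vLLM nodes. The hostnames `node1`/`node2` and the front-end port 8080 are placeholders.

```bash
# Hypothetical load-balancer sketch (not part of the repo): write a minimal
# NGINX config that spreads requests across two vLLM backends and run it
# with the official nginx image.
cat > nginx.conf <<'EOF'
events {}
http {
  upstream vllm_backends {
    least_conn;             # send each request to the least-busy backend
    server node1:8000;
    server node2:8000;
  }
  server {
    listen 8080;
    location / {
      proxy_pass http://vllm_backends;
      proxy_http_version 1.1;   # needed for streaming responses
      proxy_buffering off;      # pass tokens through as they are generated
    }
  }
}
EOF

docker run --rm --net=host \
  -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro" \
  nginx:stable
```

Clients then send requests to `http://<lb-host>:8080/v1/completions` instead of addressing individual nodes.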
