vLLM

About

  • LLM inference & serving engine
  • Easy, Fast, and Open Source
  • High throughput (up to 24x higher than HuggingFace Transformers)
  • Efficient Memory Management: Uses PagedAttention
  • Cross-Platform: GPUs and CPUs

Model support

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral)
  • Multi-modal LLMs (e.g., LLaVA)
  • State-Space Models (e.g., Jamba)
  • Embedding Models (e.g., E5-Mistral)
  • Reward Models (e.g., Qwen2.5-Math-RM)

vLLM API (1): LLM class - Python interface for offline batched inference

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with HF model name.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

# Generate completions for the prompts.
outputs = llm.generate(prompts)  # also llm.chat(messages)
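The snippet above relies on default sampling settings. A minimal sketch of controlling decoding with SamplingParams and reading back the generated text (the prompt string and parameter values are illustrative):

from vllm import LLM, SamplingParams

# Greedy decoding, capped at 32 new tokens (illustrative values).
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
outputs = llm.generate(["Hello, my name is"], sampling_params)

# Each RequestOutput holds the prompt and its generated completions.
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)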

vLLM API (2): OpenAI-compatible server

Server

$ vllm serve meta-llama/Meta-Llama-3.1-8B

Client

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
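Because the server speaks the OpenAI API, any OpenAI-compatible client can be used instead of curl. A minimal sketch with the official openai Python package, assuming the server is running on the default port 8000 (the api_key value is a placeholder; the default vLLM server does not check it):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)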

See also
