vLLM

About

  • LLM inference & serving engine
  • Easy, Fast, and Open Source
  • High throughput (up to 24x higher than HuggingFace Transformers)
  • Efficient Memory Management: Uses PagedAttention
  • Cross-Platform: GPUs and CPUs

Model support

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral)
  • Multi-modal LLMs (e.g., LLaVA)
  • State-Space Models (e.g., Jamba)
  • Embedding Models (e.g., E5-Mistral)
  • Reward Models (e.g., Qwen2.5-Math-RM)

vLLM API (1): LLM class - Python interface for offline batched inference

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with HF model name.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

# Generate completions for the prompts.
outputs = llm.generate(prompts)  # also llm.chat(messages)
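The snippet above relies on default sampling settings. A minimal sketch of controlling decoding with SamplingParams and reading back the generated text (the prompt string and parameter values are illustrative):

from vllm import LLM, SamplingParams

# Greedy decoding, capped at 32 new tokens (illustrative values).
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
outputs = llm.generate(["Hello, my name is"], sampling_params)

# Each RequestOutput holds the prompt and its generated completions.
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)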

vLLM API (2): OpenAI-compatible server

Server

$ vllm serve meta-llama/Meta-Llama-3.1-8B

Client

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
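Because the server speaks the OpenAI API, any OpenAI-compatible client can be used instead of curl. A minimal sketch with the official openai Python package, assuming the server is running on the default port 8000 (the api_key value is a placeholder; the default vLLM server does not check it):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)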

See also
