vLLM
- LLM inference & serving engine
- Easy, Fast, and Open Source
- High throughput (up to 24x vs. Hugging Face Transformers, per the vLLM announcement)
- Efficient memory management: PagedAttention pages the KV cache into fixed-size blocks (see the toy sketch after this list)
- Cross-Platform: GPUs and CPUs
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral)
- Multi-modal LLMs (e.g., LLaVA)
- State-Space Models (e.g., Jamba)
- Embedding Models (e.g., E5-Mistral)
- Reward Models (e.g., Qwen2.5-Math-RM)
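
A toy sketch of the block-table idea behind PagedAttention: the KV cache is carved into fixed-size physical blocks, and each sequence maps its logical blocks to physical ones allocated on demand. The block size and class names below are illustrative assumptions, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical KV-cache blocks from a free pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so waste is bounded by at most one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> 3 blocks of 16
    seq.append_token()
print(seq.block_table)       # e.g. [1023, 1022, 1021]
```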
Quickstart with the Python API:

```python
from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM from a Hugging Face model name.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

# Generate texts from the prompts (llm.chat(messages) is also available).
outputs = llm.generate(prompts)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```
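
Per-request generation settings can be passed via SamplingParams; the parameter values below are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

# Override the default decoding settings for this batch of prompts.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```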
Serving with the OpenAI-compatible server:

```shell
# Start the server
$ vllm serve meta-llama/Meta-Llama-3.1-8B

# Query it from another terminal
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```
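
Because the server speaks the OpenAI API, the same request can be sent with the official openai Python client; the base URL and placeholder API key below assume a default local deployment.

```python
from openai import OpenAI

# Point the client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```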