Overview
vLLM is a high-performance inference engine optimized for serving transformer LLMs with low latency, high throughput, token-level streaming, and efficient GPU memory usage. This appliance packages vLLM into a ready-to-run, configurable OpenNebula VM image, simplifying the deployment of inference workloads on your OpenNebula cloud.
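As a minimal sketch of how an application might consume the service, the snippet below streams tokens from vLLM's OpenAI-compatible HTTP endpoint using the `openai` Python client. The VM address, port (vLLM's default is 8000), and served model name are assumptions; adjust them to match your appliance configuration.

```python
from openai import OpenAI

# Assumed endpoint: the appliance VM exposing vLLM's OpenAI-compatible
# server on its default port 8000. vLLM does not require a real API key.
client = OpenAI(base_url="http://<vm-ip>:8000/v1", api_key="EMPTY")

# <served-model-name> is a placeholder for whichever model the appliance serves.
stream = client.chat.completions.create(
    model="<served-model-name>",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    stream=True,  # token-level streaming
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```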
The appliance provides a streamlined way to build and serve end-to-end AI applications using pre-trained models from the Hugging Face Transformers library.
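As an illustration of this workflow, the following sketch uses vLLM's offline Python API to load a small pre-trained model from the Hugging Face Hub and generate a completion; the model name is only an example.

```python
from vllm import LLM, SamplingParams

# Example model pulled from the Hugging Face Hub; substitute any supported model.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for a short completion.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["OpenNebula is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```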
Download
The latest version of the vLLM appliance is available for download from the OpenNebula public Marketplace.
Requirements
Minimum requirements vary depending on the selected LLM and its size. We are currently developing hardware recommendations for each available model. However, to ensure good performance with even the smallest model, we recommend provisioning a virtual machine with at least 8 GB of RAM and a GPU with a minimum of 14 GB of VRAM.
Release Notes
Detailed release notes for each version are available on the OpenNebula release page. The vLLM appliance is based on Ubuntu 24.04 LTS (x86-64) and ships the following component version:
| Component | Version |
|---|---|
| vLLM | 0.10.2 |
Next: vLLM Quick Start