vllm_feature - OpenNebula/one-apps GitHub Wiki

Features and Usage

The appliance includes vLLM, an open-source, high-performance inference engine designed for serving large language models with low latency and high throughput.

This appliance streamlines the deployment of model-serving applications and integrates seamlessly with models available on Hugging Face (note: a Hugging Face account and token may be required for access to some models). It includes a default application to serve the selected model, along with a web-based client for interactive inference.

Contextualization

The appliance's behavior and configuration are controlled by contextualization parameters specified in the VM template's Context Section. Below are the primary configurable aspects:

Inference API

The model is exposed through an API that your applications can consume; the parameters below control it, and an example request is shown after the table. You can also deploy a web-based client to interact with the model.

| Parameter | Default | Description |
|---|---|---|
| ONEAPP_VLLM_API_PORT | 8000 | Port number for the API endpoint. |
| ONEAPP_VLLM_API_WEB | YES | Deploy a web application to interact with the model. |
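
vLLM exposes an OpenAI-compatible HTTP API, so, assuming the default port above, a request from another host could look like the following (the VM IP is a placeholder and the model name must match ONEAPP_VLLM_MODEL_ID):

curl http://<vm-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'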

Inference Model

| Parameter | Default | Description |
|---|---|---|
| ONEAPP_VLLM_MODEL_ID | Qwen/Qwen2.5-1.5B-Instruct | LLM model to use for inference. |
| ONEAPP_VLLM_MODEL_MAX_LENGTH | 1024 | Maximum number of tokens (input + output) allowed per inference request. |
| ONEAPP_VLLM_MODEL_QUANTIZATION | 0 | Quantization level used to compress the LLM weights, reducing memory usage and improving inference speed (allowed values: 0, 4). |
| ONEAPP_VLLM_ENFORCE_EAGER | NO | Whether to always use eager-mode PyTorch in vLLM. |
| ONEAPP_VLLM_SLEEP_MODE | NO | Whether to enable sleep mode when a GPU is used. |
| ONEAPP_VLLM_GPU_MEMORY_UTILIZATION | 0.9 | Fraction of GPU memory to use (from 0.1 to 1.0). |
| ONEAPP_VLLM_MODEL_TOKEN | - | Hugging Face token used to access the specified model. |
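
As an illustration, the parameters above can be combined in the VM template's Context section; the values below are only an example (OpenNebula template syntax):

CONTEXT = [
  NETWORK = "YES",
  SSH_PUBLIC_KEY = "$USER[SSH_PUBLIC_KEY]",
  ONEAPP_VLLM_API_PORT = "8000",
  ONEAPP_VLLM_API_WEB = "YES",
  ONEAPP_VLLM_MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct",
  ONEAPP_VLLM_MODEL_MAX_LENGTH = "2048",
  ONEAPP_VLLM_MODEL_TOKEN = "<your Hugging Face token>" ]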

Using the Appliance from the CLI

vLLM is installed in a Python virtual environment. Depending on whether you want to use the CPU or the GPU for inference, activate the corresponding venv before using the vllm command (vllm_cpu_env for CPU usage, vllm_gpu_env for GPU usage), e.g.:

root@vLLM:~# source /root/vllm_gpu_env/bin/activate

(vllm_gpu_env) root@vLLM:~# vllm serve Qwen/Qwen2.5-1.5B-Instruct
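
Once the server reports it is ready, a quick sanity check from the same VM (assuming the default port 8000) is to list the served models:

(vllm_gpu_env) root@vLLM:~# curl http://localhost:8000/v1/models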

Using GPUs

The appliance is designed to use all available CPU and GPU resources in the VM. By default, the NVIDIA drivers from the 570 branch are installed. Optionally, you can install a different driver version during the appliance build process by setting the NVIDIA_DRIVER_PATH environment variable when executing the vLLM appliance build Makefile recipe. This variable can contain a URL or a local path to the driver package, e.g.:

sudo NVIDIA_DRIVER_PATH=/path/to/drivers make packer-service_Vllm
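
The variable also accepts a download URL; the URL below is only a placeholder for an actual NVIDIA driver package:

sudo NVIDIA_DRIVER_PATH=https://example.com/NVIDIA-Linux-x86_64-570.xx.run make packer-service_Vllm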

GPUs can be added to the VM using:

  • PCI Passthrough
  • SR-IOV vGPUs

Some configurations may require downloading proprietary drivers and configuring associated licenses. Note: When using NVIDIA cards, select a profile that supports OpenCL and CUDA applications (e.g., Q-series vGPU types).
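
As a rough sketch of the PCI Passthrough option, a GPU can be attached by adding a PCI attribute to the VM template; the vendor, device and class values below are placeholders that depend on your hardware:

PCI = [
  VENDOR = "10de",
  CLASS  = "0302",
  DEVICE = "2236" ]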

After deployment, the application should use the GPU resources, which can be verified with nvidia-smi:

root@vllm:~# nvidia-smi

Tue Dec 31 15:28:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000000:01:01.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |   6259MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2286      C   vllm::ServeReplica:app1:ChatBot             6257MiB |
+---------------------------------------------------------------------------------------+