# Features and Usage
The appliance includes vLLM, an open-source, high-performance inference engine designed for serving large language models with low latency and high throughput.
This appliance streamlines the deployment of model-serving applications and integrates seamlessly with models available on Hugging Face (note: a Hugging Face account and token may be required for access to some models). It includes a default application to serve the selected model, along with a web-based client for interactive inference.
## Contextualization
The appliance's behavior and configuration are controlled by contextualization parameters specified in the VM template's Context Section. Below are the primary configurable aspects:
### Inference API
The model is exposed through an API that your application can consume; the parameters below control it. You can also deploy a web application to interact with the model.
| Parameter | Default | Description |
|---|---|---|
| `ONEAPP_VLLM_API_PORT` | `8000` | Port number for the API endpoint. |
| `ONEAPP_VLLM_API_WEB` | `YES` | Deploy a web application to interact with the model. |
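Since the model is served by vLLM, the API is typically OpenAI-compatible and can be queried with any OpenAI-style client. A minimal sketch with `curl`, assuming the default port `8000`, the default model, and `<vm-ip>` as a placeholder for the VM's address:

```bash
# Chat completion request against the OpenAI-compatible endpoint served by vLLM.
# Replace <vm-ip> with the VM's IP address; 8000 is the default ONEAPP_VLLM_API_PORT.
curl http://<vm-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```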
### Inference Model
| Parameter | Default | Description |
|---|---|---|
| `ONEAPP_VLLM_MODEL_ID` | `Qwen/Qwen2.5-1.5B-Instruct` | LLM model to use for inference. |
| `ONEAPP_VLLM_MODEL_MAX_LENGTH` | `1024` | Maximum number of tokens (input + output) allowed per inference request. |
| `ONEAPP_VLLM_MODEL_QUANTIZATION` | `0` | Reduce memory usage and improve inference speed by compressing LLM weights (`0` or `4`). |
| `ONEAPP_VLLM_ENFORCE_EAGER` | `NO` | Whether to always use eager-mode PyTorch in vLLM. |
| `ONEAPP_VLLM_SLEEP_MODE` | `NO` | Whether to enable sleep mode when a GPU is used. |
| `ONEAPP_VLLM_GPU_MEMORY_UTILIZATION` | `0.9` | Fraction of GPU memory to use (from 0.1 to 1.0). |
| `ONEAPP_VLLM_MODEL_TOKEN` | - | Hugging Face token used to access the specified LLM model. |
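For reference, a minimal sketch of how these parameters might look in the `CONTEXT` section of the VM template; the values shown are illustrative, not required:

```
CONTEXT = [
  ONEAPP_VLLM_API_PORT = "8000",
  ONEAPP_VLLM_API_WEB = "YES",
  ONEAPP_VLLM_MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct",
  ONEAPP_VLLM_MODEL_MAX_LENGTH = "1024",
  ONEAPP_VLLM_GPU_MEMORY_UTILIZATION = "0.9",
  ONEAPP_VLLM_MODEL_TOKEN = "<your Hugging Face token>" ]
```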
## Using the Appliance from the CLI
vLLM is installed in a Python virtual environment. Depending on whether you want to use the CPU or the GPU for inference, activate the corresponding venv before running the `vllm` command (`vllm_cpu_env` for CPU usage and `vllm_gpu_env` for GPU usage), e.g.:
```
root@vLLM:~# source /root/vllm_gpu_env/bin/activate
(vllm_gpu_env) root@vLLM:~# vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
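Once `vllm serve` is running, a quick way to confirm the server is up is to list the models it exposes through its OpenAI-compatible API; a small sketch from inside the VM, assuming the default port `8000`:

```bash
# List the models served by the running vLLM instance (OpenAI-compatible endpoint).
curl http://localhost:8000/v1/models
```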
## Using GPUs
By default, the appliance uses all available CPU and GPU resources in the VM, and the NVIDIA drivers from the 570 branch are installed. Optionally, you can install a different driver version during the appliance build process by setting the `NVIDIA_DRIVER_PATH` environment variable when executing the vLLM appliance build Makefile recipe. This variable can contain a URL or a local path to the driver package, e.g.:
```
sudo NVIDIA_DRIVER_PATH=/path/to/drivers make packer-service_Vllm
```
GPUs can be added to the VM using:
- PCI Passthrough
- SR-IOV vGPUs
Some configurations may require downloading proprietary drivers and configuring associated licenses. Note: When using NVIDIA cards, select a profile that supports OpenCL and CUDA applications (e.g., Q-series vGPU types).
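As an illustration, PCI passthrough of a GPU is usually requested with a `PCI` section in the VM template. A minimal sketch; the vendor and class values shown are the common ones for NVIDIA GPUs and should be adjusted to your hardware:

```
PCI = [
  VENDOR = "10de",
  CLASS  = "0302" ]
```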
After deployment, the application should be using the GPU resources, which can be verified with `nvidia-smi`:
```
root@vllm:~# nvidia-smi
Tue Dec 31 15:28:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000000:01:01.0 Off |                    0 |
| N/A  N/A    P8             N/A /  N/A   |   6259MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      2286      C   vllm::ServeReplica:app1:ChatBot            6257MiB |
+---------------------------------------------------------------------------------------+
```