Features and Usage
The appliance includes two model serving tools:

- Ray and its Ray Serve library, which enable scalable deployment of inference APIs across distributed systems.
- vLLM, a high-performance, open-source inference engine optimized for serving large language models with low latency and high throughput.
This appliance streamlines the deployment of model-serving applications and integrates seamlessly with models available on Hugging Face (note: a Hugging Face account and token may be required for access to some models). It includes a default application to serve the selected model, along with a web-based client for interactive inference.
Contextualization
The appliance's behavior and configuration are controlled by contextualization parameters specified in the VM template's Context Section. Below are the primary configurable aspects:
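As a minimal sketch, the parameters described below might appear in the CONTEXT section of the VM template as follows; the values shown are only illustrative, and any parameter you omit falls back to the default listed in the tables that follow.

```
CONTEXT = [
  NETWORK = "YES",
  ONEAPP_RAY_AI_FRAMEWORK = "VLLM",
  ONEAPP_RAY_API_OPENAI = "YES",
  ONEAPP_RAY_MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct",
  ONEAPP_RAY_MODEL_TOKEN = "<your Hugging Face token>" ]
```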
Inference Deployment Framework
You can select Ray or vLLM as the framework used to deploy the inference application and model.
Parameter | Default | Description |
---|---|---|
`ONEAPP_RAY_AI_FRAMEWORK` | `RAY` | Select the framework to serve the LLM model (`RAY` or `VLLM`). |
Inference API
The model is exposed through an API that can be consumed by your application; you can control it with the following parameters. You can also deploy a web application to interact with the model.
Parameter | Default | Description |
---|---|---|
`ONEAPP_RAY_API_OPENAI` | `NO` | Expose the LLM model through an OpenAI-compatible API. |
`ONEAPP_RAY_API_PORT` | `8000` | Port number for the API endpoint. |
`ONEAPP_RAY_API_WEB` | `YES` | Deploy a web application to interact with the model. |
Inference Model
Parameter | Default | Description |
---|---|---|
`ONEAPP_RAY_MAX_NEW_TOKENS` | `1024` | Maximum number of tokens (input + output) allowed per inference request. |
`ONEAPP_RAY_MODEL_ID` | `meta-llama/Llama-3.2-3B-Instruct` | Determines the LLM model to use for inference. |
`ONEAPP_RAY_MODEL_PROMPT` | You are a helpful assistant. Answer the question. | Starting directive for model responses (ignored when using the OpenAI API). |
`ONEAPP_RAY_MODEL_QUANTIZATION` | `0` | Reduce memory usage and improve inference speed by compressing LLM weights (`0`, `4`, `8`). |
`ONEAPP_RAY_MODEL_TEMPERATURE` | `0.1` | Temperature for model outputs, controlling randomness (ignored when using the OpenAI API). |
`ONEAPP_RAY_MODEL_TOKEN` | - | Hugging Face token to access the specified LLM model. |
Configuration Files
To achieve full control over the application setup, you can provide a configuration file for the Ray Serve application. Refer to the Ray Serve documentation for a detailed description. Use the following parameter to configure this:
Parameter | Default | Description |
---|---|---|
`ONEAPP_RAY_CONFIG_FILE64` | - | Base64-encoded configuration file for the Serve application. |
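As a sketch of the intended workflow, the configuration file is prepared locally, Base64-encoded into a single line, and the resulting string is set as the value of `ONEAPP_RAY_CONFIG_FILE64`. The file name `serve_config.yaml` below is only a placeholder; its contents must follow the schema described in the Ray Serve documentation.

```
# Encode the Ray Serve configuration file as a single-line Base64 string
base64 -w0 serve_config.yaml > serve_config.b64

# Copy the contents of serve_config.b64 into the ONEAPP_RAY_CONFIG_FILE64
# context parameter of the VM template
cat serve_config.b64
```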
Using the Appliance from the CLI
Ray and vLLM are installed in a Python virtual environment; you will need to activate it before using the `ray` or `vllm` commands:
```
root@RayLLM:~# . ./ray_env/bin/activate
(ray_env) root@RayLLM:~# serve status
proxies:
  bc4415f9642389391dc17ef2364a8a4b4c3dee46bde15b3bc9f0cc35: HEALTHY
applications:
  app1:
    status: RUNNING
...
```
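With the virtual environment active, the standard Ray CLI is also available. For instance, `ray status` (a generic Ray command, not specific to this appliance) reports the CPUs, GPUs and memory registered with the local Ray cluster, and `serve config` prints the configuration of the running Serve application; the exact output depends on your VM's resources.

```
(ray_env) root@RayLLM:~# ray status
(ray_env) root@RayLLM:~# serve config
```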
Using GPUs
The appliance is designed to utilize all available CPU and GPU resources in the VM by default. However, GPU drivers are not pre-installed and must be installed before the GPUs can be used. GPUs can be added to the VM using:
- PCI Passthrough
- SR-IOV vGPUs
Some configurations may require downloading proprietary drivers and configuring associated licenses. Note: When using NVIDIA cards, select a profile that supports OpenCL and CUDA applications (e.g., Q-series vGPU types).
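As a rough sketch only (assuming an Ubuntu guest and a driver available from the distribution repositories), checking the card and installing a driver could look like the following; vGPU deployments instead require the guest driver package provided through NVIDIA's licensing portal, so always follow the vendor documentation for your card and profile.

```
# Confirm the GPU is visible on the PCI bus
lspci | grep -i nvidia

# Illustrative driver install from the Ubuntu repositories (branch/version may differ)
apt-get update && apt-get install -y nvidia-driver-535-server

# Verify the driver is loaded and the card is usable
nvidia-smi
```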
After deployment, the application should utilize the GPU resources, as can be verified with `nvidia-smi`:
```
root@ray-app-28245:~# nvidia-smi
Tue Dec 31 15:28:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10-24Q On | 00000000:01:01.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 6259MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2286 C ray::ServeReplica:app1:ChatBot 6257MiB |
+---------------------------------------------------------------------------------------+
```