ray_feature - OpenNebula/one-apps GitHub Wiki

Features and Usage

The appliance includes two model serving tools:

  • Ray and its Ray Serve library, which enable scalable deployment of inference APIs across distributed systems.

  • vLLM, a high-performance, open-source inference engine optimized for serving large language models with low latency and high throughput.

This appliance streamlines the deployment of model-serving applications and integrates seamlessly with models available on Hugging Face (note: a Hugging Face account and token may be required for access to some models). It includes a default application to serve the selected model, along with a web-based client for interactive inference.

Contextualization

The appliance's behavior and configuration are controlled by contextualization parameters specified in the VM template's Context Section. Below are the primary configurable aspects:

Inference Deployment Framework

You can select either Ray or vLLM to deploy the inference application and model.

Parameter Default Description
ONEAPP_RAY_AI_FRAMEWORK RAY Select the framework to serve the LLM model (RAY or VLLM).

Inference API

The model is exposed through an API that can be consumed by your application; you can control it with the following parameters (an example request is shown below the table). You can also deploy a web application to interact with the model.

Parameter Default Description
ONEAPP_RAY_API_OPENAI NO Expose the LLM model through an OpenAI-compatible API.
ONEAPP_RAY_API_PORT 8000 Port number for the API endpoint.
ONEAPP_RAY_API_WEB YES Deploy a web application to interact with the model.
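
For example, with ONEAPP_RAY_API_OPENAI set to YES and the default port, the model can be queried with any OpenAI-compatible client. The sketch below assumes the standard /v1/chat/completions path and that the served model name matches ONEAPP_RAY_MODEL_ID; adjust both to your deployment:

# Replace <VM_IP> with the address of the appliance VM
curl http://<VM_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "What is OpenNebula?"}],
        "max_tokens": 128
      }'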

Inference Model

Parameter Default Description
ONEAPP_RAY_MAX_NEW_TOKENS 1024 Maximum number of tokens (input + output) allowed per inference request.
ONEAPP_RAY_MODEL_ID meta-llama/Llama-3.2-3B-Instruct Hugging Face model ID of the LLM to use for inference.
ONEAPP_RAY_MODEL_PROMPT "You are a helpful assistant. Answer the question." System prompt used to steer model responses (ignored when using the OpenAI-compatible API).
ONEAPP_RAY_MODEL_QUANTIZATION 0 Quantize the LLM weights to reduce memory usage and improve inference speed (0 = disabled, 4 or 8 bits).
ONEAPP_RAY_MODEL_TEMPERATURE 0.1 Sampling temperature controlling the randomness of model outputs (ignored when using the OpenAI-compatible API).
ONEAPP_RAY_MODEL_TOKEN - Hugging Face token to access the specified LLM model.
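
As an illustration, the snippet below shows how these parameters might look in the CONTEXT section of the VM template. The values are examples only; the Hugging Face token is needed only for gated models:

CONTEXT = [
  ONEAPP_RAY_AI_FRAMEWORK = "VLLM",
  ONEAPP_RAY_API_OPENAI = "YES",
  ONEAPP_RAY_API_PORT = "8000",
  ONEAPP_RAY_API_WEB = "YES",
  ONEAPP_RAY_MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct",
  ONEAPP_RAY_MODEL_TEMPERATURE = "0.1",
  ONEAPP_RAY_MODEL_TOKEN = "<your Hugging Face token>" ]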

Configuration Files

For full control over the application setup, you can provide a configuration file for the Ray Serve application. Refer to the Ray Serve documentation for a detailed description. Use the following parameter to configure this (an example of encoding the file is shown below the table):

Parameter Default Description
ONEAPP_RAY_CONFIG_FILE64 - Base64-encoded configuration file for the Serve application.
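
For instance, a Ray Serve configuration file (here a hypothetical serve_config.yaml) can be encoded on your workstation and the resulting string pasted into the parameter; a minimal sketch:

# Produce a single-line Base64 string suitable for ONEAPP_RAY_CONFIG_FILE64
base64 -w0 serve_config.yaml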

Using the Appliance from the CLI

Ray and vLLM are installed in a Python virtual environment; you will need to activate it before using the ray or vllm commands:

root@RayLLM:~# . ./ray_env/bin/activate

(ray_env) root@RayLLM:~# serve status
proxies:
  bc4415f9642389391dc17ef2364a8a4b4c3dee46bde15b3bc9f0cc35: HEALTHY
applications:
  app1:
    status: RUNNING
...
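
Once the environment is active, the standard Ray CLI tools can also be used; for example, to inspect cluster resources and the currently deployed Serve configuration:

(ray_env) root@RayLLM:~# ray status
(ray_env) root@RayLLM:~# serve config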

Using GPUs

By default, the appliance is designed to use all CPU and GPU resources available in the VM. However, GPU drivers are not pre-installed; the appropriate drivers must be installed before GPUs can be used. GPUs can be added to the VM using:

  • PCI Passthrough
  • SR-IOV vGPUs

Some configurations may require downloading proprietary drivers and configuring associated licenses. Note: When using NVIDIA cards, select a profile that supports OpenCL and CUDA applications (e.g., Q-series vGPU types).

After deployment, the application should be using the GPU resources, which can be verified with nvidia-smi:

root@ray-app-28245:~# nvidia-smi
Tue Dec 31 15:28:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000000:01:01.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |   6259MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2286      C   ray::ServeReplica:app1:ChatBot             6257MiB |
+---------------------------------------------------------------------------------------+
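
To double-check that the serving stack itself can see the GPU, you can also query PyTorch from inside the virtual environment (assuming PyTorch is available there as a dependency of Ray Serve/vLLM):

(ray_env) root@RayLLM:~# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"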