ray_feature - OpenNebula/one-apps GitHub Wiki

Features and Usage

The appliance includes two model serving tools:

  • Ray and its Ray Serve library, which enable scalable deployment of inference APIs across distributed systems.

  • vLLM, a high-performance, open-source inference engine optimized for serving large language models with low latency and high throughput.

This appliance streamlines the deployment of model-serving applications and integrates seamlessly with models available on Hugging Face (note: a Hugging Face account and token may be required for access to some models). It includes a default application to serve the selected model, along with a web-based client for interactive inference.

Contextualization

The appliance's behavior and configuration are controlled by contextualization parameters specified in the VM template's Context Section. Below are the primary configurable aspects:

Inference Deployment Framework

You can select either Ray or vLLM to deploy the inference application and model.

Parameter Default Description
ONEAPP_RAY_AI_FRAMEWORK RAY Select the framework to serve the LLM model (RAY or VLLM).

Inference API

The model is exposed through an API that can be consumed by your application; you can control it with the following parameters (an example request is shown below the table). You can also deploy a web application to interact with the model.

Parameter Default Description
ONEAPP_RAY_API_OPENAI NO Expose the LLM model through an OpenAI-compatible API.
ONEAPP_RAY_API_PORT 8000 Port number for the API endpoint.
ONEAPP_RAY_API_WEB YES Deploy a web application to interact with the model.
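
For example, with ONEAPP_RAY_API_OPENAI set to YES and the default port, the model can be queried with any OpenAI-compatible client. The sketch below assumes the standard /v1/chat/completions path and that the served model name matches ONEAPP_RAY_MODEL_ID; adjust both to your deployment:

# Replace <VM_IP> with the address of the appliance VM
curl http://<VM_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "What is OpenNebula?"}],
        "max_tokens": 128
      }'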

Inference Model

Parameter Default Description
ONEAPP_RAY_MAX_NEW_TOKENS 1024 Maximum number of tokens (input + output) allowed per inference request.
ONEAPP_RAY_MODEL_ID meta-llama/Llama-3.2-3B-Instruct Hugging Face model ID of the LLM to use for inference.
ONEAPP_RAY_MODEL_PROMPT "You are a helpful assistant. Answer the question." System prompt used to steer model responses (ignored when using the OpenAI-compatible API).
ONEAPP_RAY_MODEL_QUANTIZATION 0 Quantize the LLM weights to reduce memory usage and improve inference speed (0 = disabled, 4 or 8 bits).
ONEAPP_RAY_MODEL_TEMPERATURE 0.1 Sampling temperature controlling the randomness of model outputs (ignored when using the OpenAI-compatible API).
ONEAPP_RAY_MODEL_TOKEN - Hugging Face token to access the specified LLM model.
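
As an illustration, the snippet below shows how these parameters might look in the CONTEXT section of the VM template. The values are examples only; the Hugging Face token is needed only for gated models:

CONTEXT = [
  ONEAPP_RAY_AI_FRAMEWORK = "VLLM",
  ONEAPP_RAY_API_OPENAI = "YES",
  ONEAPP_RAY_API_PORT = "8000",
  ONEAPP_RAY_API_WEB = "YES",
  ONEAPP_RAY_MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct",
  ONEAPP_RAY_MODEL_TEMPERATURE = "0.1",
  ONEAPP_RAY_MODEL_TOKEN = "<your Hugging Face token>" ]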

Configuration Files

For full control over the application setup, you can provide a configuration file for the Ray Serve application. Refer to the Ray Serve documentation for a detailed description. Use the following parameter to configure this (an example of encoding the file is shown below the table):

Parameter Default Description
ONEAPP_RAY_CONFIG_FILE64 - Base64-encoded configuration file for the Serve application.
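
For instance, a Ray Serve configuration file (here a hypothetical serve_config.yaml) can be encoded on your workstation and the resulting string pasted into the parameter; a minimal sketch:

# Produce a single-line Base64 string suitable for ONEAPP_RAY_CONFIG_FILE64
base64 -w0 serve_config.yaml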

Using the Appliance from the CLI

Ray and vLLM are installed in a Python virtual environment; you will need to activate it before using the ray or vllm commands:

root@RayLLM:~# . ./ray_env/bin/activate

(ray_env) root@RayLLM:~# serve status
proxies:
  bc4415f9642389391dc17ef2364a8a4b4c3dee46bde15b3bc9f0cc35: HEALTHY
applications:
  app1:
    status: RUNNING
...
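
Once the environment is active, the standard Ray CLI tools can also be used; for example, to inspect cluster resources and the currently deployed Serve configuration:

(ray_env) root@RayLLM:~# ray status
(ray_env) root@RayLLM:~# serve config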

Using GPUs

By default, the appliance is designed to use all CPU and GPU resources available in the VM. However, GPU drivers are not pre-installed; the appropriate drivers must be installed before GPUs can be used. GPUs can be added to the VM using:

  • PCI Passthrough
  • SR-IOV vGPUs

Some configurations may require downloading proprietary drivers and configuring associated licenses. Note: When using NVIDIA cards, select a profile that supports OpenCL and CUDA applications (e.g., Q-series vGPU types).

After deployment, the application should be using the GPU resources, which can be verified with nvidia-smi:

root@ray-app-28245:~# nvidia-smi
Tue Dec 31 15:28:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000000:01:01.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |   6259MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2286      C   ray::ServeReplica:app1:ChatBot             6257MiB |
+---------------------------------------------------------------------------------------+
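
To double-check that the serving stack itself can see the GPU, you can also query PyTorch from inside the virtual environment (assuming PyTorch is available there as a dependency of Ray Serve/vLLM):

(ray_env) root@RayLLM:~# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"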