vllm_quick - OpenNebula/one-apps GitHub Wiki

Quick Start

The vLLM appliance includes a built-in chat application that can be easily deployed using a pre-trained model. This guide shows how to deploy and serve this application:

  1. Download the Appliance. Retrieve the vLLM appliance from the OpenNebula Marketplace using the following command:

    $ onemarketapp export 'service_Vllm' service_Vllm --datastore default
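
    As an optional check, you can confirm that the export finished and that the new image and VM template are registered (the image should reach the READY state once the download completes); the grep filter below is only illustrative:

    $ oneimage list | grep -i vllm
    $ onetemplate list | grep -i vllm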
    
  2. (Optional) Configure the vLLM VM Template. Depending on your specific application requirements, you may need to modify the VM template to adjust resources such as VCPU or MEMORY, or to add GPU cards for enhanced model-serving capabilities, for example:
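
    The attributes below are an illustrative sketch, not appliance defaults: they raise the VM resources and pass an NVIDIA GPU through to the VM via PCI passthrough. The VENDOR, DEVICE, and CLASS values are placeholders and must match a device listed by onehost show on your host. You can edit the template with onetemplate update 'service_Vllm' (which opens it in an editor) and adjust values such as:

    MEMORY = "16384"
    VCPU   = "8"
    PCI = [
      VENDOR = "10de",
      DEVICE = "1db4",
      CLASS  = "0302"
    ]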

  3. Instantiate the Template. Upon instantiation, you will be prompted to configure model-specific parameters, such as the model ID and temperature, as well as to provide your Hugging Face token if required. For example, deploying the Qwen2.5-1.5B-Instruct model results in the following CONTEXT and capacity attributes:

    MEMORY="8192"
    VCPU="4"
    ...
    CONTEXT=[
      DISK_ID="1",
      ETH0_DNS="172.20.0.1",
      ...
      ONEAPP_VLLM_API_PORT="8000",
      ONEAPP_VLLM_API_WEB="YES",
      ONEAPP_VLLM_ENFORCE_EAGER="NO",
      ONEAPP_VLLM_GPU_MEMORY_UTILIZATION="0.9",
      ONEAPP_VLLM_MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct",
      ONEAPP_VLLM_MODEL_MAX_LENGTH="1024",
      ONEAPP_VLLM_MODEL_QUANTIZATION="0",
      ONEAPP_VLLM_MODEL_TOKEN="hf_eDJEEeq*****************",
      ONEAPP_VLLM_SLEEP_MODE="NO",
      ...
    ]
    

    Note: The number of CPUs allocated to the application is automatically derived from the available virtual CPUs.
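
    For reference, the template can also be instantiated from the command line. In a minimal sketch like the one below, the CLI asks for the user inputs defined in the template (model ID, Hugging Face token, and so on); the VM name is arbitrary:

    $ onetemplate instantiate 'service_Vllm' --name vllm-qwen
    VM ID: 71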

  4. Deploy the Application. The deployment process may take several minutes as it downloads the model and required dependencies (e.g., PyTorch and FastAPI). You can monitor the status by logging into the VM:

    • Access the VM via SSH:
    $ onevm ssh 71
    Warning: Permanently added '172.20.0.5' (ED25519) to the list of known hosts.
    Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-127-generic x86_64)
    
    * Documentation:  https://help.ubuntu.com
    * Management:     https://landscape.canonical.com
    * Support:        https://ubuntu.com/pro
    
    System information as of Thu Jan  2 12:01:28 UTC 2025
    
    System load:  0.16               Processes:             130
    Usage of /:   10.5% of 96.73GB   Users logged in:       0
    Memory usage: 89%                IPv4 address for eth0: 172.20.0.5
    Swap usage:   0%
    
    Expanded Security Maintenance for Applications is not enabled.
    
    8 updates can be applied immediately.
    To see these additional updates run: apt list --upgradable
    
    Enable ESM Apps to receive additional future security updates.
    See https://ubuntu.com/esm or run: sudo pro status
          ___   _ __    ___
         / _ \ | '_ \  / _ \   OpenNebula Service Appliance
        | (_) || | | ||  __/
         \___/ |_| |_| \___|
    
     All set and ready to serve 8)
    
    • Verify the vLLM Engine Status:
    root@vllm:~# tail -f /var/log/one-appliance/vllm.log
    

    You should see an output similar to this:

      [...]
    
      (EngineCore_DP0 pid=2420) INFO 10-17 10:04:26 [gpu_worker.py:391] Free memory on device (92.49/93.1 GiB) on startup. Desired GPU memory utilization is (0.9, 83.79 GiB). Actual usage is 2.89 GiB for weight, 5.57 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.58 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=80035307622` to fit into requested memory, or `--kv-cache-memory=89384516096` to fully utilize gpu memory. Current kv cache memory in use is 80811253862 bytes.
      (EngineCore_DP0 pid=2420) INFO 10-17 10:04:26 [core.py:218] init engine (profile, create kv cache, warmup model) took 10.79 seconds
      (APIServer pid=2225) INFO 10-17 10:04:27 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 176154
      (APIServer pid=2225) INFO 10-17 10:04:27 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
      (APIServer pid=2225) INFO 10-17 10:04:28 [api_server.py:1692] Supported_tasks: ['generate']
      (APIServer pid=2225) WARNING 10-17 10:04:28 [__init__.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
      (APIServer pid=2225) INFO 10-17 10:04:28 [serving_responses.py:130] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
      (APIServer pid=2225) INFO 10-17 10:04:28 [serving_chat.py:137] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
      (APIServer pid=2225) INFO 10-17 10:04:28 [serving_completion.py:76] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
      (APIServer pid=2225) INFO 10-17 10:04:28 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
    
      [...]
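
    Once the API server has started (the last log line above), you can run a quick check from inside the VM to confirm the engine is answering requests. This sketch assumes the appliance exposes vLLM's standard OpenAI-compatible routes on the configured API port (8000 in this example); /v1/models should list the served model ID:

    root@vllm:~# curl -s http://localhost:8000/v1/models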
    
  5. Test the Inference Endpoint

    If OneGate is enabled in your OpenNebula installation, the inference endpoint URLs are added to the VM information. Alternatively, you can use the VM's IP address and port 8000:

    $ onevm show 71 | grep 'VLLM.*CHATBOT'
    ONEAPP_VLLM_CHATBOT_API="http://172.20.0.5:8000/chat"
    ONEAPP_VLLM_CHATBOT_WEB="http://172.20.0.5:5000"
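
    To exercise the endpoint from the command line, you can send a chat request directly to the API port. This sketch assumes the appliance exposes vLLM's standard OpenAI-compatible route /v1/chat/completions; adjust the IP address, port, and model ID to match your deployment:

    $ curl -s http://172.20.0.5:8000/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{
              "model": "Qwen/Qwen2.5-1.5B-Instruct",
              "messages": [{"role": "user", "content": "Hello! What can you do?"}],
              "max_tokens": 64
            }'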
    

    To access the built-in web interface, point a web browser to the ONEAPP_VLLM_CHATBOT_WEB URL (in this example, http://172.20.0.5:5000).

    [Screenshot: vLLM built-in chatbot web interface (/images/vllm_web.png)]