dyn_quick - aleixrm/one-apps GitHub Wiki
The Dynamo appliance includes a built-in chat application that can be easily deployed using a pre-trained model. This guide shows how to deploy and serve this application:
- **Download the Appliance**

  Retrieve the Dynamo appliance from the OpenNebula marketplace using the following command:

  ```
  $ onemarketapp export 'Service Dynamo' Dynamo --datastore default
  ```
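  Once the export finishes, you can confirm that the template and its image were registered; a quick check, assuming the default `Dynamo` name from the command above:

  ```
  $ onetemplate show Dynamo    # VM template created by the export
  $ oneimage list              # the appliance image should appear here
  ```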
- **(Optional) Configure the Dynamo VM Template**

  Depending on your specific application requirements, you may need to modify the VM template to adjust resources such as `VCPU` or `MEMORY`, or to add GPU cards for enhanced model serving capabilities, as sketched below.
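  For example, a template update that raises capacity and adds an NVIDIA GPU via PCI passthrough might look like the following. This is a minimal sketch: the `gpu.tmpl` file name and the `VENDOR`/`CLASS` values are illustrative assumptions, and the PCI device must actually exist on the host:

  ```
  $ cat gpu.tmpl
  VCPU   = "8"
  MEMORY = "16384"
  PCI = [
    VENDOR = "10de",  # NVIDIA vendor ID (illustrative)
    CLASS  = "0302" ] # 3D controller device class (illustrative)

  $ onetemplate update Dynamo --append gpu.tmpl
  ```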
- **Instantiate the Template**

  Upon instantiation, you will be prompted to configure model-specific parameters, such as the model ID or the inference engine to use, as well as provide your Hugging Face token if required. For example, deploying the `Qwen2.5-0.5B-Instruct` model with the `mistralrs` engine will result in the following `CONTEXT` and capacity attributes:

  ```
  MEMORY="8192"
  CPU="2"
  VCPU="4"
  ...
  CONTEXT=[
    DISK_ID="1",
    ETH0_DNS="172.20.0.1",
    ...
    ONEAPP_DYNAMO_API_PORT="8000",
    ONEAPP_DYNAMO_MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct",
    ONEAPP_DYNAMO_ENGINE_NAME="mistralrs",
    ... ]
  ```
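  For reference, instantiation from the CLI can be as simple as the command below; the CLI then prompts interactively for the model-specific parameters. The `dynamo-chatbot` VM name matches the one used in the examples that follow:

  ```
  $ onetemplate instantiate Dynamo --name dynamo-chatbot
  ```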
- **Deploy the Application**

  The deployment process may take several minutes as it downloads the model. You can monitor the status by logging into the VM:

  - Access the VM via SSH:

    ```
    $ onevm ssh dynamo-chatbot
    Warning: Permanently added '172.20.0.5' (ED25519) to the list of known hosts.
    Welcome to Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-60-generic x86_64)

     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/pro

     System information as of Thu May 15 18:00:55 UTC 2025

      System load:  0.03              Processes:             187
      Usage of /:   59.1% of 23.17GB  Users logged in:       0
      Memory usage: 18%               IPv4 address for eth0: 172.20.0.5
      Swap usage:   0%

    Expanded Security Maintenance for Applications is not enabled.

    0 updates can be applied immediately.

    Enable ESM Apps to receive additional future security updates.
    See https://ubuntu.com/esm or run: sudo pro status

        ___   _ __    ___
       / _ \ | '_ \  / _ \   OpenNebula Service Appliance
      | (_) || | | ||  __/
       \___/ |_| |_| \___|   All set and ready to serve 8)
    ```
  - Verify the status of the Dynamo service:

    ```
    root@dynamo-chatbot:~# tail -f /var/log/one-appliance/configure.log
    ```

    You should see output similar to this:
    ```
    I, [2025-05-15T17:31:14.577203 #1855] INFO -- : Dynamo::configure
    I, [2025-05-15T17:31:14.577227 #1855] INFO -- : Starting Dynamo...
    I, [2025-05-15T17:31:14.580879 #1855] INFO -- : Starting Dynamo Web App...
    I, [2025-05-15T17:31:14.585500 #1855] INFO -- : Dynamo web app running at http://localhost:5001
    I, [2025-05-15T17:31:14.585567 #1855] INFO -- : Configuration completed successfully
     * Serving Flask app 'web_client'
     * Debug mode: on
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
     * Running on all addresses (0.0.0.0)
     * Running on http://127.0.0.1:5001
     * Running on http://172.20.0.5:5001
    Press CTRL+C to quit
    2025-05-15T17:31:17.023Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
    2025-05-15T17:31:17.355Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.356Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer.json`
    2025-05-15T17:31:17.356Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.356Z INFO mistralrs_core::pipeline::normal: Loading `config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/config.json`
    2025-05-15T17:31:17.360Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::paths: Loading `model.safetensors` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/model.safetensors`
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/generation_config.json`
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer_config.json`
    2025-05-15T17:31:17.361Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
    2025-05-15T17:31:17.367Z INFO mistralrs_core::utils::normal: DType selected is F16.
    2025-05-15T17:31:17.369Z INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
    2025-05-15T17:31:17.415Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
    2025-05-15T17:31:17.416Z INFO mistralrs_core::utils::log: Model has 24 repeating layers.
    2025-05-15T17:31:17.416Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
    2025-05-15T17:31:17.422Z INFO mistralrs_core::utils::log: Layers 0-23: cpu (8 GB)
    2025-05-15T17:31:17.427Z INFO mistralrs_core::utils::normal: DType selected is F16.
    2025-05-15T17:31:17.427Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
    2025-05-15T17:31:17.427Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 896, intermediate_size: 4864, num_hidden_layers: 24, num_attention_heads: 14, num_key_value_heads: 2, max_position_embeddings: 32768, sliding_window: 32768, rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: true }
    100%|████████████████████████████████████████| 290/290 [00:04<00:00, 215.20it/s]
    2025-05-15T17:31:23.422Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
    2025-05-15T17:31:23.447Z INFO mistralrs_core: Beginning dummy run.
    2025-05-15T17:31:23.448Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled! Expect higher multi-turn prompt throughput.
    2025-05-15T17:31:27.051Z INFO mistralrs_core: Dummy run completed in 3.603769851s.
    2025-05-15T17:31:27.064Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
    I, [2025-05-15T17:33:40.704589 #1847] INFO -- : Dynamo::configure
    I, [2025-05-15T17:33:40.704613 #1847] INFO -- : Starting Dynamo...
    I, [2025-05-15T17:33:40.708244 #1847] INFO -- : Starting Dynamo Web App...
    I, [2025-05-15T17:33:40.711826 #1847] INFO -- : Dynamo web app running at http://localhost:5001
    I, [2025-05-15T17:33:40.711909 #1847] INFO -- : Configuration completed successfully
     * Serving Flask app 'web_client'
     * Debug mode: on
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
     * Running on all addresses (0.0.0.0)
     * Running on http://127.0.0.1:5001
     * Running on http://172.20.0.5:5001
    Press CTRL+C to quit
    ```
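    Once the log shows the HTTP service listening on port `8000`, the model is ready to serve requests. As a quick sanity check, you can probe the endpoint from inside the VM; a minimal sketch, assuming the service exposes the OpenAI-compatible `/v1/models` route:

    ```
    # Assumption: the Dynamo HTTP service follows the OpenAI API layout.
    root@dynamo-chatbot:~# curl -s localhost:8000/v1/models | jq
    ```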
- **Test the Inference Endpoint**

  If OneGate is enabled in your OpenNebula installation, the inference endpoint URL should be added to the VM information. Alternatively, you can use the VM's IP address and port `8000`:

  ```
  $ onevm show dynamo-chatbot | grep DYNAMO.*CHATBOT
  ONEAPP_DYNAMO_CHATBOT_URL="http://172.20.0.3:8000/v1/chat/completions"
  ```
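  The same attribute should also be readable from inside the VM through OneGate; a hedged sketch, assuming the `onegate` CLI is available in the guest and OneGate is enabled for this VM:

  ```
  # Assumption: OneGate is reachable from the guest; output layout may vary.
  root@dynamo-chatbot:~# onegate vm show --json | grep ONEAPP_DYNAMO
  ```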
  You can query the chatbot API directly, for instance with curl (note that the model name must match the deployed model):

  ```
  user@opennebula:~$ curl 172.20.0.5:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen2.5-0.5B-Instruct",
      "messages": [
        { "role": "user", "content": "Which is the capital of Spain?" }
      ],
      "stream": false,
      "max_tokens": 300
    }' | jq
  ```
  You should get a JSON response similar to the following, with the answer in the `message` field:

  ```
  {
    "id": "3",
    "choices": [
      {
        "index": 0,
        "message": {
          "content": "The capital of Spain is Madrid.",
          "refusal": null,
          "tool_calls": null,
          "role": "assistant",
          "function_call": null,
          "audio": null
        },
        "finish_reason": "stop",
        "logprobs": null
      }
    ],
    "created": 3575561985,
    "model": "/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775",
    "service_tier": null,
    "system_fingerprint": "local",
    "object": "chat.completion",
    "usage": null
  }
  ```
  Alternatively, you can use a web browser to access the built-in web interface; just point it to the `ONEAPP_DYNAMO_CHATBOT_WEB` URL, in this example `http://172.20.0.5:5001` (the port reported by the web app in the configure log above).
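  If the VM sits on a private network that your browser cannot reach directly, an SSH tunnel through a host that can reach it works as well; an illustrative sketch (`user@one-frontend` is an assumption, replace it with your own jump host):

  ```
  # Forward local port 5001 to the chatbot web UI, then browse http://localhost:5001
  $ ssh -N -L 5001:172.20.0.5:5001 user@one-frontend
  ```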