The Dynamo appliance includes a built-in chat application that can be easily deployed using a pre-trained model. This guide shows how to deploy and serve this application:
- **Download the Appliance.** Retrieve the Dynamo appliance from the OpenNebula Marketplace using the following command:

  ```
  $ onemarketapp export 'Service Dynamo' Dynamo --datastore default
  ```
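  The export registers a VM template and its disk image. If you want to confirm it finished before moving on, you can check both with the standard CLI; note that the image may stay in the `LOCKED` state while it is still being downloaded from the marketplace:

  ```
  # Check the newly registered template and image (names follow the export above)
  $ onetemplate show Dynamo | head -n 5
  $ oneimage list | grep -i dynamo
  ```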
- **(Optional) Configure the Dynamo VM Template.** Depending on your specific application requirements, you may need to modify the VM template to adjust resources such as `VCPU` or `MEMORY`, or to add GPU cards for enhanced model serving capabilities; a sketch of such an update follows.
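  As an illustration (the capacity values and the PCI vendor ID below are examples, not requirements), the exported template can be extended from the command line:

  ```
  # Bump capacity and request an NVIDIA GPU via PCI passthrough
  # (values are illustrative; adjust them to your hardware)
  $ cat > dynamo_extra.txt <<'EOF'
  MEMORY = "16384"
  VCPU   = "8"
  PCI    = [ VENDOR = "10de" ]
  EOF
  $ onetemplate update Dynamo dynamo_extra.txt --append
  ```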
- **Instantiate the Template.** Upon instantiation, you will be prompted to configure model-specific parameters, such as the model ID or the inference engine to use, as well as provide your Hugging Face token if required. For example, deploying the `Qwen2.5-0.5B-Instruct` model with the `mistralrs` engine will result in the following `CONTEXT` and capacity attributes:

  ```
  MEMORY="8192"
  CPU="2"
  VCPU="4"
  ...
  CONTEXT=[
    DISK_ID="1",
    ETH0_DNS="172.20.0.1",
    ...
    ONEAPP_DYNAMO_API_PORT="8000",
    ONEAPP_DYNAMO_MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct",
    ONEAPP_DYNAMO_ENGINE_NAME="mistralrs",
    ... ]
  ```
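  The instantiation itself is a single CLI call; the `ONEAPP_*` parameters above are requested interactively (the VM name is arbitrary, but the rest of this guide assumes `dynamo-chatbot`):

  ```
  # Instantiate the template; answer the prompts for model ID, engine
  # and (if the model requires it) your Hugging Face token
  $ onetemplate instantiate Dynamo --name dynamo-chatbot
  ```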
- **Deploy the Application.** The deployment process may take several minutes as it downloads the model. You can monitor the status by logging into the VM:

  - Access the VM via SSH:

    ```
    $ onevm ssh dynamo-chatbot
    Warning: Permanently added '172.20.0.5' (ED25519) to the list of known hosts.
    Welcome to Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-60-generic x86_64)

     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/pro

     System information as of Thu May 15 18:00:55 UTC 2025

      System load:  0.03               Processes:             187
      Usage of /:   59.1% of 23.17GB   Users logged in:       0
      Memory usage: 18%                IPv4 address for eth0: 172.20.0.5
      Swap usage:   0%

    Expanded Security Maintenance for Applications is not enabled.

    0 updates can be applied immediately.

    Enable ESM Apps to receive additional future security updates.
    See https://ubuntu.com/esm or run: sudo pro status

        ___   _ __    ___
       / _ \ | '_ \  / _ \   OpenNebula Service Appliance
      | (_) || | | ||  __/
       \___/ |_| |_| \___|   All set and ready to serve 8)
    ```
  - Verify the status of the Dynamo service:

    ```
    root@dynamo-chatbot:~# tail -f /var/log/one-appliance/configure.log
    ```

    You should see output similar to this:
    ```
    I, [2025-05-15T17:31:14.577203 #1855] INFO -- : Dynamo::configure
    I, [2025-05-15T17:31:14.577227 #1855] INFO -- : Starting Dynamo...
    I, [2025-05-15T17:31:14.580879 #1855] INFO -- : Starting Dynamo Web App...
    I, [2025-05-15T17:31:14.585500 #1855] INFO -- : Dynamo web app running at http://localhost:5001
    I, [2025-05-15T17:31:14.585567 #1855] INFO -- : Configuration completed successfully
     * Serving Flask app 'web_client'
     * Debug mode: on
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
     * Running on all addresses (0.0.0.0)
     * Running on http://127.0.0.1:5001
     * Running on http://172.20.0.5:5001
    Press CTRL+C to quit
    2025-05-15T17:31:17.023Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
    2025-05-15T17:31:17.355Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer.json`
    2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/config.json`
    2025-05-15T17:31:17.360Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::paths: Loading `model.safetensors` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/model.safetensors`
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/generation_config.json`
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer_config.json`
    2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
    2025-05-15T17:31:17.367Z  INFO mistralrs_core::utils::normal: DType selected is F16.
    2025-05-15T17:31:17.369Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
    2025-05-15T17:31:17.415Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
    2025-05-15T17:31:17.416Z  INFO mistralrs_core::utils::log: Model has 24 repeating layers.
    2025-05-15T17:31:17.416Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
    2025-05-15T17:31:17.422Z  INFO mistralrs_core::utils::log: Layers 0-23: cpu (8 GB)
    2025-05-15T17:31:17.427Z  INFO mistralrs_core::utils::normal: DType selected is F16.
    2025-05-15T17:31:17.427Z  WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
    2025-05-15T17:31:17.427Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 896, intermediate_size: 4864, num_hidden_layers: 24, num_attention_heads: 14, num_key_value_heads: 2, max_position_embeddings: 32768, sliding_window: 32768, rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: true }
    100%|████████████████████████████████████████| 290/290 [00:04<00:00, 215.20it/s]
    2025-05-15T17:31:23.422Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
    2025-05-15T17:31:23.447Z  INFO mistralrs_core: Beginning dummy run.
    2025-05-15T17:31:23.448Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled! Expect higher multi-turn prompt throughput.
    2025-05-15T17:31:27.051Z  INFO mistralrs_core: Dummy run completed in 3.603769851s.
    2025-05-15T17:31:27.064Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
    I, [2025-05-15T17:33:40.704589 #1847] INFO -- : Dynamo::configure
    I, [2025-05-15T17:33:40.704613 #1847] INFO -- : Starting Dynamo...
    I, [2025-05-15T17:33:40.708244 #1847] INFO -- : Starting Dynamo Web App...
    I, [2025-05-15T17:33:40.711826 #1847] INFO -- : Dynamo web app running at http://localhost:5001
    I, [2025-05-15T17:33:40.711909 #1847] INFO -- : Configuration completed successfully
     * Serving Flask app 'web_client'
     * Debug mode: on
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
     * Running on all addresses (0.0.0.0)
     * Running on http://127.0.0.1:5001
     * Running on http://172.20.0.5:5001
    Press CTRL+C to quit
    ```
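  Rather than watching the log, you can also poll the API port from the frontend until the service answers. This sketch assumes the service exposes the usual OpenAI-compatible `/v1/models` route on port `8000` (adjust the IP to your VM):

  ```
  # Block until the inference endpoint starts responding
  $ until curl -fs http://172.20.0.5:8000/v1/models >/dev/null; do
      echo "waiting for Dynamo..."
      sleep 5
    done
  ```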
- **Test the Inference Endpoint.** If `one-gate` is enabled in your OpenNebula installation, the inference endpoint URL should be added to the VM information. Alternatively, you can use the VM's IP address and port `8000`:

  ```
  $ onevm show dynamo-chatbot | grep DYNAMO.*CHATBOT
  ONEAPP_DYNAMO_CHATBOT_URL="http://172.20.0.3:8000/v1/chat/completions"
  ```
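  If you need the endpoint programmatically (for example from a deployment script), the same attribute can be extracted from the VM's JSON document; this assumes the attribute is published under `USER_TEMPLATE`, as is usual for OneGate-reported values:

  ```
  # Pull the chatbot endpoint URL out of the VM metadata
  $ onevm show dynamo-chatbot --json \
      | jq -r '.VM.USER_TEMPLATE.ONEAPP_DYNAMO_CHATBOT_URL'
  http://172.20.0.3:8000/v1/chat/completions
  ```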
  We can also access the chatbot API directly, for instance with `curl` (note that the model name must be indicated correctly):

  ```
  user@opennebula:~$ curl 172.20.0.5:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "Qwen2.5-0.5B-Instruct",
            "messages": [
              { "role": "user", "content": "Which is the capital of Spain?" }
            ],
            "stream": false,
            "max_tokens": 300
          }' | jq
  ```

  We should get a JSON response like the following, with the answer in the `message` field:

  ```
  {
    "id": "3",
    "choices": [
      {
        "index": 0,
        "message": {
          "content": "The capital of Spain is Madrid.",
          "refusal": null,
          "tool_calls": null,
          "role": "assistant",
          "function_call": null,
          "audio": null
        },
        "finish_reason": "stop",
        "logprobs": null
      }
    ],
    "created": 3575561985,
    "model": "/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775",
    "service_tier": null,
    "system_fingerprint": "local",
    "object": "chat.completion",
    "usage": null
  }
  ```

  Alternatively, you can use a web browser to access the built-in web interface: just point it to the `ONEAPP_DYNAMO_CHATBOT_WEBURL`, in this example `http://172.20.0.5:5001`.
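  Back on the API side, if you only need the generated text rather than the whole response object, a small `jq` filter over the JSON above is enough:

  ```
  # Same request as before, keeping only the assistant's answer
  $ curl -s 172.20.0.5:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{ "model": "Qwen2.5-0.5B-Instruct",
            "messages": [ { "role": "user", "content": "Which is the capital of Spain?" } ],
            "stream": false, "max_tokens": 300 }' \
      | jq -r '.choices[0].message.content'
  The capital of Spain is Madrid.
  ```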