
Quick Start

The Dynamo appliance includes a built-in chat application that can be easily deployed using a pre-trained model. This guide shows how to deploy and serve this application:

  1. **Download the Appliance.** Retrieve the Dynamo appliance from the OpenNebula marketplace using the following command:

    $ onemarketapp export 'Service Dynamo' Dynamo --datastore default
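
    Once the export completes, you can verify that the VM template and its image were registered (the template name Dynamo comes from the export command above):

    $ onetemplate show Dynamo
    $ oneimage list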
  2. **(Optional) Configure the Dynamo VM Template.** Depending on your application requirements, you may need to modify the VM template to adjust resources such as VCPU or MEMORY, or to add GPU cards for faster model serving, as sketched below.
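
    For instance, here is a minimal sketch of raising capacity and passing a GPU through to the VM. The PCI VENDOR/CLASS filter is a placeholder; use the IDs reported for your host's GPU:

    $ onetemplate update Dynamo
    # in the editor, adjust capacity and optionally add a PCI passthrough device:
    VCPU="8"
    MEMORY="16384"
    PCI=[ VENDOR="10de", CLASS="0302" ]  # example NVIDIA GPU filter; adjust to your hardware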

  3. **Instantiate the Template.** Upon instantiation, you will be prompted to configure model-specific parameters, such as the model ID or the inference engine to use, as well as to provide your Hugging Face token if required. For example, deploying the Qwen2.5-0.5B-Instruct model with the mistralrs engine results in the following CONTEXT and capacity attributes:

    MEMORY="8192"
    CPU="2"
    VCPU="4"
    ...
    CONTEXT=[
      DISK_ID="1",
      ETH0_DNS="172.20.0.1",
      ...
      ONEAPP_DYNAMO_API_PORT="8000",
      ONEAPP_DYNAMO_MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct",
      ONEAPP_DYNAMO_ENGINE_NAME="mistralrs",
      ...
    ]
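
    From the CLI, you can instantiate the template as follows; you will then be prompted for the user inputs listed above (the VM name dynamo-chatbot is chosen here to match the examples below):

    $ onetemplate instantiate Dynamo --name dynamo-chatbot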
  4. **Deploy the Application.** The deployment process may take several minutes as it downloads the model. You can monitor the status by logging into the VM:

    • Access the VM via SSH:
    $ onevm ssh dynamo-chatbot
    Warning: Permanently added '172.20.0.5' (ED25519) to the list of known hosts.
    Welcome to Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-60-generic x86_64)
    
    * Documentation:  https://help.ubuntu.com
    * Management:     https://landscape.canonical.com
    * Support:        https://ubuntu.com/pro
    
    System information as of Thu May 15 18:00:55 UTC 2025
    
    System load:  0.03               Processes:             187
    Usage of /:   59.1% of 23.17GB   Users logged in:       0
    Memory usage: 18%                IPv4 address for eth0: 172.20.0.5
    Swap usage:   0%
    
    
    Expanded Security Maintenance for Applications is not enabled.
    
    0 updates can be applied immediately.
    
    Enable ESM Apps to receive additional future security updates.
    See https://ubuntu.com/esm or run: sudo pro status
    
    
    
      ___   _ __    ___
     / _ \ | '_ \  / _ \   OpenNebula Service Appliance
    | (_) || | | ||  __/
     \___/ |_| |_| \___|
    
    
    All set and ready to serve 8)

    Verify the status of the Dynamo service:

    root@dynamo-chatbot:~# tail -f /var/log/one-appliance/configure.log

    You should see output similar to this:

      I, [2025-05-15T17:31:14.577203 #1855]  INFO -- : Dynamo::configure
      I, [2025-05-15T17:31:14.577227 #1855]  INFO -- : Starting Dynamo...
      I, [2025-05-15T17:31:14.580879 #1855]  INFO -- : Starting Dynamo Web App...
      I, [2025-05-15T17:31:14.585500 #1855]  INFO -- : Dynamo web app running at http://localhost:5001
      I, [2025-05-15T17:31:14.585567 #1855]  INFO -- : Configuration completed successfully
      * Serving Flask app 'web_client'
      * Debug mode: on
      WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
      * Running on all addresses (0.0.0.0)
      * Running on http://127.0.0.1:5001
      * Running on http://172.20.0.5:5001
      Press CTRL+C to quit
      2025-05-15T17:31:17.023Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
      2025-05-15T17:31:17.355Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
      2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer.json`
      2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
      2025-05-15T17:31:17.356Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/config.json`
      2025-05-15T17:31:17.360Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::paths: Loading `model.safetensors` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/model.safetensors`
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/generation_config.json`
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775`
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` locally at `/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer_config.json`
      2025-05-15T17:31:17.361Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
      2025-05-15T17:31:17.367Z  INFO mistralrs_core::utils::normal: DType selected is F16.
      2025-05-15T17:31:17.369Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
      2025-05-15T17:31:17.415Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
      2025-05-15T17:31:17.416Z  INFO mistralrs_core::utils::log: Model has 24 repeating layers.
      2025-05-15T17:31:17.416Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
      2025-05-15T17:31:17.422Z  INFO mistralrs_core::utils::log: Layers 0-23: cpu (8 GB)
      2025-05-15T17:31:17.427Z  INFO mistralrs_core::utils::normal: DType selected is F16.
      2025-05-15T17:31:17.427Z  WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
      2025-05-15T17:31:17.427Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 896, intermediate_size: 4864, num_hidden_layers: 24, num_attention_heads: 14, num_key_value_heads: 2, max_position_embeddings: 32768, sliding_window: 32768, rope_theta: 1000000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: true }
      100%|████████████████████████████████████████| 290/290 [00:04<00:00, 215.20it/s]
      2025-05-15T17:31:23.422Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", "<|endoftext|>", unk_tok = `None`
      2025-05-15T17:31:23.447Z  INFO mistralrs_core: Beginning dummy run.
      2025-05-15T17:31:23.448Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled! Expect higher multi-turn prompt throughput.
      2025-05-15T17:31:27.051Z  INFO mistralrs_core: Dummy run completed in 3.603769851s.
      2025-05-15T17:31:27.064Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
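
    Once the "Starting HTTP service" line appears, the endpoint is up. As a quick smoke test from inside the VM, you can list the models being served (this assumes the frontend exposes the OpenAI-compatible /v1/models route alongside /v1/chat/completions):

    root@dynamo-chatbot:~# curl -s http://localhost:8000/v1/models | jq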
    
  5. **Test the Inference Endpoint.** If OneGate is enabled in your OpenNebula installation, the inference endpoint URL is published in the VM's information. Alternatively, you can use the VM's IP address and port 8000:

    $ onevm show dynamo-chatbot | grep 'DYNAMO.*CHATBOT'
    ONEAPP_DYNAMO_CHATBOT_URL="http://172.20.0.5:8000/v1/chat/completions"
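
    If OneGate is not enabled, you can look up the VM's IP address in its context instead (ETH0_IP is the standard context attribute for the first NIC):

    $ onevm show dynamo-chatbot | grep ETH0_IP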

    You can access the chatbot API directly, for instance with curl (note that the model name in the request must match the deployed model):

    user@opennebula:~$ curl 172.20.0.5:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "Qwen2.5-0.5B-Instruct",
      "messages": [
        {
            "role": "user",
            "content": "Which is the capital of Spain?"
        }
      ],
      "stream":false,
      "max_tokens": 300
    }' | jq
    

    You should get a JSON response like the following, with the answer in the message field:

    {
        "id": "3",
        "choices": [
            {
                "index": 0,
                "message": {
                    "content": "The capital of Spain is Madrid.",
                    "refusal": null,
                    "tool_calls": null,
                    "role": "assistant",
                    "function_call": null,
                    "audio": null
                },
                "finish_reason": "stop",
                "logprobs": null
            }
        ],
        "created": 3575561985,
        "model": "/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775",
        "service_tier": null,
        "system_fingerprint": "local",
        "object": "chat.completion",
        "usage": null
    }
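
    To extract only the assistant's reply, filter the response with jq:

    $ curl -s 172.20.0.5:8000/v1/chat/completions -H "Content-Type: application/json" \
        -d '{"model": "Qwen2.5-0.5B-Instruct", "messages": [{"role": "user", "content": "Which is the capital of Spain?"}], "stream": false, "max_tokens": 300}' \
        | jq -r '.choices[0].message.content'
    The capital of Spain is Madrid.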
    

    Alternatively, you can access the built-in web interface from a browser: point it to the ONEAPP_DYNAMO_CHATBOT_WEB URL, in this example http://172.20.0.5:5001.
