Updated Plan

Distributed Deployment and Benchmarking

Steps

Write an Ansible playbook that wires together the following scripts (a top-level layout is sketched after this list):

login.sh
deploy.sh
start_loadgen.sh
start_demo.sh
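
A minimal top-level layout, assuming each shell script above is mirrored by a playbook of the same name (the file names here are placeholders, not the project's actual files):

```yaml
# site.yml -- hypothetical top-level playbook; one imported playbook
# per script listed above
- import_playbook: login.yml     # password-less access setup (login.sh)
- import_playbook: deploy.yml    # endpoint deployment (deploy.sh)
- import_playbook: loadgen.yml   # workload generation (start_loadgen.sh)
- import_playbook: demo.yml      # demo UI (start_demo.sh)
```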

Install dependencies on all machines/nodes:

  1. Set up password-less SSH access.
  2. Install Docker on each machine (identify the Docker images and the inference server); a playbook sketch for this step follows the list.
  3. Run the deployment script.
  4. Run the loadgen script.
  5. Run the demo script.
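
A sketch of the Docker-install step as a playbook, assuming Debian/Ubuntu nodes and the distro's `docker.io` package (adjust for the actual OS and Docker repository):

```yaml
# install_docker.yml -- sketch; package name and OS family are assumptions
- hosts: all
  become: true
  tasks:
    - name: Install the Docker engine from the distro repository
      ansible.builtin.apt:
        name: docker.io
        state: present
        update_cache: true

    - name: Ensure the Docker daemon is running and enabled at boot
      ansible.builtin.service:
        name: docker
        state: started
        enabled: true
```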

Entry-point:

Start.sh (Admin Node-1) (CPU only) (Ansible)

  • NodeList: ["Node-1", "Node-2", "Node-3", "Node-4", "Node-5"]
  • Set up password-less access to each node (inventory and login-play sketches follow this list)
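
One way to express the NodeList and the password-less access step, assuming an existing keypair on the admin node (host names and the key path are placeholders; the first run needs `--ask-pass`, since keys have not been pushed yet):

```yaml
# inventory.yml -- placeholder hosts matching the NodeList
all:
  hosts:
    node-1:
    node-2:
    node-3:
    node-4:
    node-5:

# login.yml -- push the admin node's public key to every node
- hosts: all
  tasks:
    - name: Authorize the admin key for password-less SSH
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        state: present
        key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
```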

1. Deployment Script.sh (scale replicas to reach the expected throughput, latency, or other target metric; a per-node Docker Compose sketch follows this list)

  • Node-1 (CPU only)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
  • Node-2 (CPU + GPU)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
  • Node-3 (CPU + FPGA)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
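
A per-node Docker Compose sketch consistent with the "Docker-Compose with HA-Proxy" notes under Node Details; the image names, port, and replica count are assumptions, not the project's actual values:

```yaml
# docker-compose.yml -- sketch of one node's three endpoints behind HAProxy
services:
  llama:
    image: ollama/ollama          # assumed serving image for Llama3.2:1b
    deploy:
      replicas: 2                 # raise to hit the expected throughput/latency
  vit:
    image: vit-server:latest      # placeholder ViT inference image
  ew:
    image: ew-server:latest       # placeholder EW inference image
  haproxy:
    image: haproxy:2.9
    ports:
      - "8080:8080"               # single load-balanced API endpoint per node
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
```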

2. Run Workload Gen.sh

  • Node-4 (CPU only)
    • Start EchoSwift --> Llama3.2:1b Endpoint
      • (Run the benchmark to discover the optimal user count for the given SUT, then scale the number of replicas to reach the expected QPS/throughput; a loadgen play sketch follows this list)
    • Start ViT LoadGen --> ViT Endpoint
    • Start EW --> EW Endpoint
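
A sketch of the loadgen step as a play against Node-4; the exact EchoSwift command line and config path below are assumptions, so check the EchoSwift documentation for the real invocation:

```yaml
# loadgen.yml -- sketch; the echoswift invocation is an assumption
- hosts: node-4
  tasks:
    # Sweep user counts against the Llama3.2:1b endpoint to find the optimum
    - name: Run EchoSwift against the Llama3.2:1b endpoint
      ansible.builtin.shell: echoswift start --config /opt/echoswift/config.yaml
```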

3. Demo.sh (Node-1)

  • a) Llama-UI (chat front-end for the Llama3.2:1b endpoint)

    • Chat-UI
    • Metrics:
      1. Latency
      2. Throughput
      3. TTFT (time to first token)
  • b) ViT-UI

    • Classification-UI
    • Metrics:
      1. Latency
      2. Throughput (Samples/Second)
  • c) EW-UI (TBD)

  • d) Run the benchmark to discover the optimal user count (concurrent/parallel requests) and the number of replicas required for the given SUT, then scale the replica count to meet the expected QPS/throughput (a scaling sketch follows this section).
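
For step (d), a minimal scaling sketch: once the benchmark yields the replica count needed for the target QPS, the replicas can be scaled in place (service name, compose directory, and default count are assumptions):

```yaml
# scale.yml -- sketch; pass -e replicas=N with the count the benchmark suggests
- hosts: all
  tasks:
    - name: Scale the Llama service to the benchmark-derived replica count
      ansible.builtin.command: docker compose up -d --scale llama={{ replicas | default(2) }}
      args:
        chdir: /opt/stack          # assumed compose project directory
```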

Node Details:

  • Node-1 Admin (Deployment Scripts/Load Generator Scripts/Demo Scripts) (Main Entry-point)
  • Node-2 (Llama3.2:1b, ViT, EW Running on CPU only) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-3 (Llama3.2:1b, ViT, EW served from the CPU host, with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-4 (Llama3.2:1b, ViT, EW served from the CPU host, with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-5 (CPU only) (Demo) (UI App) (3 APIs)

Misc. Info:

  • Ansible Playbook