Updated Plan

Distributed Deployment and Benchmarking

Steps

Write an Ansible playbook that wires together the following scripts (a top-level layout is sketched after this list):

login.sh
deploy.sh
start_loadgen.sh
start_demo.sh
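
A minimal top-level layout, assuming each shell script above is mirrored by a playbook of the same name (the file names here are placeholders, not the project's actual files):

```yaml
# site.yml -- hypothetical top-level playbook; one imported playbook
# per script listed above
- import_playbook: login.yml     # password-less access setup (login.sh)
- import_playbook: deploy.yml    # endpoint deployment (deploy.sh)
- import_playbook: loadgen.yml   # workload generation (start_loadgen.sh)
- import_playbook: demo.yml      # demo UI (start_demo.sh)
```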

Install dependencies on all machines/nodes:

  1. Set up password-less SSH access.
  2. Install Docker on each machine (identify the Docker images and the inference server); a playbook sketch for this step follows the list.
  3. Run the deployment script.
  4. Run the loadgen script.
  5. Run the demo script.
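
A sketch of the Docker-install step as a playbook, assuming Debian/Ubuntu nodes and the distro's `docker.io` package (adjust for the actual OS and Docker repository):

```yaml
# install_docker.yml -- sketch; package name and OS family are assumptions
- hosts: all
  become: true
  tasks:
    - name: Install the Docker engine from the distro repository
      ansible.builtin.apt:
        name: docker.io
        state: present
        update_cache: true

    - name: Ensure the Docker daemon is running and enabled at boot
      ansible.builtin.service:
        name: docker
        state: started
        enabled: true
```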

Entry-point:

Start.sh (Admin Node-1) (CPU only) (Ansible)

  • NodeList: ["Node-1", "Node-2", "Node-3", "Node-4", "Node-5"]
  • Set up password-less access to each node (inventory and login-play sketches follow this list)
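
One way to express the NodeList and the password-less access step, assuming an existing keypair on the admin node (host names and the key path are placeholders; the first run needs `--ask-pass`, since keys have not been pushed yet):

```yaml
# inventory.yml -- placeholder hosts matching the NodeList
all:
  hosts:
    node-1:
    node-2:
    node-3:
    node-4:
    node-5:

# login.yml -- push the admin node's public key to every node
- hosts: all
  tasks:
    - name: Authorize the admin key for password-less SSH
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        state: present
        key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
```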

1. Deployment Script.sh (scale replicas to reach the expected throughput, latency, or other target metric; a per-node Docker Compose sketch follows this list)

  • Node-1 (CPU only)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
  • Node-2 (CPU + GPU)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
  • Node-3 (CPU + FPGA)

    • Deploy Llama3.2:1b --> API Endpoint
    • Deploy ViT --> API Endpoint
    • Deploy EW --> API Endpoint
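
A per-node Docker Compose sketch consistent with the "Docker-Compose with HA-Proxy" notes under Node Details; the image names, port, and replica count are assumptions, not the project's actual values:

```yaml
# docker-compose.yml -- sketch of one node's three endpoints behind HAProxy
services:
  llama:
    image: ollama/ollama          # assumed serving image for Llama3.2:1b
    deploy:
      replicas: 2                 # raise to hit the expected throughput/latency
  vit:
    image: vit-server:latest      # placeholder ViT inference image
  ew:
    image: ew-server:latest       # placeholder EW inference image
  haproxy:
    image: haproxy:2.9
    ports:
      - "8080:8080"               # single load-balanced API endpoint per node
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
```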

2. Run Workload Gen.sh

  • Node-4 (CPU only)
    • Start EchoSwift --> Llama3.2:1b Endpoint
      • (Run the benchmark to discover the optimal user count for the given SUT, then scale the number of replicas to reach the expected QPS/throughput; a loadgen play sketch follows this list)
    • Start ViT LoadGen --> ViT Endpoint
    • Start EW --> EW Endpoint
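
A sketch of the loadgen step as a play against Node-4; the exact EchoSwift command line and config path below are assumptions, so check the EchoSwift documentation for the real invocation:

```yaml
# loadgen.yml -- sketch; the echoswift invocation is an assumption
- hosts: node-4
  tasks:
    # Sweep user counts against the Llama3.2:1b endpoint to find the optimum
    - name: Run EchoSwift against the Llama3.2:1b endpoint
      ansible.builtin.shell: echoswift start --config /opt/echoswift/config.yaml
```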

3. Demo.sh (Node-1)

  • a) Llama-UI (chat front-end for the Llama3.2:1b endpoint)

    • Chat-UI
    • Metrics:
      1. Latency
      2. Throughput
      3. TTFT (time to first token)
  • b) ViT-UI

    • Classification-UI
    • Metrics:
      1. Latency
      2. Throughput (Samples/Second)
  • c) EW-UI (TBD)

  • d) Run the benchmark to discover the optimal user count (concurrent/parallel requests) and the number of replicas required for the given SUT, then scale the replica count to meet the expected QPS/throughput (a scaling sketch follows this section).
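
For step (d), a minimal scaling sketch: once the benchmark yields the replica count needed for the target QPS, the replicas can be scaled in place (service name, compose directory, and default count are assumptions):

```yaml
# scale.yml -- sketch; pass -e replicas=N with the count the benchmark suggests
- hosts: all
  tasks:
    - name: Scale the Llama service to the benchmark-derived replica count
      ansible.builtin.command: docker compose up -d --scale llama={{ replicas | default(2) }}
      args:
        chdir: /opt/stack          # assumed compose project directory
```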

Node Details:

  • Node-1 Admin (Deployment Scripts/Load Generator Scripts/Demo Scripts) (Main Entry-point)
  • Node-2 (Llama3.2:1b, ViT, EW Running on CPU only) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-3 (Llama3.2:1b, ViT, EW served from the CPU host, with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-4 (Llama3.2:1b, ViT, EW served from the CPU host, with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
  • Node-5 (CPU only) (Demo) (UI App) (3 APIs)

Misc. Info:

  • Ansible Playbook