Updated Plan - KrArunT/InfobellIT-Gen-AI GitHub Wiki
Distributed Deployment and Benchmarking
Steps
Write Ansible playbooks to orchestrate the following scripts:
- login.sh
- deploy.sh
- start_loadgen.sh
- start_demo.sh
Install dependencies on all machines/nodes:
- Set up password-less SSH access
- Install Docker on each machine (identify the Docker images and the inference server)
- Run the deployment script.
- Run the loadgen script.
- Run the demo script.
Entry-point:
Start.sh (Admin Node-1) (CPU only) (Ansible)
- NodeList:
["Node-1", "Node-2", "Node-3", "Node-4", "Node-5"]
- Set up password-less SSH access to each node
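The entry-point script on the admin node could look like the sketch below. The hostnames mirror NodeList, but the remote user (`admin`) is a placeholder; `DRY_RUN=1` (the default) only prints the commands so the script can be reviewed before it touches any node.

```shell
#!/usr/bin/env bash
# Start.sh -- entry point on the admin node (Node-1).
set -euo pipefail

NODES=("Node-1" "Node-2" "Node-3" "Node-4" "Node-5")
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# One key on the admin node, pushed to every node for password-less SSH.
[ -f "$HOME/.ssh/id_ed25519" ] || run ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519"
for node in "${NODES[@]}"; do
  run ssh-copy-id "admin@${node}"   # "admin" is a placeholder user
done
```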
1. Deployment Script.sh (scale replicas to meet the expected throughput, latency, or other target metric)
- Node-1 (CPU only)
- Deploy Llama3.2:1b --> API Endpoint
- Deploy ViT --> API Endpoint
- Deploy EW --> API Endpoint
- Node-2 (CPU + GPU)
- Deploy Llama3.2:1b --> API Endpoint
- Deploy ViT --> API Endpoint
- Deploy EW --> API Endpoint
- Node-3 (CPU + FPGA)
- Deploy Llama3.2:1b --> API Endpoint
- Deploy ViT --> API Endpoint
- Deploy EW --> API Endpoint
2. Run Workload Gen.sh
- Node-4 (CPU only)
- Start EchoSwift --> Llama3.2:1b Endpoint
- (Run the benchmark to discover the optimal user count for the given SUT, then scale the number of replicas to reach the expected QPS/throughput)
- Start ViT LoadGen --> ViT Endpoint
- Start EW --> EW Endpoint
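The sizing rule behind step 2 can be sketched as follows: once the load sweep has found the peak per-replica throughput, compute how many replicas a target QPS needs. The 20% headroom figure is an assumption of this sketch, not taken from the plan:

```shell
# Replicas needed for a target QPS, keeping each replica at or
# below 80% of its measured peak throughput.
replicas_needed() {
  # $1 = target QPS, $2 = measured per-replica peak QPS
  awk -v t="$1" -v p="$2" 'BEGIN {
    c = t / (p * 0.8)                       # replicas at 80% utilisation
    print ((c == int(c)) ? c : int(c) + 1)  # round up to a whole replica
  }'
}

replicas_needed 50 12   # one replica peaks at 12 QPS -> prints 6
```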
3. Demo.sh (Node-1)
a) Llama3.2-UI
- Chat-UI
- Metrics:
- Latency
- Throughput
- TTFT
b) ViT-UI
- Classification-UI
- Metrics:
- Latency
- Throughput (Samples/Second)
c) EW-UI (TBD)
d) Run the benchmark to discover the optimal user count (concurrent/parallel requests) and the number of replicas required for the given SUT, then scale the replicas to reach the expected QPS/throughput.
Node Details:
- Node-1 Admin (Deployment Scripts/Load Generator Scripts/Demo Scripts) (Main Entry-point)
- Node-2 (Llama3.2:1b, ViT, EW Running on CPU only) (3 Endpoints) (Docker-Compose with HA-Proxy)
- Node-3 (Llama3.2:1b, ViT, EW with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
- Node-4 (Llama3.2:1b, ViT, EW with the AI workload offloaded to an accelerator) (3 Endpoints) (Docker-Compose with HA-Proxy)
- Node-5 (CPU only) (Demo) (UI App) (3 APIs)
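For the "Docker-Compose with HA-Proxy" setup named above, the proxy side could be configured roughly as below. The backend addresses are placeholders for a node's local inference-server replicas:

```
# haproxy.cfg sketch -- one endpoint fronting local replicas
# (addresses and ports are placeholders).
defaults
    mode http
    timeout connect 5s
    timeout client  60s
    timeout server  60s

frontend llama_api
    bind *:8080
    default_backend llama_replicas

backend llama_replicas
    balance roundrobin
    server r1 127.0.0.1:11434 check
    server r2 127.0.0.1:11435 check
```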
Misc. Info:
- Ansible Playbook