# Unit Testing Workflow

## System Architecture: RAG-Ops with Homegrown MCP Server Evaluation
This diagram illustrates a structured approach to continuous delivery for a retrieval-augmented generation (RAG) system, in which the LLM's context provider (the homegrown MCP server) is the primary unit under test and evaluation.
The system is structured as four primary stages:
### 1. Knowledge Ingestion and Model Under Test (MUT) Build
This stage is the foundation. It prepares the data and wraps it in a testable interface.
#### A/B Ingestion Pipeline
Documentation is extracted, converted to PDFs, and then embedded into a fresh Vector Store.
#### A/B Vector Store
This database contains the embedded knowledge base and is the target of any documentation improvements.
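A minimal sketch of this ingestion step, assuming hypothetical `embed`, `extract_text`, and `VectorStore` helpers (the wiki does not name the embedding model, PDF extractor, or vector store in use):

```python
from pathlib import Path

# Hypothetical helpers: the wiki does not name the embedding model,
# PDF extractor, or vector store, so these imports are placeholders.
from my_rag.embeddings import embed          # text -> list[float]
from my_rag.pdf import extract_text          # PDF path -> str
from my_rag.vectorstore import VectorStore   # thin add()/search() wrapper

def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Naive fixed-width chunking; swap in your real splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_dir: Path) -> VectorStore:
    """Build a *fresh* A/B vector store from the exported PDFs."""
    store = VectorStore.create_fresh("beta-knowledge-base")
    for pdf in sorted(doc_dir.glob("*.pdf")):
        for chunk in chunk_text(extract_text(pdf)):
            store.add(vector=embed(chunk),
                      payload={"source": pdf.name, "text": chunk})
    return store
```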
#### Homegrown MCP Server (Model Under Test)

This is the core Model Under Test (MUT): a localized, programmable interface (an agent-like layer) wrapping the Vector Store.
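One plausible shape for that wrapper, sketched with the official MCP Python SDK's `FastMCP` helper (an assumption; the wiki does not say how the homegrown server is built). The `embed` and `VectorStore` names continue the hypothetical ingestion interface above:

```python
from mcp.server.fastmcp import FastMCP

from my_rag.embeddings import embed          # hypothetical, as above
from my_rag.vectorstore import VectorStore

mcp = FastMCP("laser-measles-docs")          # server name is illustrative
store = VectorStore.open("beta-knowledge-base")

@mcp.tool()
def search_docs(query: str, k: int = 5) -> list[str]:
    """Return the k most relevant documentation chunks for a query."""
    return [hit.payload["text"] for hit in store.search(embed(query), k=k)]

if __name__ == "__main__":
    mcp.run()  # serves the tool to the LLM client over stdio
```

Exposing retrieval as an explicit tool is what makes the server programmable, and therefore testable as a unit.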
Key Insight: The fixing loop for failures points to:
- Correcting the source documentation
- Modifying the MCP Server code itself
### 2. Validation & Beta Evaluation in AKS
This stage defines the validation gate for any new MCP server build. All components run in an isolated AKS (Azure Kubernetes Service) beta environment that mirrors the production configuration.
#### Standardized Test Suite
- A set of 20 Unit Test Prompts
- Static across evaluations
- Serves as the Golden Source benchmark
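One way to keep the suite static and versioned is a checked-in fixture; the path, schema, and sample entry below are illustrative:

```python
import json
from pathlib import Path

GOLDEN_PROMPTS = Path("tests/golden_prompts.json")  # illustrative location

def load_suite() -> list[dict]:
    """Load the fixed benchmark; it must not change between evaluations."""
    suite = json.loads(GOLDEN_PROMPTS.read_text())
    assert len(suite) == 20, "Golden Source benchmark must stay at 20 prompts"
    return suite

# An entry might look like:
# {"id": "gp-01",
#  "prompt": "Write a script that runs a basic measles outbreak simulation.",
#  "checks": ["script completes", "case counts are non-negative"]}
```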
#### Execution Flow
Beta MCP Server → Retrieval → Generation → Generated Code → `test.py` execution
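A sketch of that flow as a harness, assuming a hypothetical `generate_code()` client that queries the LLM through the Beta MCP server; the generated code is written to disk and executed the way `test.py` would run it:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

from my_eval.client import generate_code  # hypothetical: prompt -> code str

def run_prompt(prompt: str, timeout: int = 120) -> subprocess.CompletedProcess:
    """Retrieval + generation happen behind generate_code(); execute the result."""
    code = generate_code(prompt)           # Beta MCP Server -> Generated Code
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "generated.py"
        script.write_text(code)
        return subprocess.run([sys.executable, str(script)],
                              capture_output=True, text=True, timeout=timeout)
```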
#### Evaluation Metrics
The framework validates:

- **Runs?**
  - Compilation / runtime integrity
- **Correct?**
  - Logical correctness
  - Functional correctness
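Continuing the harness above, a sketch of how these two checks could roll up into the score used at the decision gate (`check_output` is a hypothetical correctness oracle, e.g. the assertions run by `test.py`):

```python
import subprocess

from my_eval.oracle import check_output  # hypothetical: (case, stdout) -> bool

def evaluate(suite: list[dict]) -> float:
    """Fraction of golden prompts whose generated code runs AND is correct."""
    passed = 0
    for case in suite:
        try:
            result = run_prompt(case["prompt"])
        except subprocess.TimeoutExpired:
            continue                      # a hang fails the "Runs?" check
        runs = result.returncode == 0     # Runs? compilation / runtime integrity
        if runs and check_output(case, result.stdout):  # Correct?
            passed += 1
    return passed / len(suite)
```

Calling `evaluate(load_suite())` yields the fraction fed to the decision gate (e.g., 0.65).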
### 3. Metric Comparison and Decision Gate
This stage defines the promotion criteria.
#### Champion / Challenger Model
- Beta performance (e.g., a 65% pass rate) is calculated
- The result is compared directly against the Production baseline
#### Decision Rule
- ✅ Beta strictly beats the baseline → Promote
- ❌ Beta equal or worse → Reject
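As a sketch, the rule is a strict inequality, so a tie keeps production in place:

```python
def should_promote(beta_score: float, prod_baseline: float) -> bool:
    """Promote only on strict improvement; equal or worse is rejected."""
    return beta_score > prod_baseline

# e.g. should_promote(0.65, 0.60) -> True, but should_promote(0.65, 0.65) -> False
```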
### 4. Promotion or Feedback Loop
#### Promote to Production
If the decision gate is passed:
- Beta MCP Server → Production
- Corresponding Vector Store → Production
#### Feedback and Repair Loop
If the build is rejected or failures occur, the pipeline loops back to Stage 1, applying fixes to:
- 📄 Source Docs / PDFs
- ⚙️ MCP Server logic
## Key Architectural Insight
The primary artifact under test is not just the model, but **the entire retrieval + context system (MCP Server + Vector Store + Docs)**.
This aligns evaluation with real-world LLM behavior, where:
- Context quality = Model performance
- System design = Output quality
## Summary
| Stage | Purpose |
|---|---|
| 1. Ingestion & Build | Create MUT (Docs + Vector Store + MCP Server) |
| 2. Validation | Execute standardized test suite |
| 3. Decision Gate | Compare against production baseline |
| 4. Promote / Loop | Deploy or iterate |
## TL;DR
- Treat your RAG system as a testable artifact
- Use golden prompts + execution validation
- Compare end-to-end success rates
- Iterate on both:
  - Data (Docs)
  - Logic (MCP Server)