# Unit Testing Workflow

## System Architecture: RAG-Ops with Homegrown MCP Server Evaluation
This diagram illustrates a structured approach to continuous delivery for a retrieval-augmented generation (RAG) system, in which the LLM's context provider (the homegrown MCP server) is the primary unit under test and evaluation.
The system is structured as four primary stages:
### 1. Knowledge Ingestion and Model Under Test (MUT) Build
This stage is the foundation. It prepares the data and wraps it in a testable interface.
#### A/B Ingestion Pipeline
Documentation is extracted, converted to PDFs, and then embedded into a fresh Vector Store.
#### A/B Vector Store
This database contains the embedded knowledge base and is the target of any documentation improvements.
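A minimal sketch of this ingestion step, assuming hypothetical `embed`, `extract_text`, and `VectorStore` helpers (the wiki does not name the embedding model, PDF extractor, or vector store in use):

```python
from pathlib import Path

# Hypothetical helpers: the wiki does not name the embedding model,
# PDF extractor, or vector store, so these imports are placeholders.
from my_rag.embeddings import embed          # text -> list[float]
from my_rag.pdf import extract_text          # PDF path -> str
from my_rag.vectorstore import VectorStore   # thin add()/search() wrapper

def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Naive fixed-width chunking; swap in your real splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_dir: Path) -> VectorStore:
    """Build a *fresh* A/B vector store from the exported PDFs."""
    store = VectorStore.create_fresh("beta-knowledge-base")
    for pdf in sorted(doc_dir.glob("*.pdf")):
        for chunk in chunk_text(extract_text(pdf)):
            store.add(vector=embed(chunk),
                      payload={"source": pdf.name, "text": chunk})
    return store
```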
#### Homegrown MCP Server (Model Under Test)

This is the core Model Under Test (MUT): a localized, programmable interface (an agent-like layer) wrapping the Vector Store.
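One plausible shape for that wrapper, sketched with the official MCP Python SDK's `FastMCP` helper (an assumption; the wiki does not say how the homegrown server is built). The `embed` and `VectorStore` names continue the hypothetical ingestion interface above:

```python
from mcp.server.fastmcp import FastMCP

from my_rag.embeddings import embed          # hypothetical, as above
from my_rag.vectorstore import VectorStore

mcp = FastMCP("laser-measles-docs")          # server name is illustrative
store = VectorStore.open("beta-knowledge-base")

@mcp.tool()
def search_docs(query: str, k: int = 5) -> list[str]:
    """Return the k most relevant documentation chunks for a query."""
    return [hit.payload["text"] for hit in store.search(embed(query), k=k)]

if __name__ == "__main__":
    mcp.run()  # serves the tool to the LLM client over stdio
```

Exposing retrieval as an explicit tool is what makes the server programmable, and therefore testable as a unit.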
Key Insight: The fixing loop for failures points to:
- Correcting the source documentation
- Modifying the MCP Server code itself
### 2. Validation & Beta Evaluation in AKS
This stage defines the validation gate for any new MCP server build. All components run in an isolated AKS (Azure Kubernetes Service) beta environment that mirrors the production configuration.
#### Standardized Test Suite
- A set of 20 Unit Test Prompts
- Static across evaluations
- Serves as the Golden Source benchmark
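One way to keep the suite static and versioned is a checked-in fixture; the path, schema, and sample entry below are illustrative:

```python
import json
from pathlib import Path

GOLDEN_PROMPTS = Path("tests/golden_prompts.json")  # illustrative location

def load_suite() -> list[dict]:
    """Load the fixed benchmark; it must not change between evaluations."""
    suite = json.loads(GOLDEN_PROMPTS.read_text())
    assert len(suite) == 20, "Golden Source benchmark must stay at 20 prompts"
    return suite

# An entry might look like:
# {"id": "gp-01",
#  "prompt": "Write a script that runs a basic measles outbreak simulation.",
#  "checks": ["script completes", "case counts are non-negative"]}
```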
#### Execution Flow
Beta MCP Server → Retrieval → Generation → Generated Code → `test.py` execution
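A sketch of that flow as a harness, assuming a hypothetical `generate_code()` client that queries the LLM through the Beta MCP server; the generated code is written to disk and executed the way `test.py` would run it:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

from my_eval.client import generate_code  # hypothetical: prompt -> code str

def run_prompt(prompt: str, timeout: int = 120) -> subprocess.CompletedProcess:
    """Retrieval + generation happen behind generate_code(); execute the result."""
    code = generate_code(prompt)           # Beta MCP Server -> Generated Code
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "generated.py"
        script.write_text(code)
        return subprocess.run([sys.executable, str(script)],
                              capture_output=True, text=True, timeout=timeout)
```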
#### Evaluation Metrics
The framework validates:

- **Runs?**
  - Compilation / runtime integrity
- **Correct?**
  - Logical correctness
  - Functional correctness
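Continuing the harness above, a sketch of how these two checks could roll up into the score used at the decision gate (`check_output` is a hypothetical correctness oracle, e.g. the assertions run by `test.py`):

```python
import subprocess

from my_eval.oracle import check_output  # hypothetical: (case, stdout) -> bool

def evaluate(suite: list[dict]) -> float:
    """Fraction of golden prompts whose generated code runs AND is correct."""
    passed = 0
    for case in suite:
        try:
            result = run_prompt(case["prompt"])
        except subprocess.TimeoutExpired:
            continue                      # a hang fails the "Runs?" check
        runs = result.returncode == 0     # Runs? compilation / runtime integrity
        if runs and check_output(case, result.stdout):  # Correct?
            passed += 1
    return passed / len(suite)
```

Calling `evaluate(load_suite())` yields the fraction fed to the decision gate (e.g., 0.65).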
### 3. Metric Comparison and Decision Gate
This stage defines the promotion criteria.
#### Champion / Challenger Model
- Beta performance (e.g., a 65% pass rate) is calculated
- The result is compared directly against the Production baseline
#### Decision Rule
- ✅ Beta strictly beats the baseline → Promote
- ❌ Beta equal or worse → Reject
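As a sketch, the rule is a strict inequality, so a tie keeps production in place:

```python
def should_promote(beta_score: float, prod_baseline: float) -> bool:
    """Promote only on strict improvement; equal or worse is rejected."""
    return beta_score > prod_baseline

# e.g. should_promote(0.65, 0.60) -> True, but should_promote(0.65, 0.65) -> False
```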
### 4. Promotion or Feedback Loop
#### Promote to Production
If the decision gate is passed:
- Beta MCP Server → Production
- Corresponding Vector Store → Production
#### Feedback and Repair Loop
If the build is rejected or failures occur, the pipeline loops back to Stage 1, applying fixes to:
- 📄 Source Docs / PDFs
- ⚙️ MCP Server logic
## Key Architectural Insight
The primary artifact under test is not just the model, but **the entire retrieval + context system (MCP Server + Vector Store + Docs)**.
This aligns evaluation with real-world LLM behavior, where:
- Context quality = Model performance
- System design = Output quality
## Summary
| Stage | Purpose |
|---|---|
| 1. Ingestion & Build | Create MUT (Docs + Vector Store + MCP Server) |
| 2. Validation | Execute standardized test suite |
| 3. Decision Gate | Compare against production baseline |
| 4. Promote / Loop | Deploy or iterate |
## TL;DR
- Treat your RAG system as a testable artifact
- Use golden prompts + execution validation
- Compare end-to-end success rates
- Iterate on both:
  - Data (Docs)
  - Logic (MCP Server)