Unit Testing Workflow - laser-base/laser-measles GitHub Wiki

System Architecture: RAG-Ops with Homegrown MCP Server Evaluation

This diagram illustrates a structured approach to continuous delivery for a retrieval-augmented generation (RAG) system, where the LLM's context (your homegrown MCP server) is the primary unit under test and evaluation.

The system is structured as four primary stages:


1. Knowledge Ingestion and Model-under-Test (MUT) Build

This stage is the foundation. It prepares the data and wraps it in a testable interface.

A/B Ingestion Pipeline

Documentation is extracted, converted to PDFs, and then embedded into a fresh Vector Store.
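The ingestion step above can be sketched as: chunk each document, embed each chunk, and load the vectors into a fresh in-memory store. This is a minimal illustration only; the `embed` stub and `VectorStore` class are stand-ins for the real embedding model and database, not the actual pipeline components.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding: hash the text into a unit vector.
    A real pipeline would call an embedding model here."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for the A/B Vector Store."""
    def __init__(self) -> None:
        self.records: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.records.append((embed(text), text))

def ingest(docs: list[str], chunk_size: int = 200) -> VectorStore:
    """Build a fresh store from raw documentation text."""
    store = VectorStore()
    for doc in docs:
        for start in range(0, len(doc), chunk_size):
            store.add(doc[start:start + chunk_size])
    return store
```

Building the store fresh on every run is what makes A/B comparison clean: each candidate store is derived only from its own documentation snapshot.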

A/B Vector Store

This database contains the embedded knowledge base and is the target of any documentation improvements.

Homegrown MCP Server (Model Under Test - MUT)

This is the core artifact under test: a local, programmable interface (similar to an agent) that wraps the Vector Store.
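A minimal sketch of the retrieval side of such a wrapper: given a query embedding, return the top-k closest stored chunks by cosine similarity. The function names and the `(embedding, chunk_text)` record layout are assumptions for illustration, not the actual MCP Server API.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec: list[float],
             records: list[tuple[list[float], str]],
             k: int = 3) -> list[str]:
    """Return the k chunk texts closest to the query embedding."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```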

Key Insight: when a failure occurs, the repair loop points to two possible fix targets:

  • Correcting the source documentation
  • Modifying the MCP Server code itself

2. Validation & Beta Evaluation in AKS

This section defines the validation gate for any new MCP server build.

All components run in an isolated AKS Beta environment to match production configuration.

Standardized Test Suite

  • A set of 20 Unit Test Prompts
  • Static across evaluations
  • Serves as the Golden Source benchmark
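The golden suite could be represented as simply as a fixed list of prompt records; keeping it constant across evaluations is what makes scores comparable between builds. The prompt texts and `ut-NN` ids below are placeholders, not the real 20-prompt suite.

```python
# Static golden-source benchmark: the same 20 prompts are replayed
# against every candidate build so pass rates are directly comparable.
GOLDEN_PROMPTS = [
    {"id": f"ut-{i:02d}", "prompt": f"placeholder prompt {i}"}
    for i in range(1, 21)
]
```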

Execution Flow

Beta MCP Server → Retrieval → Generation → Generated Code → test.py execution

Evaluation Metrics

The framework validates:

  • Runs?

    • Compilation / runtime integrity
  • Correct?

    • Logical correctness
    • Functional correctness
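The two checks above can be sketched as a pair of predicates applied to each generated snippet: "Runs?" asks whether the code executes at all, and "Correct?" applies a caller-supplied check to the resulting namespace. This is an illustrative shape only; in the real pipeline these checks are driven through `test.py`.

```python
def runs(code: str) -> bool:
    """Runs?: compilation / runtime integrity.
    Execute the snippet in a throwaway namespace and report success."""
    try:
        exec(compile(code, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

def correct(code: str, check) -> bool:
    """Correct?: functional correctness.
    Execute the snippet, then apply a check to what it defined."""
    namespace: dict = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
        return bool(check(namespace))
    except Exception:
        return False
```

Note that `correct` implies `runs`: a snippet that fails to execute cannot pass the functional check.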

3. Metric Comparison and Decision Gate

This stage defines the promotion criteria.

Champion / Challenger Model

  • Beta performance (e.g., 65%) is calculated
  • Compared directly against the Production baseline

Decision Rule

  • Yes → Promote
  • No / Equal → Reject
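The decision rule above reduces to a strict comparison: compute the beta pass rate over the golden suite and promote only on a strict improvement over the production baseline, with ties and regressions rejected. A minimal sketch, with illustrative function names:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of golden prompts whose generated code passed."""
    return sum(results) / len(results) if results else 0.0

def should_promote(beta_score: float, prod_score: float) -> bool:
    """Champion/challenger gate: strict improvement required;
    equal or worse scores are rejected."""
    return beta_score > prod_score
```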

4. Promotion or Feedback Loop

Promote to Production

If the decision gate is passed:

  • Beta MCP Server → Production
  • Corresponding Vector Store → Production

Feedback and Repair Loop

If rejected or failures occur:

The pipeline loops back to Stage 1, applying fixes to:

  • 📄 Source Docs / PDFs
  • ⚙️ MCP Server logic

Key Architectural Insight

The primary artifact under test is not just the model, but:

The entire retrieval + context system (MCP Server + Vector Store + Docs)

This aligns evaluation with real-world LLM behavior, where:

  • Context quality = Model performance
  • System design = Output quality

Summary

Stage                  Purpose
1. Ingestion & Build   Create the MUT (Docs + Vector Store + MCP Server)
2. Validation          Execute the standardized test suite
3. Decision Gate       Compare against the production baseline
4. Promote / Loop      Deploy or iterate

TL;DR

  • Treat your RAG system as a testable artifact
  • Use golden prompts + execution validation
  • Compare end-to-end success rates
  • Iterate on both:
    • Data (Docs)
    • Logic (MCP Server)