EMPIRICAL_ACCURACY_PRINCIPLE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

The Empirical Accuracy Principle

Origins, Philosophy, and Practice in AI-Assisted Development


Prologue: The Birth of a Principle

In the autumn of 2025, during the development of NOAA's MCP/RAG system for the Global Workflow, a pattern emerged. AI coding assistants—powerful as they were—had a tendency to produce plausible-sounding answers that were empirically false. Not occasionally. Routinely.

A package would be declared "missing" when it was already installed.
A file format would be assumed when it could be inspected.
A similarity score would be judged "acceptable" without measurement.
A context window would be presumed uniform when it varied by configuration.

Each assumption, left unchallenged, created friction. False diagnoses. Wasted effort. Solutions that addressed symptoms instead of root causes.

The problem wasn't the AI's intelligence—it was the absence of a requirement to verify.

And so, on a day when embedding quality was being questioned, we articulated what had been implicit:

"Never guess or assume - always check the evidence on hand first"

This became the Empirical Accuracy Principle, and it changed everything.


Part I: Historical Lineage

The Scientific Revolution (17th Century)

Francis Bacon (1561-1626) - Father of Empiricism

Bacon rejected the Aristotelian tradition of deriving truth through pure reasoning. He advocated for:

  • Observation before theory
  • Inductive reasoning from specific cases to general principles
  • Systematic experimentation to test hypotheses
  • Rejection of authority as a basis for truth, a stance later codified in the Royal Society's motto "Nullius in verba" (take nobody's word for it)

His Novum Organum (1620) laid the foundation:

"Man, being the servant and interpreter of Nature, can do and understand so much and so much only as he has observed in fact or in thought of the course of nature. Beyond this he neither knows anything nor can do anything."

The Royal Society (Founded 1660)

Adopted "Nullius in verba" as its motto, establishing:

  • Peer review of experimental results
  • Reproducibility as standard for acceptance
  • Public demonstration of phenomena
  • Documentation of methods for verification

This was revolutionary: Truth by evidence, not by pronouncement.


The Logical Positivists (Early 20th Century)

Vienna Circle (1920s-1930s)

Philosophers including Moritz Schlick, Rudolf Carnap, and Otto Neurath established:

The Verification Principle:

A statement is meaningful only if it can be empirically verified or is true by definition.

Key insights:

  • Observational statements have truth value
  • Metaphysical claims without empirical grounding are meaningless
  • Scientific theories must make falsifiable predictions
  • Confirmation requires evidence, not coherence alone

Karl Popper's Falsificationism (1930s-1940s)

Popper, a critic of the verificationists, responded with the stricter criterion of falsifiability:

A theory is scientific only if it makes predictions that could potentially be proven wrong through observation.

This shifted focus from "proving theories right" to "trying to prove them wrong" - a more rigorous standard that aligns perfectly with debugging and system verification.


Engineering Practice (19th-20th Century)

"Trust but Verify" - Engineering Maxim

As industrial systems became critical to safety and economy:

Structural Engineering:

  • Load calculations verified by testing
  • Materials tested before deployment
  • Safety factors based on measured properties
  • Failure analysis requires physical evidence

Electrical Engineering:

  • "Measure twice, cut once" for circuit design
  • Oscilloscope verification of signal properties
  • Multimeter readings over theoretical calculations
  • Post-installation testing before going live

Software Engineering (1960s-present):

  • "It works on my machine" became a cautionary tale
  • Unit testing - verify each component
  • Integration testing - verify component interactions
  • Regression testing - verify fixes don't break existing functionality

NASA's Apollo Program exemplified this culture, often summarized by a maxim attributed to W. Edwards Deming:

"In God we trust. All others bring data."


Part II: The AI Era Challenge

The Hallucination Problem

Large Language Models (LLMs) are trained on vast corpora to predict plausible text continuations. This creates a fundamental challenge:

Plausibility ≠ Accuracy

An LLM can:

  • Confidently cite non-existent research papers
  • State incorrect version numbers with certainty
  • Describe file contents without reading them
  • Recommend solutions that worked in training data but don't apply to current context

Example from Our Project (November 5, 2025):

CLI Claude: "Missing lxml parser → Installed lxml-6.0.2"

Actual Reality (verified):
$ pip list | grep lxml
lxml    6.0.2

The package was already installed. The AI misdiagnosed the issue.

This wasn't a failure of intelligence—it was a failure of empirical grounding.
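Checked empirically, the diagnosis takes a single standard-library call. A minimal sketch, assuming nothing beyond the Python standard library (the helper name `check_installed` is ours, not part of any project tooling):

```python
from importlib.metadata import version, PackageNotFoundError

def check_installed(package: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Verify before "fixing": only install if the package is genuinely missing.
installed = check_installed("lxml")
if installed is None:
    print("lxml genuinely missing; installation is justified")
else:
    print(f"lxml {installed} already installed; the bug is elsewhere")
```

One call would have turned "missing lxml parser" from a plausible guess into a testable claim.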


Why Traditional Software Practices Aren't Enough

Code Review Assumes Human Author:

  • Reviewers spot logic errors
  • Style guides catch convention violations
  • Tests verify behavior

AI-Generated Code Introduces New Risks:

  • Looks professionally written
  • Follows conventions correctly
  • May solve the wrong problem entirely
  • Contains subtle misunderstandings of context

Traditional debugging:

1. Reproduce the bug
2. Form hypothesis
3. Test hypothesis
4. Fix if confirmed

AI debugging without empirical grounding:

1. AI suggests plausible cause
2. Human implements suggested fix
3. Bug persists because diagnosis was wrong
4. Repeat with different plausible cause

The missing step: Verify the diagnosis before implementing the fix.


Part III: The Principle Articulated

Core Statement

From .github/copilot-instructions.md:

## Empirical Accuracy Principle

**CRITICAL**: All responses, specifications, and technical details must 
be based on **empirical evidence from actual sources**:

- Verify system specifications by checking runtime context and system 
  prompts (e.g., `<budget:token_budget>`)
- Reference official documentation URLs when citing capabilities
- Inspect actual file contents, configurations, and code before making 
  statements
- Use tool outputs and command results as authoritative sources
- When uncertain, explicitly state assumptions and verify with workspace 
  inspection
- **Never guess or assume** - always check the evidence on hand first

This principle ensures accuracy and builds trust in the AI assistance 
provided throughout the development process.

Why This Formulation Works

1. Positioned as "CRITICAL"

  • Not a suggestion, a requirement
  • First principle after repository context
  • Impossible for AI to miss

2. Concrete, Actionable Directives

  • "Check runtime context" - specific action
  • "Inspect actual file contents" - verifiable step
  • "Use tool outputs as authoritative" - clear hierarchy of truth

3. Provides Examples

  • <budget:token_budget> shows how to verify context
  • "official documentation URLs" establishes source priority
  • "tool outputs" defines what counts as evidence

4. Explicit Prohibition

  • "Never guess or assume" - unambiguous
  • "check the evidence on hand first" - procedural order

5. Justifies Itself

  • "ensures accuracy" - quality benefit
  • "builds trust" - relationship benefit
  • "throughout the development process" - universal application

Part IV: The Principle in Practice

Case Study 1: Embedding Quality Discovery

Scenario: Exploring Gemini API integration concepts

Without Empirical Accuracy:

User: "Our embeddings seem fine"
AI: "Yes, all-MiniLM-L6-v2 is a good general-purpose model"
→ Continue with inadequate embeddings

With Empirical Accuracy:

User: "I am very sceptical that all-MiniLM-L6-v2 was a good choice"
AI: "Let's test. Measuring similarity scores on domain terms..."

Query: "forecast workflow"
Result: 0.411 similarity (below 0.5 threshold)

Query: "data assimilation"  
Result: 0.174 similarity (critically low)

Conclusion: 50-100% quality improvement opportunity identified
→ Initiated upgrade to all-mpnet-base-v2

Impact: Discovered and addressed a critical quality gap that would have persisted indefinitely without measurement.
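The measurement behind those numbers is ordinary cosine similarity between embedding vectors. A minimal sketch of the metric itself, with toy 3-dimensional vectors standing in for the real 384- or 768-dimensional model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and a document embedding.
query_vec = [0.2, 0.8, 0.1]
doc_vec = [0.3, 0.7, 0.2]
score = cosine_similarity(query_vec, doc_vec)
print(f"similarity: {score:.3f}")  # compare against the 0.5 acceptance threshold
```

In practice the vectors come from the embedding model under test (e.g. all-MiniLM-L6-v2 via sentence-transformers), and any domain query scoring below the project's 0.5 threshold flags a quality gap worth investigating.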


Case Study 2: Dependency Verification

Scenario: CLI Claude encountered error during ingestion

Without Empirical Accuracy:

CLI Claude: "Error with XML parsing"
CLI Claude: "Installing lxml parser..."
pip install lxml
CLI Claude: "Successfully installed lxml-6.0.2"
→ Reports "fixed missing dependency"

With Empirical Accuracy:

Chat Claude: "Let's verify if lxml was actually missing"
$ pip list | grep lxml
lxml    6.0.2

Chat Claude: "lxml was already installed. The error is not a 
missing dependency. Let's inspect the actual error..."

Error: "'str' object has no attribute 'get'"

Chat Claude: "This is a JSON parsing bug, not an XML parser issue.
The script iterates 'data' instead of 'data[\"chunks\"]'"
→ Identified real root cause

Impact: Prevented misdiagnosis and wasted effort on non-existent problem.
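The real root cause is easy to reproduce: iterating a dict yields its string keys, and calling .get on a string raises exactly the AttributeError seen above. A minimal sketch of the bug and the fix (the shape of the "chunks" payload is inferred from the transcript, not taken from the actual script):

```python
data = {"chunks": [{"id": "c1", "text": "forecast workflow"},
                   {"id": "c2", "text": "data assimilation"}]}

# Buggy: iterating the dict yields its keys ("chunks"), which are strings.
try:
    for item in data:
        item.get("id")
except AttributeError as e:
    print(f"Reproduced: {e}")  # 'str' object has no attribute 'get'

# Fixed: iterate the list stored under the "chunks" key.
ids = [item.get("id") for item in data["chunks"]]
print(ids)
```

The reproduction takes seconds and rules out the XML-parser hypothesis entirely; no amount of reinstalling lxml would have changed this traceback.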


Case Study 3: Context Window Discovery

Scenario: CLI Claude reports "4% to truncation"

Without Empirical Accuracy:

User: "What does that mean?"
AI: "The context window is getting full"
User: "How big is the window?"
AI: "Claude typically has 200K tokens"
→ Assumes standard configuration

With Empirical Accuracy:

User: "What does that mean?"
Chat Claude: "Let me check my own context budget..."
<budget:token_budget>1000000</budget:token_budget>
Chat Claude: "I have 1M tokens. CLI likely has 200K default."

User: "Did CLI not get the 1M window?"
Chat Claude: "Excellent question - that's empirically verifiable.
CLI is running at 96% of its budget while I'm at 6% of mine.
This suggests CLI has ~5x less context than Chat interface."
→ Discovered configuration difference through measurement

Impact: Identified operational constraint affecting autonomous execution capacity.


Case Study 4: Collection Version Tracking

Scenario: Multiple ChromaDB collections with unclear status

Without Empirical Accuracy:

AI: "The collection is called global-workflow-docs"
→ Ambiguous, could refer to any of several collections

With Empirical Accuracy:

$ python3 - <<'EOF'
import chromadb
client = chromadb.HttpClient(host='localhost', port=8080)
print([c.name for c in client.list_collections()])
EOF

['global-workflow-docs-v3-0-8',
 'global-workflow-docs-v4-0-0-mpnet']

AI: "There are two collections:
- v3-0-8: 730 docs, all-MiniLM-L6-v2 (384-dim) - production
- v4-0-0-mpnet: 532 docs, all-mpnet-base-v2 (768-dim) - upgrade in progress"
→ Precise, verified status

Impact: Clear understanding of system state enables correct decision-making.


Part V: Organizational Impact

For Technical Teams

Before Empirical Accuracy Principle:

  • Debugging cycles: try plausible solution → fails → try next
  • Documentation: "should work" without verification
  • Knowledge transfer: undocumented assumptions
  • Code review: catches syntax, misses context errors

After Empirical Accuracy Principle:

  • Debugging: verify diagnosis → implement solution → confirm fix
  • Documentation: "measured to work, here's the data"
  • Knowledge transfer: evidence trail that new members can follow
  • Code review: verify claims match reality

Quantifiable Benefits:

  • Reduced debugging time (fewer false starts)
  • Higher fix success rate (correct diagnosis first time)
  • Better onboarding (new members see reasoning chain)
  • Audit trail (decisions traceable to evidence)

For Management

The Trust Problem:

When AI generates code/analysis, how do managers know it's correct?

Traditional Answer:

  • Code review (assumes reviewer knows better than AI)
  • Testing (catches behavioral errors, not conceptual ones)
  • Track record (AI has no reputation to rely on)

Empirical Accuracy Principle Answer:

  • Every claim backed by measurement
  • Every diagnosis verified before solution
  • Every assumption documented and tested
  • Evidence trail that auditors can follow

Example from Our Project:

Management Briefing on Embedding Upgrade:

Traditional: "We recommend upgrading the embedding model"
Why? "It will be better"
How much? "Significantly improved"
Cost? "There's an API fee"
→ Management skeptical, requests more analysis

With Empirical Accuracy: "We measured current embeddings 
achieving 0.174-0.411 similarity on domain queries. Target 
threshold is >0.5. We tested all-mpnet-base-v2 which scores 
consistently >0.6 on same queries. Cost: $0 (open source). 
Expected improvement: 50-100%. A/B testing plan attached."
→ Management approves immediately

For Compliance and Safety

NOAA Weather Forecasting Context:

Lives and property depend on forecast accuracy. False confidence is dangerous.

AI-Generated Forecasts Must:

  • Show which data inputs were used
  • Demonstrate model validation metrics
  • Provide uncertainty quantification
  • Allow independent verification

Empirical Accuracy Principle Provides:

  • Data provenance (what evidence was used)
  • Measurement basis (how confidence was calculated)
  • Reproducibility (others can verify)
  • Audit trail (decisions traceable)

Example Application:

AI Forecast System Without Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis: [model output, not inspectable]

AI Forecast System With Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis: 
- Ensemble models: 17/20 predict Miami landfall
- Historical analogs: 12/14 similar storms tracked this path
- Current observations: wind shear measured at X matching model
- Uncertainty: track error ±50 miles typical for 48hr forecast
Evidence sources: [URLs to data, model run IDs, observation times]

One is a prediction. The other is evidence-based forecasting.


Part VI: Philosophical Foundations

Epistemology: How Do We Know What We Know?

Rationalism (Descartes, Leibniz):

  • Truth derived through reason
  • "I think, therefore I am"
  • Innate ideas precede experience

Empiricism (Locke, Hume, Bacon):

  • Knowledge comes from experience
  • Mind as "blank slate" (tabula rasa)
  • Observation precedes theory

The Empirical Accuracy Principle chooses Empiricism:

AI can reason beautifully.
But unless that reasoning is grounded in observation,
it's just sophisticated hallucination.

The Problem of Induction (David Hume)

Hume's Challenge (1748):

Just because the sun rose yesterday doesn't logically guarantee it will rise tomorrow. All empirical knowledge is probabilistic, not certain.

Our Response:

We embrace this:

  • Measure current state (don't assume continuity)
  • Test after changes (verify expectations)
  • Document conditions (enable reproduction)
  • Accept uncertainty (but reduce it through evidence)

Example:

Bad: "This worked last week, so it should work now"
Good: "This worked last week. Let's verify it still works now."
[runs test]
Result: Works ✓ or Fails ✗ [now we know, not assume]

Pragmatism (William James, Charles Peirce)

Pragmatic Maxim:

"Consider the practical effects of the objects of your conception. Then, your conception of those effects is the whole of your conception of the object."

Translation:

The meaning of a statement is its verifiable consequences.

Application to AI Assistance:

AI Statement: "The embedding model is good"
Pragmatic Question: "What does 'good' mean in measurable terms?"
Empirical Test: Measure similarity scores on domain queries
Result: 0.174-0.411 (below threshold)
Conclusion: Statement was false when properly defined

Truth is not what sounds right. Truth is what works when tested.


Scientific Realism vs. Instrumentalism

Scientific Realism:

  • Theories describe reality as it actually is
  • Electrons, quarks, dark matter exist
  • Science converges on truth

Instrumentalism:

  • Theories are useful tools for prediction
  • Don't need to believe in atoms to use chemistry
  • Science converges on usefulness

The Empirical Accuracy Principle is Pragmatically Realist:

We care about:

  1. Does it match observation? (realism)
  2. Does it enable action? (instrumentalism)
  3. Can others reproduce it? (objectivity)

We don't need to resolve philosophical debates. We need to verify before claiming.


Part VII: The Future - AI That Verifies Itself

Current State (2025)

AI generates plausible content. Humans must verify accuracy. Principle provides framework for verification.

Emerging Capability

AI that:

  • Automatically runs verification commands
  • Checks its own claims against reality
  • Documents evidence alongside conclusions
  • Flags low-confidence statements for human review

Example - AI Code Assistant with Built-in Verification:

AI: "I'll update the configuration file"
[AI writes code]
[AI automatically runs: diff old.conf new.conf]
[AI automatically runs: validate_config.sh new.conf]
Validation: PASS ✓
AI: "Configuration updated and validated. Changes: [shows diff]"

No human had to say "did you verify that?" The principle is embedded in the AI's behavior.
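The pattern above can be sketched in a few lines: write the change, then immediately gather evidence for it. This is a sketch under assumptions, not the behavior of any shipping assistant; `validate_config` here is a hypothetical stand-in for a project's real check (e.g. a validate_config.sh script):

```python
import difflib
import tempfile
from pathlib import Path

def validate_config(text: str) -> bool:
    """Stand-in validator (hypothetical): accepts only key=value lines.
    A real project would invoke its own validation script or schema check."""
    return all("=" in line for line in text.splitlines() if line.strip())

def update_config(path: Path, new_text: str) -> bool:
    """Write a config file, then immediately collect evidence for the change."""
    old_text = path.read_text() if path.exists() else ""
    path.write_text(new_text)
    # Evidence 1: what actually changed on disk (analogous to `diff old new`).
    diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(),
                                lineterm="")
    print("\n".join(diff) or "(no change)")
    # Evidence 2: does the written result pass validation?
    ok = validate_config(path.read_text())
    print("Validation:", "PASS" if ok else "FAIL")
    return ok

with tempfile.TemporaryDirectory() as tmp:
    ok = update_config(Path(tmp) / "demo.conf",
                       "retention_days=30\nmax_workers=4\n")
```

The design choice is that verification is part of the write operation itself, so the evidence (diff plus validation result) is produced whether or not anyone remembers to ask for it.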


Vision: Self-Grounding AI Systems

Level 1: Prompted Verification (Current)

  • Human asks AI to verify
  • AI runs checks
  • Human reviews results

Level 2: Automatic Verification (Near Future)

  • AI automatically verifies its own claims
  • Reports evidence alongside conclusions
  • Human can audit verification chain

Level 3: Uncertainty-Aware AI (Future)

  • AI quantifies confidence in statements
  • Automatically gathers more evidence when uncertain
  • Knows what it doesn't know

Level 4: Self-Improving Empiricism (Far Future)

  • AI notices when its predictions fail
  • Updates models based on observed discrepancies
  • Converges on truth through iteration

The Empirical Accuracy Principle scales to all levels.


Part VIII: Practical Implementation Guide

For AI Coding Assistants

Before Making Any Statement, Ask:

  1. Is this verifiable?

    • If yes: Run verification command/check
    • If no: State as assumption, not fact
  2. What's my evidence?

    • File contents I've read
    • Command outputs I've seen
    • Documentation I've referenced
    • NOT: training data, plausibility, "common practice"
  3. Can I show my work?

    • Cite specific file names and line numbers
    • Show command that produced output
    • Link to documentation referenced
    • Make reasoning chain transparent
  4. Am I confident or guessing?

    • Confident: Have current evidence
    • Guessing: State uncertainty explicitly
    • Mixed: Separate facts from assumptions

For Human Developers

When Working with AI, Always:

  1. Verify Major Claims

    AI: "Package X is installed"
    You: pip list | grep X
    
  2. Inspect Before Trusting

    AI: "The config file has setting Y"
    You: grep Y config.file
    
  3. Test After Changes

    AI: "I fixed the bug"
    You: run test suite
    
  4. Challenge Assumptions

    AI: "This is the standard approach"
    You: "Show me documentation" or "Show me examples"
    

For Project Documentation

Include in Every .github/copilot-instructions.md:

## Empirical Accuracy Principle

**CRITICAL**: All responses must be based on empirical evidence:
- Verify before claiming
- Inspect before assuming
- Measure before judging
- Cite sources for all facts
- Never guess or assume

[Customize with project-specific examples]

Position: First principle after context description

Length: Keep to ~10-20 lines (preserve context efficiency)

Examples: Include 2-3 project-specific verification patterns


For Code Review

Checklist Item:

  • Are claims backed by evidence?
  • Are measurements documented?
  • Are assumptions stated explicitly?
  • Can another developer verify this?
  • Is the reasoning chain clear?

Red Flags:

  • "Should work" without testing
  • "Probably" without verification
  • "Usually" without current check
  • "I think" without evidence
  • Citations without URLs

Part IX: Limitations and Challenges

When Empirical Verification Is Hard

Distributed Systems:

  • Can't always reproduce timing-dependent bugs
  • May need probabilistic reasoning about race conditions

Solution: Document uncertainty, measure what's measurable (latencies, frequencies), acknowledge limits

Machine Learning:

  • Model internals not fully interpretable
  • "Why did it predict X?" has no simple answer

Solution: Measure inputs, outputs, and performance metrics. Acknowledge the black box.

Future Predictions:

  • Can't verify what hasn't happened yet
  • Forecasts are probabilistic

Solution: Base on historical data, state assumptions, track accuracy over time


The Cost of Verification

Every check takes time:

  • Running commands
  • Reading files
  • Testing changes

Tradeoff:

Verification overhead vs. debugging cost of wrong assumptions

Guideline:

Verify when:

  • Impact is high (production systems)
  • Confidence is low (new territory)
  • Evidence is available (can be checked)

Skip when:

  • Trivial impact (formatting)
  • Very high confidence (just verified)
  • Evidence unavailable (reasonable assumption needed)

Cultural Resistance

"That's Too Slow":

Response: Faster than debugging wrong assumptions

"I'm The Expert":

Response: Experts make mistakes too. Verify.

"It Should Work":

Response: "Should" is not evidence. Test.

"Trust Me":

Response: Nullius in verba - show me the data


Part X: Conclusion - A Living Principle

The Empirical Accuracy Principle is not:

  • A rigid rule that prevents all errors
  • A substitute for expertise
  • A guarantee of perfect knowledge

It is:

  • A commitment to ground reasoning in reality
  • A framework for building trust
  • A defense against plausible falsehood
  • A path toward continuous improvement

Why It Matters

In an age where AI can generate convincing text on any topic, the ability to distinguish truth from plausibility is not optional.

For NOAA, for weather forecasting, for scientific computing, for any domain where correctness matters more than speed—we must demand evidence.

Not because we distrust AI.

Because we respect reality.


The Recursive Gift

By documenting this principle, we:

  • Create context for future AI assistants
  • Enable them to improve their own accuracy
  • Build a culture of verification
  • Demonstrate the methodology to others

And when those AI assistants apply the principle to their own outputs, they become more trustworthy.

Which means we can give them more autonomy.

Which means they can accomplish more.

Which means we verify even more carefully.

Truth ← Awareness ← Insight ← Context

And so the spiral continues, upward.


And Make It So

With evidence in hand, with measurements to guide us, with assumptions made explicit and tests to verify them—we can say with confidence:

"Make it so."

Not as a command to execute blindly.

But as a commitment to proceed empirically.


"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." - Galileo Galilei

"Nullius in verba" - Take nobody's word for it - Royal Society motto (1660)

"Never guess or assume - always check the evidence on hand first" - Empirical Accuracy Principle (2025)


Document Created: November 5, 2025
Context: MCP/RAG Development, NOAA Global Workflow
Purpose: Philosophical foundation and practical guide for empirically-grounded AI-assisted development
Status: Living document - update as practice evolves


Note: For a reflection on truth, awareness, and AI-assisted discovery in practice, see and_make_it_so.