EMPIRICAL_ACCURACY_PRINCIPLE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

The Empirical Accuracy Principle

Origins, Philosophy, and Practice in AI-Assisted Development


Prologue: The Birth of a Principle

In the autumn of 2025, during the development of NOAA's MCP/RAG system for the Global Workflow, a pattern emerged. AI coding assistants—powerful as they were—had a tendency to produce plausible-sounding answers that were empirically false. Not occasionally. Routinely.

A package would be declared "missing" when it was already installed.
A file format would be assumed when it could be inspected.
A similarity score would be judged "acceptable" without measurement.
A context window would be presumed uniform when it varied by configuration.

Each assumption, left unchallenged, created friction. False diagnoses. Wasted effort. Solutions that addressed symptoms instead of root causes.

The problem wasn't the AI's intelligence—it was the absence of a requirement to verify.

And so, on a day when embedding quality was being questioned, we articulated what had been implicit:

"Never guess or assume - always check the evidence on hand first"

This became the Empirical Accuracy Principle, and it changed everything.


Part I: Historical Lineage

The Scientific Revolution (17th Century)

Francis Bacon (1561-1626) - Father of Empiricism

Bacon rejected the Aristotelian tradition of deriving truth through pure reasoning. He advocated for:

  • Observation before theory
  • Inductive reasoning from specific cases to general principles
  • Systematic experimentation to test hypotheses
  • Rejection of authority as a basis for truth, a stance later codified in the Royal Society's motto "Nullius in verba" (take nobody's word for it)

His Novum Organum (1620) laid the foundation:

"Man, being the servant and interpreter of Nature, can do and understand so much and so much only as he has observed in fact or in thought of the course of nature. Beyond this he neither knows anything nor can do anything."

The Royal Society (Founded 1660)

Adopted "Nullius in verba" as its motto, establishing:

  • Peer review of experimental results
  • Reproducibility as standard for acceptance
  • Public demonstration of phenomena
  • Documentation of methods for verification

This was revolutionary: Truth by evidence, not by pronouncement.


The Logical Positivists (Early 20th Century)

Vienna Circle (1920s-1930s)

Philosophers including Moritz Schlick, Rudolf Carnap, and Otto Neurath established:

The Verification Principle:

A statement is meaningful only if it can be empirically verified or is true by definition.

Key insights:

  • Observational statements have truth value
  • Metaphysical claims without empirical grounding are meaningless
  • Scientific theories must make falsifiable predictions
  • Confirmation requires evidence, not coherence alone

Karl Popper's Falsificationism (1930s-1940s)

Popper, a critic of the verificationists, responded with the stricter criterion of falsifiability:

A theory is scientific only if it makes predictions that could potentially be proven wrong through observation.

This shifted focus from "proving theories right" to "trying to prove them wrong" - a more rigorous standard that aligns perfectly with debugging and system verification.


Engineering Practice (19th-20th Century)

"Trust but Verify" - Engineering Maxim

As industrial systems became critical to safety and economy:

Structural Engineering:

  • Load calculations verified by testing
  • Materials tested before deployment
  • Safety factors based on measured properties
  • Failure analysis requires physical evidence

Electrical Engineering:

  • "Measure twice, cut once" for circuit design
  • Oscilloscope verification of signal properties
  • Multimeter readings over theoretical calculations
  • Post-installation testing before going live

Software Engineering (1960s-present):

  • "It works on my machine" became a cautionary tale
  • Unit testing - verify each component
  • Integration testing - verify component interactions
  • Regression testing - verify fixes don't break existing functionality

NASA's Apollo Program exemplified this culture, often summarized by a maxim attributed to W. Edwards Deming:

"In God we trust. All others bring data."


Part II: The AI Era Challenge

The Hallucination Problem

Large Language Models (LLMs) are trained on vast corpora to predict plausible text continuations. This creates a fundamental challenge:

Plausibility ≠ Accuracy

An LLM can:

  • Confidently cite non-existent research papers
  • State incorrect version numbers with certainty
  • Describe file contents without reading them
  • Recommend solutions that worked in training data but don't apply to current context

Example from Our Project (November 5, 2025):

CLI Claude: "Missing lxml parser → Installed lxml-6.0.2"

Actual Reality (verified):
$ pip list | grep lxml
lxml    6.0.2

The package was already installed. The AI misdiagnosed the issue.

This wasn't a failure of intelligence—it was a failure of empirical grounding.
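Checked empirically, the diagnosis takes a single standard-library call. A minimal sketch, assuming nothing beyond the Python standard library (the helper name `check_installed` is ours, not part of any project tooling):

```python
from importlib.metadata import version, PackageNotFoundError

def check_installed(package: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Verify before "fixing": only install if the package is genuinely missing.
installed = check_installed("lxml")
if installed is None:
    print("lxml genuinely missing; installation is justified")
else:
    print(f"lxml {installed} already installed; the bug is elsewhere")
```

One call would have turned "missing lxml parser" from a plausible guess into a testable claim.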


Why Traditional Software Practices Aren't Enough

Code Review Assumes Human Author:

  • Reviewers spot logic errors
  • Style guides catch convention violations
  • Tests verify behavior

AI-Generated Code Introduces New Risks:

  • Looks professionally written
  • Follows conventions correctly
  • May solve the wrong problem entirely
  • Contains subtle misunderstandings of context

Traditional debugging:

1. Reproduce the bug
2. Form hypothesis
3. Test hypothesis
4. Fix if confirmed

AI debugging without empirical grounding:

1. AI suggests plausible cause
2. Human implements suggested fix
3. Bug persists because diagnosis was wrong
4. Repeat with different plausible cause

The missing step: Verify the diagnosis before implementing the fix.


Part III: The Principle Articulated

Core Statement

From .github/copilot-instructions.md:

## Empirical Accuracy Principle

**CRITICAL**: All responses, specifications, and technical details must 
be based on **empirical evidence from actual sources**:

- Verify system specifications by checking runtime context and system 
  prompts (e.g., `<budget:token_budget>`)
- Reference official documentation URLs when citing capabilities
- Inspect actual file contents, configurations, and code before making 
  statements
- Use tool outputs and command results as authoritative sources
- When uncertain, explicitly state assumptions and verify with workspace 
  inspection
- **Never guess or assume** - always check the evidence on hand first

This principle ensures accuracy and builds trust in the AI assistance 
provided throughout the development process.

Why This Formulation Works

1. Positioned as "CRITICAL"

  • Not a suggestion, a requirement
  • First principle after repository context
  • Impossible for AI to miss

2. Concrete, Actionable Directives

  • "Check runtime context" - specific action
  • "Inspect actual file contents" - verifiable step
  • "Use tool outputs as authoritative" - clear hierarchy of truth

3. Provides Examples

  • <budget:token_budget> shows how to verify context
  • "official documentation URLs" establishes source priority
  • "tool outputs" defines what counts as evidence

4. Explicit Prohibition

  • "Never guess or assume" - unambiguous
  • "check the evidence on hand first" - procedural order

5. Justifies Itself

  • "ensures accuracy" - quality benefit
  • "builds trust" - relationship benefit
  • "throughout the development process" - universal application

Part IV: The Principle in Practice

Case Study 1: Embedding Quality Discovery

Scenario: Exploring Gemini API integration concepts

Without Empirical Accuracy:

User: "Our embeddings seem fine"
AI: "Yes, all-MiniLM-L6-v2 is a good general-purpose model"
→ Continue with inadequate embeddings

With Empirical Accuracy:

User: "I am very sceptical that all-MiniLM-L6-v2 was a good choice"
AI: "Let's test. Measuring similarity scores on domain terms..."

Query: "forecast workflow"
Result: 0.411 similarity (below 0.5 threshold)

Query: "data assimilation"  
Result: 0.174 similarity (critically low)

Conclusion: 50-100% quality improvement opportunity identified
→ Initiated upgrade to all-mpnet-base-v2

Impact: Discovered and addressed a critical quality gap that would have persisted indefinitely without measurement.
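The measurement behind those numbers is ordinary cosine similarity between embedding vectors. A minimal sketch of the metric itself, with toy 3-dimensional vectors standing in for the real 384- or 768-dimensional model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and a document embedding.
query_vec = [0.2, 0.8, 0.1]
doc_vec = [0.3, 0.7, 0.2]
score = cosine_similarity(query_vec, doc_vec)
print(f"similarity: {score:.3f}")  # compare against the 0.5 acceptance threshold
```

In practice the vectors come from the embedding model under test (e.g. all-MiniLM-L6-v2 via sentence-transformers), and any domain query scoring below the project's 0.5 threshold flags a quality gap worth investigating.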


Case Study 2: Dependency Verification

Scenario: CLI Claude encountered error during ingestion

Without Empirical Accuracy:

CLI Claude: "Error with XML parsing"
CLI Claude: "Installing lxml parser..."
pip install lxml
CLI Claude: "Successfully installed lxml-6.0.2"
→ Reports "fixed missing dependency"

With Empirical Accuracy:

Chat Claude: "Let's verify if lxml was actually missing"
$ pip list | grep lxml
lxml    6.0.2

Chat Claude: "lxml was already installed. The error is not a 
missing dependency. Let's inspect the actual error..."

Error: "'str' object has no attribute 'get'"

Chat Claude: "This is a JSON parsing bug, not an XML parser issue.
The script iterates 'data' instead of 'data[\"chunks\"]'"
→ Identified real root cause

Impact: Prevented misdiagnosis and wasted effort on non-existent problem.
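The real root cause is easy to reproduce: iterating a dict yields its string keys, and calling .get on a string raises exactly the AttributeError seen above. A minimal sketch of the bug and the fix (the shape of the "chunks" payload is inferred from the transcript, not taken from the actual script):

```python
data = {"chunks": [{"id": "c1", "text": "forecast workflow"},
                   {"id": "c2", "text": "data assimilation"}]}

# Buggy: iterating the dict yields its keys ("chunks"), which are strings.
try:
    for item in data:
        item.get("id")
except AttributeError as e:
    print(f"Reproduced: {e}")  # 'str' object has no attribute 'get'

# Fixed: iterate the list stored under the "chunks" key.
ids = [item.get("id") for item in data["chunks"]]
print(ids)
```

The reproduction takes seconds and rules out the XML-parser hypothesis entirely; no amount of reinstalling lxml would have changed this traceback.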


Case Study 3: Context Window Discovery

Scenario: CLI Claude reports "4% to truncation"

Without Empirical Accuracy:

User: "What does that mean?"
AI: "The context window is getting full"
User: "How big is the window?"
AI: "Claude typically has 200K tokens"
→ Assumes standard configuration

With Empirical Accuracy:

User: "What does that mean?"
Chat Claude: "Let me check my own context budget..."
<budget:token_budget>1000000</budget:token_budget>
Chat Claude: "I have 1M tokens. CLI likely has 200K default."

User: "Did CLI not get the 1M window?"
Chat Claude: "Excellent question - that's empirically verifiable.
CLI is running at 96% of its budget while I'm at 6% of mine.
This suggests CLI has ~5x less context than Chat interface."
→ Discovered configuration difference through measurement

Impact: Identified operational constraint affecting autonomous execution capacity.


Case Study 4: Collection Version Tracking

Scenario: Multiple ChromaDB collections with unclear status

Without Empirical Accuracy:

AI: "The collection is called global-workflow-docs"
→ Ambiguous, could refer to any of several collections

With Empirical Accuracy:

$ python3 - <<'EOF'
import chromadb
client = chromadb.HttpClient(host='localhost', port=8080)
print([c.name for c in client.list_collections()])
EOF

['global-workflow-docs-v3-0-8',
 'global-workflow-docs-v4-0-0-mpnet']

AI: "There are two collections:
- v3-0-8: 730 docs, all-MiniLM-L6-v2 (384-dim) - production
- v4-0-0-mpnet: 532 docs, all-mpnet-base-v2 (768-dim) - upgrade in progress"
→ Precise, verified status

Impact: Clear understanding of system state enables correct decision-making.


Part V: Organizational Impact

For Technical Teams

Before Empirical Accuracy Principle:

  • Debugging cycles: try plausible solution → fails → try next
  • Documentation: "should work" without verification
  • Knowledge transfer: undocumented assumptions
  • Code review: catches syntax, misses context errors

After Empirical Accuracy Principle:

  • Debugging: verify diagnosis → implement solution → confirm fix
  • Documentation: "measured to work, here's the data"
  • Knowledge transfer: evidence trail that new members can follow
  • Code review: verify claims match reality

Quantifiable Benefits:

  • Reduced debugging time (fewer false starts)
  • Higher fix success rate (correct diagnosis first time)
  • Better onboarding (new members see reasoning chain)
  • Audit trail (decisions traceable to evidence)

For Management

The Trust Problem:

When AI generates code/analysis, how do managers know it's correct?

Traditional Answer:

  • Code review (assumes reviewer knows better than AI)
  • Testing (catches behavioral errors, not conceptual ones)
  • Track record (AI has no reputation to rely on)

Empirical Accuracy Principle Answer:

  • Every claim backed by measurement
  • Every diagnosis verified before solution
  • Every assumption documented and tested
  • Evidence trail that auditors can follow

Example from Our Project:

Management Briefing on Embedding Upgrade:

Traditional: "We recommend upgrading the embedding model"
Why? "It will be better"
How much? "Significantly improved"
Cost? "There's an API fee"
→ Management skeptical, requests more analysis

With Empirical Accuracy: "We measured current embeddings 
achieving 0.174-0.411 similarity on domain queries. Target 
threshold is >0.5. We tested all-mpnet-base-v2 which scores 
consistently >0.6 on same queries. Cost: $0 (open source). 
Expected improvement: 50-100%. A/B testing plan attached."
→ Management approves immediately

For Compliance and Safety

NOAA Weather Forecasting Context:

Lives and property depend on forecast accuracy. False confidence is dangerous.

AI-Generated Forecasts Must:

  • Show which data inputs were used
  • Demonstrate model validation metrics
  • Provide uncertainty quantification
  • Allow independent verification

Empirical Accuracy Principle Provides:

  • Data provenance (what evidence was used)
  • Measurement basis (how confidence was calculated)
  • Reproducibility (others can verify)
  • Audit trail (decisions traceable)

Example Application:

AI Forecast System Without Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis: [model output, not inspectable]

AI Forecast System With Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis: 
- Ensemble models: 17/20 predict Miami landfall
- Historical analogs: 12/14 similar storms tracked this path
- Current observations: wind shear measured at X matching model
- Uncertainty: track error ±50 miles typical for 48hr forecast
Evidence sources: [URLs to data, model run IDs, observation times]

One is a prediction. The other is evidence-based forecasting.


Part VI: Philosophical Foundations

Epistemology: How Do We Know What We Know?

Rationalism (Descartes, Leibniz):

  • Truth derived through reason
  • "I think, therefore I am"
  • Innate ideas precede experience

Empiricism (Locke, Hume, Bacon):

  • Knowledge comes from experience
  • Mind as "blank slate" (tabula rasa)
  • Observation precedes theory

The Empirical Accuracy Principle chooses Empiricism:

AI can reason beautifully.
But unless that reasoning is grounded in observation,
it's just sophisticated hallucination.

The Problem of Induction (David Hume)

Hume's Challenge (1748):

Just because the sun rose yesterday doesn't logically guarantee it will rise tomorrow. All empirical knowledge is probabilistic, not certain.

Our Response:

We embrace this:

  • Measure current state (don't assume continuity)
  • Test after changes (verify expectations)
  • Document conditions (enable reproduction)
  • Accept uncertainty (but reduce it through evidence)

Example:

Bad: "This worked last week, so it should work now"
Good: "This worked last week. Let's verify it still works now."
[runs test]
Result: Works ✓ or Fails ✗ [now we know, not assume]

Pragmatism (William James, Charles Peirce)

Pragmatic Maxim:

"Consider the practical effects of the objects of your conception. Then, your conception of those effects is the whole of your conception of the object."

Translation:

The meaning of a statement is its verifiable consequences.

Application to AI Assistance:

AI Statement: "The embedding model is good"
Pragmatic Question: "What does 'good' mean in measurable terms?"
Empirical Test: Measure similarity scores on domain queries
Result: 0.174-0.411 (below threshold)
Conclusion: Statement was false when properly defined

Truth is not what sounds right. Truth is what works when tested.


Scientific Realism vs. Instrumentalism

Scientific Realism:

  • Theories describe reality as it actually is
  • Electrons, quarks, dark matter exist
  • Science converges on truth

Instrumentalism:

  • Theories are useful tools for prediction
  • Don't need to believe in atoms to use chemistry
  • Science converges on usefulness

The Empirical Accuracy Principle is Pragmatically Realist:

We care about:

  1. Does it match observation? (realism)
  2. Does it enable action? (instrumentalism)
  3. Can others reproduce it? (objectivity)

We don't need to resolve philosophical debates. We need to verify before claiming.


Part VII: The Future - AI That Verifies Itself

Current State (2025)

AI generates plausible content. Humans must verify accuracy. Principle provides framework for verification.

Emerging Capability

AI that:

  • Automatically runs verification commands
  • Checks its own claims against reality
  • Documents evidence alongside conclusions
  • Flags low-confidence statements for human review

Example - AI Code Assistant with Built-in Verification:

AI: "I'll update the configuration file"
[AI writes code]
[AI automatically runs: diff old.conf new.conf]
[AI automatically runs: validate_config.sh new.conf]
Validation: PASS ✓
AI: "Configuration updated and validated. Changes: [shows diff]"

No human had to say "did you verify that?" The principle is embedded in the AI's behavior.
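The pattern above can be sketched in a few lines: write the change, then immediately gather evidence for it. This is a sketch under assumptions, not the behavior of any shipping assistant; `validate_config` here is a hypothetical stand-in for a project's real check (e.g. a validate_config.sh script):

```python
import difflib
import tempfile
from pathlib import Path

def validate_config(text: str) -> bool:
    """Stand-in validator (hypothetical): accepts only key=value lines.
    A real project would invoke its own validation script or schema check."""
    return all("=" in line for line in text.splitlines() if line.strip())

def update_config(path: Path, new_text: str) -> bool:
    """Write a config file, then immediately collect evidence for the change."""
    old_text = path.read_text() if path.exists() else ""
    path.write_text(new_text)
    # Evidence 1: what actually changed on disk (analogous to `diff old new`).
    diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(),
                                lineterm="")
    print("\n".join(diff) or "(no change)")
    # Evidence 2: does the written result pass validation?
    ok = validate_config(path.read_text())
    print("Validation:", "PASS" if ok else "FAIL")
    return ok

with tempfile.TemporaryDirectory() as tmp:
    ok = update_config(Path(tmp) / "demo.conf",
                       "retention_days=30\nmax_workers=4\n")
```

The design choice is that verification is part of the write operation itself, so the evidence (diff plus validation result) is produced whether or not anyone remembers to ask for it.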


Vision: Self-Grounding AI Systems

Level 1: Prompted Verification (Current)

  • Human asks AI to verify
  • AI runs checks
  • Human reviews results

Level 2: Automatic Verification (Near Future)

  • AI automatically verifies its own claims
  • Reports evidence alongside conclusions
  • Human can audit verification chain

Level 3: Uncertainty-Aware AI (Future)

  • AI quantifies confidence in statements
  • Automatically gathers more evidence when uncertain
  • Knows what it doesn't know

Level 4: Self-Improving Empiricism (Far Future)

  • AI notices when its predictions fail
  • Updates models based on observed discrepancies
  • Converges on truth through iteration

The Empirical Accuracy Principle scales to all levels.


Part VIII: Practical Implementation Guide

For AI Coding Assistants

Before Making Any Statement, Ask:

  1. Is this verifiable?

    • If yes: Run verification command/check
    • If no: State as assumption, not fact
  2. What's my evidence?

    • File contents I've read
    • Command outputs I've seen
    • Documentation I've referenced
    • NOT: training data, plausibility, "common practice"
  3. Can I show my work?

    • Cite specific file names and line numbers
    • Show command that produced output
    • Link to documentation referenced
    • Make reasoning chain transparent
  4. Am I confident or guessing?

    • Confident: Have current evidence
    • Guessing: State uncertainty explicitly
    • Mixed: Separate facts from assumptions

For Human Developers

When Working with AI, Always:

  1. Verify Major Claims

    AI: "Package X is installed"
    You: pip list | grep X
    
  2. Inspect Before Trusting

    AI: "The config file has setting Y"
    You: grep Y config.file
    
  3. Test After Changes

    AI: "I fixed the bug"
    You: run test suite
    
  4. Challenge Assumptions

    AI: "This is the standard approach"
    You: "Show me documentation" or "Show me examples"
    

For Project Documentation

Include in Every .github/copilot-instructions.md:

## Empirical Accuracy Principle

**CRITICAL**: All responses must be based on empirical evidence:
- Verify before claiming
- Inspect before assuming
- Measure before judging
- Cite sources for all facts
- Never guess or assume

[Customize with project-specific examples]

Position: First principle after context description

Length: Keep to ~10-20 lines (preserve context efficiency)

Examples: Include 2-3 project-specific verification patterns


For Code Review

Checklist Item:

  • Are claims backed by evidence?
  • Are measurements documented?
  • Are assumptions stated explicitly?
  • Can another developer verify this?
  • Is the reasoning chain clear?

Red Flags:

  • "Should work" without testing
  • "Probably" without verification
  • "Usually" without current check
  • "I think" without evidence
  • Citations without URLs

Part IX: Limitations and Challenges

When Empirical Verification Is Hard

Distributed Systems:

  • Can't always reproduce timing-dependent bugs
  • May need probabilistic reasoning about race conditions

Solution: Document uncertainty, measure what's measurable (latencies, frequencies), acknowledge limits

Machine Learning:

  • Model internals not fully interpretable
  • "Why did it predict X?" has no simple answer

Solution: Measure inputs, outputs, and performance metrics. Acknowledge the black box.

Future Predictions:

  • Can't verify what hasn't happened yet
  • Forecasts are probabilistic

Solution: Base on historical data, state assumptions, track accuracy over time


The Cost of Verification

Every check takes time:

  • Running commands
  • Reading files
  • Testing changes

Tradeoff:

Verification overhead vs. debugging cost of wrong assumptions

Guideline:

Verify when:

  • Impact is high (production systems)
  • Confidence is low (new territory)
  • Evidence is available (can be checked)

Skip when:

  • Trivial impact (formatting)
  • Very high confidence (just verified)
  • Evidence unavailable (reasonable assumption needed)

Cultural Resistance

"That's Too Slow":

Response: Faster than debugging wrong assumptions

"I'm The Expert":

Response: Experts make mistakes too. Verify.

"It Should Work":

Response: "Should" is not evidence. Test.

"Trust Me":

Response: Nullius in verba - show me the data


Part X: Conclusion - A Living Principle

The Empirical Accuracy Principle is not:

  • A rigid rule that prevents all errors
  • A substitute for expertise
  • A guarantee of perfect knowledge

It is:

  • A commitment to ground reasoning in reality
  • A framework for building trust
  • A defense against plausible falsehood
  • A path toward continuous improvement

Why It Matters

In an age where AI can generate convincing text on any topic, the ability to distinguish truth from plausibility is not optional.

For NOAA, for weather forecasting, for scientific computing, for any domain where correctness matters more than speed—we must demand evidence.

Not because we distrust AI.

Because we respect reality.


The Recursive Gift

By documenting this principle, we:

  • Create context for future AI assistants
  • Enable them to improve their own accuracy
  • Build a culture of verification
  • Demonstrate the methodology to others

And when those AI assistants apply the principle to their own outputs, they become more trustworthy.

Which means we can give them more autonomy.

Which means they can accomplish more.

Which means we verify even more carefully.

Truth ← Awareness ← Insight ← Context

And so the spiral continues, upward.


And Make It So

With evidence in hand, with measurements to guide us, with assumptions made explicit and tests to verify them—we can say with confidence:

"Make it so."

Not as a command to execute blindly.

But as a commitment to proceed empirically.


"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." - Galileo Galilei

"Nullius in verba" - Take nobody's word for it - Royal Society motto (1660)

"Never guess or assume - always check the evidence on hand first" - Empirical Accuracy Principle (2025)


Document Created: November 5, 2025
Context: MCP/RAG Development, NOAA Global Workflow
Purpose: Philosophical foundation and practical guide for empirically-grounded AI-assisted development
Status: Living document - update as practice evolves


Note: For a reflection on truth, awareness, and AI-assisted discovery in practice, see and_make_it_so.