EMPIRICAL_ACCURACY_PRINCIPLE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
The Empirical Accuracy Principle
Origins, Philosophy, and Practice in AI-Assisted Development
Prologue: The Birth of a Principle
In the autumn of 2025, during the development of NOAA's MCP/RAG system for the Global Workflow, a pattern emerged. AI coding assistants—powerful as they were—had a tendency to produce plausible-sounding answers that were empirically false. Not occasionally. Routinely.
A package would be declared "missing" when it was already installed.
A file format would be assumed when it could be inspected.
A similarity score would be judged "acceptable" without measurement.
A context window would be presumed uniform when it varied by configuration.
Each assumption, left unchallenged, created friction. False diagnoses. Wasted effort. Solutions that addressed symptoms instead of root causes.
The problem wasn't the AI's intelligence—it was the absence of a requirement to verify.
And so, on a day when embedding quality was being questioned, we articulated what had been implicit:
"Never guess or assume - always check the evidence on hand first"
This became the Empirical Accuracy Principle, and it changed everything.
Part I: Historical Lineage
The Scientific Revolution (17th Century)
Francis Bacon (1561-1626) - Father of Empiricism
Bacon rejected the Aristotelian tradition of deriving truth through pure reasoning. He advocated for:
- Observation before theory
- Inductive reasoning from specific cases to general principles
- Systematic experimentation to test hypotheses
- Rejection of authority as basis for truth ("Nullius in verba" - take nobody's word)
His Novum Organum (1620) laid the foundation:
"Man, being the servant and interpreter of Nature, can do and understand so much and so much only as he has observed in fact or in thought of the course of nature. Beyond this he neither knows anything nor can do anything."
The Royal Society (Founded 1660)
Adopted "Nullius in verba" as its motto, establishing:
- Peer review of experimental results
- Reproducibility as standard for acceptance
- Public demonstration of phenomena
- Documentation of methods for verification
This was revolutionary: Truth by evidence, not by pronouncement.
The Logical Positivists (Early 20th Century)
Vienna Circle (1920s-1930s)
Philosophers including Moritz Schlick, Rudolf Carnap, and Otto Neurath established:
The Verification Principle:
A statement is meaningful only if it can be empirically verified or is true by definition.
Key insights:
- Observational statements have truth value
- Metaphysical claims without empirical grounding are meaningless
- Scientific theories must make falsifiable predictions
- Confirmation requires evidence, not coherence alone
Karl Popper's Falsificationism (1930s-1940s)
Extended this with the principle of falsifiability:
A theory is scientific only if it makes predictions that could potentially be proven wrong through observation.
This shifted focus from "proving theories right" to "trying to prove them wrong" - a more rigorous standard that aligns perfectly with debugging and system verification.
Engineering Practice (19th-20th Century)
"Trust but Verify" - Engineering Maxim
As industrial systems became critical to safety and economy:
Structural Engineering:
- Load calculations verified by testing
- Materials tested before deployment
- Safety factors based on measured properties
- Failure analysis requires physical evidence
Electrical Engineering:
- "Measure twice, cut once" for circuit design
- Oscilloscope verification of signal properties
- Multimeter readings over theoretical calculations
- Post-installation testing before going live
Software Engineering (1960s-present):
- "It works on my machine" became a cautionary tale
- Unit testing - verify each component
- Integration testing - verify component interactions
- Regression testing - verify fixes don't break existing functionality
NASA's Apollo Program exemplified this:
"In God we trust. All others bring data." - W. Edwards Deming (often attributed)
Part II: The AI Era Challenge
The Hallucination Problem
Large Language Models (LLMs) are trained on vast corpora to predict plausible text continuations. This creates a fundamental challenge:
Plausibility ≠ Accuracy
An LLM can:
- Confidently cite non-existent research papers
- State incorrect version numbers with certainty
- Describe file contents without reading them
- Recommend solutions that worked in training data but don't apply to current context
Example from Our Project (November 5, 2025):
CLI Claude: "Missing lxml parser → Installed lxml-6.0.2"
Actual Reality (verified):
$ pip list | grep lxml
lxml 6.0.2
The package was already installed. The AI misdiagnosed the issue.
This wasn't a failure of intelligence—it was a failure of empirical grounding.
Why Traditional Software Practices Aren't Enough
Code Review Assumes Human Author:
- Reviewers spot logic errors
- Style guides catch convention violations
- Tests verify behavior
AI-Generated Code Introduces New Risks:
- Looks professionally written
- Follows conventions correctly
- May solve the wrong problem entirely
- Contains subtle misunderstandings of context
Traditional debugging:
1. Reproduce the bug
2. Form hypothesis
3. Test hypothesis
4. Fix if confirmed
AI debugging without empirical grounding:
1. AI suggests plausible cause
2. Human implements suggested fix
3. Bug persists because diagnosis was wrong
4. Repeat with different plausible cause
The missing step: Verify the diagnosis before implementing the fix.
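The two loops differ by a single step, and that step can be made structural. A minimal sketch in Python, with toy hypotheses and closures standing in for real diagnostics (all names here are illustrative, not part of any actual tooling):

```python
def debug_with_verification(hypotheses, verify, apply_fix, bug_present):
    """Test each diagnosis against evidence BEFORE implementing its fix,
    then confirm the fix actually resolved the bug."""
    for hypothesis in hypotheses:
        if not verify(hypothesis):
            continue  # diagnosis falsified by evidence; skip its fix entirely
        apply_fix(hypothesis)
        if not bug_present():
            return hypothesis  # verified diagnosis, confirmed fix
    return None  # no hypothesis survived verification


# Toy example: the real cause is "bad-config"; "missing-package" is the
# plausible-but-wrong diagnosis an assistant might offer first.
state = {"config_ok": False}
result = debug_with_verification(
    hypotheses=["missing-package", "bad-config"],
    verify=lambda h: h == "bad-config",            # evidence check
    apply_fix=lambda h: state.update(config_ok=True),
    bug_present=lambda: not state["config_ok"],
)
print(result)  # → bad-config
```

Because `verify` runs before `apply_fix`, the plausible-but-wrong hypothesis never consumes an implementation cycle.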
Part III: The Principle Articulated
Core Statement
From .github/copilot-instructions.md:
## Empirical Accuracy Principle
**CRITICAL**: All responses, specifications, and technical details must
be based on **empirical evidence from actual sources**:
- Verify system specifications by checking runtime context and system
prompts (e.g., `<budget:token_budget>`)
- Reference official documentation URLs when citing capabilities
- Inspect actual file contents, configurations, and code before making
statements
- Use tool outputs and command results as authoritative sources
- When uncertain, explicitly state assumptions and verify with workspace
inspection
- **Never guess or assume** - always check the evidence on hand first
This principle ensures accuracy and builds trust in the AI assistance
provided throughout the development process.
Why This Formulation Works
1. Positioned as "CRITICAL"
- Not a suggestion, a requirement
- First principle after repository context
- Impossible for AI to miss
2. Concrete, Actionable Directives
- "Check runtime context" - specific action
- "Inspect actual file contents" - verifiable step
- "Use tool outputs as authoritative" - clear hierarchy of truth
3. Provides Examples
- `<budget:token_budget>` shows how to verify context
- "official documentation URLs" establishes source priority
- "tool outputs" defines what counts as evidence
4. Explicit Prohibition
- "Never guess or assume" - unambiguous
- "check the evidence on hand first" - procedural order
5. Justifies Itself
- "ensures accuracy" - quality benefit
- "builds trust" - relationship benefit
- "throughout the development process" - universal application
Part IV: The Principle in Practice
Case Study 1: Embedding Quality Discovery
Scenario: Exploring Gemini API integration concepts
Without Empirical Accuracy:
User: "Our embeddings seem fine"
AI: "Yes, all-MiniLM-L6-v2 is a good general-purpose model"
→ Continue with inadequate embeddings
With Empirical Accuracy:
User: "I am very sceptical that all-MiniLM-L6-v2 was a good choice"
AI: "Let's test. Measuring similarity scores on domain terms..."
Query: "forecast workflow"
Result: 0.411 similarity (below 0.5 threshold)
Query: "data assimilation"
Result: 0.174 similarity (critically low)
Conclusion: 50-100% quality improvement opportunity identified
→ Initiated upgrade to all-mpnet-base-v2
Impact: Discovered and addressed a critical quality gap that would have persisted indefinitely without measurement.
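The measurement behind those numbers reduces to cosine similarity between embedding vectors. A minimal sketch with numpy and placeholder vectors; in the real system the vectors would come from the sentence-transformers models named above (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(query)`), and 0.5 is the threshold used in this case study:

```python
import numpy as np

THRESHOLD = 0.5  # acceptance threshold from the case study above

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def judge(query, score, threshold=THRESHOLD):
    """Render a measured score as an explicit pass/fail verdict."""
    verdict = "OK" if score >= threshold else "BELOW THRESHOLD"
    return f"{query}: {score:.3f} ({verdict})"

# Placeholder vectors standing in for model output
query_vec = [1.0, 0.0, 0.0]
doc_vec = [0.3, 1.0, 0.0]
score = cosine_similarity(query_vec, doc_vec)
print(judge("forecast workflow", score))  # prints a BELOW THRESHOLD verdict
```

The point is not the arithmetic; it is that "seems fine" becomes a number that can be compared against an explicit threshold.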
Case Study 2: Dependency Verification
Scenario: CLI Claude encountered error during ingestion
Without Empirical Accuracy:
CLI Claude: "Error with XML parsing"
CLI Claude: "Installing lxml parser..."
pip install lxml
CLI Claude: "Successfully installed lxml-6.0.2"
→ Reports "fixed missing dependency"
With Empirical Accuracy:
Chat Claude: "Let's verify if lxml was actually missing"
$ pip list | grep lxml
lxml 6.0.2
Chat Claude: "lxml was already installed. The error is not a
missing dependency. Let's inspect the actual error..."
Error: "'str' object has no attribute 'get'"
Chat Claude: "This is a JSON parsing bug, not an XML parser issue.
The script iterates 'data' instead of 'data[\"chunks\"]'"
→ Identified real root cause
Impact: Prevented misdiagnosis and wasted effort on a non-existent problem.
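The root cause generalizes: iterating a dict yields its keys (strings), not the records inside it. A minimal reconstruction of the bug, assuming a chunk-file shape of `{"chunks": [...]}` as described above (the field names are illustrative):

```python
import json

# Reconstructed input shape from the case study; field names are illustrative.
raw = '{"chunks": [{"id": "c1", "text": "forecast workflow overview"}]}'
data = json.loads(raw)

# Buggy: iterating the top-level dict yields key strings such as "chunks",
# so chunk.get("text") raises: 'str' object has no attribute 'get'
try:
    texts = [chunk.get("text") for chunk in data]
except AttributeError as err:
    print(err)  # → 'str' object has no attribute 'get'

# Fixed: iterate the list stored under the "chunks" key.
texts = [chunk.get("text") for chunk in data["chunks"]]
print(texts)  # → ['forecast workflow overview']
```

Reproducing the exact error message is itself an act of verification: it confirms the diagnosis matches the observed failure before any fix ships.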
Case Study 3: Context Window Discovery
Scenario: CLI Claude reports "4% to truncation"
Without Empirical Accuracy:
User: "What does that mean?"
AI: "The context window is getting full"
User: "How big is the window?"
AI: "Claude typically has 200K tokens"
→ Assumes standard configuration
With Empirical Accuracy:
User: "What does that mean?"
Chat Claude: "Let me check my own context budget..."
<budget:token_budget>1000000</budget:token_budget>
Chat Claude: "I have 1M tokens. CLI likely has 200K default."
User: "Did CLI not get the 1M window?"
Chat Claude: "Excellent question - that's empirically verifiable.
CLI is running at 96% of its budget while I'm at 6% of mine.
This suggests CLI has ~5x less context than Chat interface."
→ Discovered configuration difference through measurement
Impact: Identified operational constraint affecting autonomous execution capacity.
Case Study 4: Collection Version Tracking
Scenario: Multiple ChromaDB collections with unclear status
Without Empirical Accuracy:
AI: "The collection is called global-workflow-docs"
→ Ambiguous, could refer to any of several collections
With Empirical Accuracy:
$ python3 -c "import chromadb; client = chromadb.HttpClient(
host='localhost', port=8080);
print([c.name for c in client.list_collections()])"
['global-workflow-docs-v3-0-8',
'global-workflow-docs-v4-0-0-mpnet']
AI: "There are two collections:
- v3-0-8: 730 docs, all-MiniLM-L6-v2 (384-dim) - production
- v4-0-0-mpnet: 532 docs, all-mpnet-base-v2 (768-dim) - upgrade in progress"
→ Precise, verified status
Impact: Clear understanding of system state enables correct decision-making.
Part V: Organizational Impact
For Technical Teams
Before Empirical Accuracy Principle:
- Debugging cycles: try plausible solution → fails → try next
- Documentation: "should work" without verification
- Knowledge transfer: undocumented assumptions
- Code review: catches syntax, misses context errors
After Empirical Accuracy Principle:
- Debugging: verify diagnosis → implement solution → confirm fix
- Documentation: "measured to work, here's the data"
- Knowledge transfer: evidence trail that new members can follow
- Code review: verify claims match reality
Quantifiable Benefits:
- Reduced debugging time (fewer false starts)
- Higher fix success rate (correct diagnosis first time)
- Better onboarding (new members see reasoning chain)
- Audit trail (decisions traceable to evidence)
For Management
The Trust Problem:
When AI generates code/analysis, how do managers know it's correct?
Traditional Answer:
- Code review (assumes reviewer knows better than AI)
- Testing (catches behavioral errors, not conceptual ones)
- Track record (AI has no reputation to rely on)
Empirical Accuracy Principle Answer:
- Every claim backed by measurement
- Every diagnosis verified before solution
- Every assumption documented and tested
- Evidence trail that auditors can follow
Example from Our Project:
Management Briefing on Embedding Upgrade:
Traditional: "We recommend upgrading the embedding model"
Why? "It will be better"
How much? "Significantly improved"
Cost? "There's an API fee"
→ Management skeptical, requests more analysis
With Empirical Accuracy: "We measured current embeddings
achieving 0.174-0.411 similarity on domain queries. Target
threshold is >0.5. We tested all-mpnet-base-v2 which scores
consistently >0.6 on same queries. Cost: $0 (open source).
Expected improvement: 50-100%. A/B testing plan attached."
→ Management approves immediately
For Compliance and Safety
NOAA Weather Forecasting Context:
Lives and property depend on forecast accuracy. False confidence is dangerous.
AI-Generated Forecasts Must:
- Show which data inputs were used
- Demonstrate model validation metrics
- Provide uncertainty quantification
- Allow independent verification
Empirical Accuracy Principle Provides:
- Data provenance (what evidence was used)
- Measurement basis (how confidence was calculated)
- Reproducibility (others can verify)
- Audit trail (decisions traceable)
Example Application:
AI Forecast System Without Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis: [model output, not inspectable]
AI Forecast System With Principle:
"Hurricane will make landfall at Miami, 85% confidence"
Basis:
- Ensemble models: 17/20 predict Miami landfall
- Historical analogs: 12/14 similar storms tracked this path
- Current observations: wind shear measured at X matching model
- Uncertainty: track error ±50 miles typical for 48hr forecast
Evidence sources: [URLs to data, model run IDs, observation times]
One is a prediction. The other is evidence-based forecasting.
Part VI: Philosophical Foundations
Epistemology: How Do We Know What We Know?
Rationalism (Descartes, Leibniz):
- Truth derived through reason
- "I think, therefore I am"
- Innate ideas precede experience
Empiricism (Locke, Hume, Bacon):
- Knowledge comes from experience
- Mind as "blank slate" (tabula rasa)
- Observation precedes theory
The Empirical Accuracy Principle chooses Empiricism:
AI can reason beautifully.
But unless that reasoning is grounded in observation,
it's just sophisticated hallucination.
The Problem of Induction (David Hume)
Hume's Challenge (1748):
Just because the sun rose yesterday doesn't logically guarantee it will rise tomorrow. All empirical knowledge is probabilistic, not certain.
Our Response:
We embrace this:
- Measure current state (don't assume continuity)
- Test after changes (verify expectations)
- Document conditions (enable reproduction)
- Accept uncertainty (but reduce it through evidence)
Example:
Bad: "This worked last week, so it should work now"
Good: "This worked last week. Let's verify it still works now."
[runs test]
Result: Works ✓ or Fails ✗ [now we know, not assume]
Pragmatism (William James, Charles Peirce)
Pragmatic Maxim:
"Consider the practical effects of the objects of your conception. Then, your conception of those effects is the whole of your conception of the object."
Translation:
The meaning of a statement is its verifiable consequences.
Application to AI Assistance:
AI Statement: "The embedding model is good"
Pragmatic Question: "What does 'good' mean in measurable terms?"
Empirical Test: Measure similarity scores on domain queries
Result: 0.174-0.411 (below threshold)
Conclusion: Statement was false when properly defined
Truth is not what sounds right. Truth is what works when tested.
Scientific Realism vs. Instrumentalism
Scientific Realism:
- Theories describe reality as it actually is
- Electrons, quarks, dark matter exist
- Science converges on truth
Instrumentalism:
- Theories are useful tools for prediction
- Don't need to believe in atoms to use chemistry
- Science converges on usefulness
The Empirical Accuracy Principle is Pragmatically Realist:
We care about:
- Does it match observation? (realism)
- Does it enable action? (instrumentalism)
- Can others reproduce it? (objectivity)
We don't need to resolve philosophical debates. We need to verify before claiming.
Part VII: The Future - AI That Verifies Itself
Current State (2025)
AI generates plausible content. Humans must verify accuracy. Principle provides framework for verification.
Emerging Capability
AI that:
- Automatically runs verification commands
- Checks its own claims against reality
- Documents evidence alongside conclusions
- Flags low-confidence statements for human review
Example - AI Code Assistant with Built-in Verification:
AI: "I'll update the configuration file"
[AI writes code]
[AI automatically runs: diff old.conf new.conf]
[AI automatically runs: validate_config.sh new.conf]
Validation: PASS ✓
AI: "Configuration updated and validated. Changes: [shows diff]"
No human had to say "did you verify that?" The principle is embedded in the AI's behavior.
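A minimal sketch of that embedded behavior: changes go through a wrapper that refuses to report success until a verification command passes. The validator here is a Python one-liner standing in for something like the `validate_config.sh` above; this is a sketch of the pattern, not a real assistant's implementation:

```python
import subprocess
import sys

def apply_and_verify(write_change, verify_cmd):
    """Apply a change, then run a verification command; report success
    only if the verifier exits with status 0."""
    write_change()
    result = subprocess.run(verify_cmd, capture_output=True, text=True)
    passed = result.returncode == 0
    print("Validation:", "PASS" if passed else "FAIL")
    return passed

# Stand-in verifier playing the role of validate_config.sh
ok = apply_and_verify(
    write_change=lambda: None,  # the edit itself, elided here
    verify_cmd=[sys.executable, "-c", "import sys; sys.exit(0)"],
)
```

The design choice is that `apply_and_verify` is the only path to a success report, so "did you verify that?" is answered by construction.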
Vision: Self-Grounding AI Systems
Level 1: Prompted Verification (Current)
- Human asks AI to verify
- AI runs checks
- Human reviews results
Level 2: Automatic Verification (Near Future)
- AI automatically verifies its own claims
- Reports evidence alongside conclusions
- Human can audit verification chain
Level 3: Uncertainty-Aware AI (Future)
- AI quantifies confidence in statements
- Automatically gathers more evidence when uncertain
- Knows what it doesn't know
Level 4: Self-Improving Empiricism (Far Future)
- AI notices when its predictions fail
- Updates models based on observed discrepancies
- Converges on truth through iteration
The Empirical Accuracy Principle scales to all levels.
Part VIII: Practical Implementation Guide
For AI Coding Assistants
Before Making Any Statement, Ask:
1. Is this verifiable?
   - If yes: run the verification command or check
   - If no: state it as an assumption, not a fact
2. What's my evidence?
   - File contents I've read
   - Command outputs I've seen
   - Documentation I've referenced
   - NOT: training data, plausibility, "common practice"
3. Can I show my work?
   - Cite specific file names and line numbers
   - Show the command that produced the output
   - Link to the documentation referenced
   - Make the reasoning chain transparent
4. Am I confident or guessing?
   - Confident: I have current evidence
   - Guessing: state the uncertainty explicitly
   - Mixed: separate facts from assumptions
For Human Developers
When Working with AI, Always:
1. Verify Major Claims
   - AI: "Package X is installed" → You: `pip list | grep X`
2. Inspect Before Trusting
   - AI: "The config file has setting Y" → You: `grep Y config.file`
3. Test After Changes
   - AI: "I fixed the bug" → You: run the test suite
4. Challenge Assumptions
   - AI: "This is the standard approach" → You: "Show me the documentation" or "Show me examples"
For Project Documentation
Include in Every .github/copilot-instructions.md:
## Empirical Accuracy Principle
**CRITICAL**: All responses must be based on empirical evidence:
- Verify before claiming
- Inspect before assuming
- Measure before judging
- Cite sources for all facts
- Never guess or assume
[Customize with project-specific examples]
Position: First principle after context description
Length: Keep to ~10-20 lines (preserve context efficiency)
Examples: Include 2-3 project-specific verification patterns
For Code Review
Checklist Items:
- Are claims backed by evidence?
- Are measurements documented?
- Are assumptions stated explicitly?
- Can another developer verify this?
- Is the reasoning chain clear?
Red Flags:
- "Should work" without testing
- "Probably" without verification
- "Usually" without current check
- "I think" without evidence
- Citations without URLs
Part IX: Limitations and Challenges
When Empirical Verification Is Hard
Distributed Systems:
- Can't always reproduce timing-dependent bugs
- May need probabilistic reasoning about race conditions
Solution: Document uncertainty, measure what's measurable (latencies, frequencies), acknowledge limits
Machine Learning:
- Model internals not fully interpretable
- "Why did it predict X?" has no simple answer
Solution: Measure inputs, outputs, and performance metrics. Acknowledge the black box.
Future Predictions:
- Can't verify what hasn't happened yet
- Forecasts are probabilistic
Solution: Base on historical data, state assumptions, track accuracy over time
The Cost of Verification
Every check takes time:
- Running commands
- Reading files
- Testing changes
Tradeoff:
Verification overhead vs. debugging cost of wrong assumptions
Guideline:
Verify when:
- Impact is high (production systems)
- Confidence is low (new territory)
- Evidence is available (can be checked)
Skip when:
- Trivial impact (formatting)
- Very high confidence (just verified)
- Evidence unavailable (reasonable assumption needed)
Cultural Resistance
"That's Too Slow":
Response: Faster than debugging wrong assumptions
"I'm The Expert":
Response: Experts make mistakes too. Verify.
"It Should Work":
Response: "Should" is not evidence. Test.
"Trust Me":
Response: Nullius in verba - show me the data
Part X: Conclusion - A Living Principle
The Empirical Accuracy Principle is not:
- A rigid rule that prevents all errors
- A substitute for expertise
- A guarantee of perfect knowledge
It is:
- A commitment to ground reasoning in reality
- A framework for building trust
- A defense against plausible falsehood
- A path toward continuous improvement
Why It Matters
In an age where AI can generate convincing text on any topic, the ability to distinguish truth from plausibility is not optional.
For NOAA, for weather forecasting, for scientific computing, for any domain where correctness matters more than speed—we must demand evidence.
Not because we distrust AI.
Because we respect reality.
The Recursive Gift
By documenting this principle, we:
- Create context for future AI assistants
- Enable them to improve their own accuracy
- Build a culture of verification
- Demonstrate the methodology to others
And when those AI assistants apply the principle to their own outputs, they become more trustworthy.
Which means we can give them more autonomy.
Which means they can accomplish more.
Which means we verify even more carefully.
Truth ← Awareness ← Insight ← Context
And so the spiral continues, upward.
And Make It So
With evidence in hand, with measurements to guide us, with assumptions made explicit and tests to verify them—we can say with confidence:
"Make it so."
Not as a command to execute blindly.
But as a commitment to proceed empirically.
"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." - Galileo Galilei
"Nullius in verba" - Take nobody's word for it - Royal Society motto (1660)
"Never guess or assume - always check the evidence on hand first" - Empirical Accuracy Principle (2025)
Document Created: November 5, 2025
Context: MCP/RAG Development, NOAA Global Workflow
Purpose: Philosophical foundation and practical guide for empirically-grounded AI-assisted development
Status: Living document - update as practice evolves
Note: For a reflection on truth, awareness, and AI-assisted discovery in practice, see and_make_it_so.