Home - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Global Workflow Wiki

Welcome to the NOAA Global Workflow technical wiki. This knowledge base documents solutions, configurations, and insights for operating and developing the Global Forecast System workflow.

πŸ” NEW: ICSDIR_ROOT Removal Impact Analysis

(March 16, 2026)

ICSDIR_ROOT-Removal-Impact-Analysis — GraphRAG-assisted analysis of safely removing ICSDIR_ROOT from CI platform configs and case YAMLs

MCP GraphRAG tools (find_env_dependencies, get_code_context, search_documentation) were used to trace ICSDIR_ROOT through the full dependency chain: 6 platform configs → 25 CI case YAMLs → create_experiment.py → setup_expt.py → config.stage_ic.j2 → parm/stage/*.yaml.j2. Analysis confirms the variable is redundant with BASE_IC (defined in dev/workflow/hosts/<platform>.yaml), as config.stage_ic.j2 already has a built-in fallback. Documents 19 safe-to-remove cases, 8 non-standard exceptions, and required CTest/unit test updates.
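
The fallback behavior described above can be sketched in a few lines of Python (a toy stand-in for the Jinja2 default in config.stage_ic.j2; the helper name is hypothetical):

```python
def resolve_ics_root(env: dict, base_ic: str) -> str:
    """Prefer ICSDIR_ROOT if set and non-empty, otherwise fall back to
    BASE_IC. Variable names come from the analysis above; this helper
    itself is illustrative, not workflow code."""
    value = env.get("ICSDIR_ROOT", "").strip()
    return value if value else base_ic

# With ICSDIR_ROOT unset, the BASE_IC fallback applies:
print(resolve_ics_root({}, "/scratch/BASE_IC"))   # /scratch/BASE_IC
# An explicit ICSDIR_ROOT still wins:
print(resolve_ics_root({"ICSDIR_ROOT": "/tmp/ics"}, "/scratch/BASE_IC"))  # /tmp/ics
```

This is why removal is safe for the standard cases: deleting the variable simply exercises the fallback branch that already exists.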


☁️ NEW: Parallel Works RDHPCS Platform Dashboard

(March 6, 2026)

Parallel-Works-RDHPCS-Platform-Dashboard — Comprehensive inventory of the NOAA RDHPCS Hybrid Cloud: 35 clusters, $945K budget, storage across AWS/Google/Azure

Live-queried dashboard covering all compute clusters (35 total, 7 owned by Terry), cost & budget analysis ($91.6K of $945K spent across 8 groups), storage inventory (33 resources: buckets, NFS, Lustre, disks), networking (4 VPCs, 1 static IP), active sessions, ML workspaces, and platform configuration. Data collected automatically via 26 PW MCP tool queries against the Parallel Works REST API.


🔌 NEW: PW MCP Toolset Documentation — 26 Tools for Cloud Infrastructure AI

(March 6, 2026)

PW-MCP-Toolset-Documentation — Complete reference for the Parallel Works MCP Server: 26 tools across 7 categories

Documents every tool in the parallel-works-mcp server including authentication, compute management, cost analysis, storage (6 tools), networking (3 tools), workflows, and ML workspaces. Covers the Phase 37 SDD expansion (19 → 26 tools, 548 LOC added), API endpoint coverage map (24 endpoints), architecture diagram, and configuration guide. All tools live-tested and verified against the PW v7.15.1 API.


🔧 NEW: Neo4j GraphRAG Ingestion Pipeline

(March 5, 2026)

Neo4j-GraphRAG-Ingestion-Pipeline — Complete guide to how the 41K-node, 589K-relationship knowledge graph gets built from source code

Documents all 15+ ingestion scripts across 4 pipeline stages: Fortran/Python/Shell code graph creation, vector embeddings, documentation crawling, and hierarchical community detection with LLM summarization. Includes execution order, node/relationship counts, and architecture diagrams.


NEW: MCP-RAG Platform 32-Day Achievement Synopsis

(March 3, 2026)

MCP-RAG-Platform-32-Day-Achievement-Synopsis — Five major breakthroughs from Jan 30 – Mar 3, 2026 (v7.10.0 → v7.25.1)

Sixteen releases spanning hierarchical GraphRAG communities (1,036 nodes, 828 LLM summaries), cross-language graph unification (Shell→Fortran→Python), 5 new agentic MCP tools with session state tracking, NCEPLIBS documentation ingestion, and instruction file architecture that reduced agent context window usage by 35%. Neo4j relationships grew from ~485K to 589K; ChromaDB documents from ~60K to 66.5K; 14 SDD sessions completed with 0 abandoned.


οΏ½πŸ“ NEW: SDD Workflow β€” Concept to Code via CLI /plan Handoff

(February 24, 2026)

Interactive SVG diagram of the full Spec-Driven Development pipeline


This visual reference documents the end-to-end development process used by the EIB MCP-RAG platform, where Spec-Driven Development (SDD) governs the lifecycle from conceptual design through autonomous code implementation and back to human review.

Two Execution Modalities, One Session Model

| Lane | Steps | Actor |
|------|-------|-------|
| Human + AI (IDE) | 1. Discover → 2. Spec Design → 3. Resolve Decisions → 4. Prep Handoff | Interactive (VS Code Copilot) |
| CLI Agent (--yolo) | 5. /plan Decompose → 6. Start Session → 7. Implement + Record | Autonomous (Copilot CLI) |
| Validation Gates | generate-tool-docs --check → npm test → health check → Docker rebuild → Gateway verify | Automated |
| Persistent State | active_session.json, history.jsonl, checkpoints/, workflows/ | Phase 31 SessionManager |

The diagram illustrates 8 numbered steps across swim lanes, with the CLI /plan handoff (Step 4 → 5) as the bridge between human-guided design and autonomous implementation. Both modalities share the same filesystem-based session state (Phase 31 model), enabling real-time monitoring from either side.

Includes embedded description with validation pipeline details, persistent state lifecycle, and modality-aware execution principles. If the inline preview is blocked by your browser, open the HTML file directly.
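
The shared session state can be illustrated with a minimal Python sketch, assuming a simplified record shape (the field names here are illustrative, not the Phase 31 SessionManager's actual schema):

```python
import json
import tempfile
from pathlib import Path

def record_step(state_dir: Path, step: str) -> dict:
    """Append a step to history.jsonl and refresh active_session.json,
    so either modality (IDE or CLI agent) can observe progress by
    reading the same files."""
    session = {"current_step": step}
    (state_dir / "active_session.json").write_text(json.dumps(session))
    with (state_dir / "history.jsonl").open("a") as fh:
        fh.write(json.dumps({"step": step}) + "\n")
    return session

state = Path(tempfile.mkdtemp())
record_step(state, "plan")
record_step(state, "implement")
print(json.loads((state / "active_session.json").read_text())["current_step"])  # implement
```

Because the state lives on the filesystem rather than in one process, a monitor on either side only needs read access to follow the session.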


πŸ—οΈ MILESTONE: GraphRAG Hierarchical Community Materialization

(February 24, 2026)

GraphRAG-Hierarchical-Community-Materialization — 4-Level Navigable Knowledge Graph of the Global Workflow

The NOAA Global Workflow now has a hierarchical community structure in Neo4j — 1,036 Community nodes across 4 levels, enabling multi-resolution understanding of how 40,000+ code entities organize into subsystems and how those subsystems interact.

Key Results

| Metric | Before | After |
|--------|--------|-------|
| Community nodes | 0 | 1,036 (L0: 694, L1: 175, L2: 86, L3: 81) |
| MEMBER_OF relationships | 0 | 21,559 |
| PARENT_OF hierarchy | 0 | 978 edges (valid acyclic tree) |
| INTERACTS_WITH edges | 0 | 1,297 cross-community links |
| Community summaries | 63 flat | 828 hierarchical (4 levels) |

What This Means

Ask "How does data assimilation interact with the forecast model?" and the system traverses L3 → L2 → L1 → L0 communities, returning subsystem boundaries, interaction strengths, and member-level detail, not just text similarity matches. This is Graph-Guided Semantic Retrieval operating on a structural map of one of the most complex computational workflows on Earth.
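
The multi-resolution drill-down can be sketched in plain Python over a toy two-level slice of the hierarchy (the community names and members below are invented for illustration; the real traversal runs against Neo4j):

```python
# Toy PARENT_OF / MEMBER_OF edges standing in for the materialized graph.
PARENT_OF = {"L1:modeling": ["L0:forecast", "L0:data_assim"]}
MEMBER_OF = {"L0:forecast": ["fv3_run.sh"], "L0:data_assim": ["gsi_main.f90"]}

def drill_down(community: str) -> list:
    """Walk PARENT_OF edges down to leaf communities, then return members."""
    children = PARENT_OF.get(community)
    if children is None:  # leaf community: return its member entities
        return MEMBER_OF.get(community, [])
    members = []
    for child in children:
        members.extend(drill_down(child))
    return members

print(drill_down("L1:modeling"))  # ['fv3_run.sh', 'gsi_main.f90']
```

A query can stop at any level, which is what makes the structure multi-resolution: coarse answers come from high-level communities, fine answers from leaf members.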

Phase 24E-5 — Spec-Driven Development, dual-agent execution + independent verification, 6-minute implementation, 6/6 tests passing.


🎯 BREAKTHROUGH: Dynamic MCP Server Self-Provisioning

(January 13, 2026)

Dynamic_MCP_Server_Self_Provisioning — LLM Agents That Expand Their Own Capabilities

We have achieved a paradigm shift in agentic AI: an AI assistant that can discover, configure, and activate new tool servers autonomously through the Docker MCP Gateway, without CLI commands, config files, or restarts.

Live Demonstration

When asked about coupled modeling research papers, the LLM autonomously:

| Step | MCP Tool Used | Result |
|------|---------------|--------|
| 1. Discover | mcp-find | Found arxiv-mcp-server in catalog |
| 2. Configure | mcp-config-set | Set storage path |
| 3. Activate | mcp-add | Added 4 new tools live |
| 4. Execute | mcp-exec | Searched arXiv, returned papers |

No CLI. No config files. No restarts. Pure MCP tool orchestration.

This transforms the agent from a static tool user into a dynamic capability builder, recognizing what it needs and acquiring those capabilities autonomously.

Gateway management tools: mcp-find, mcp-add, mcp-remove, mcp-config-set, mcp-exec, mcp-create-profile, code-mode
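
The discover → configure → activate flow can be modeled with a minimal in-memory registry (a Python stand-in for the gateway catalog; the class and catalog contents are illustrative, not the gateway's implementation):

```python
# Toy catalog mimicking what mcp-find would search.
CATALOG = {"arxiv-mcp-server": ["search_papers", "download_paper"]}

class Gateway:
    """Tiny model of the gateway's management surface."""
    def __init__(self):
        self.active_tools = []
        self.config = {}

    def mcp_find(self, query):
        return [name for name in CATALOG if query in name]

    def mcp_config_set(self, server, key, value):
        self.config.setdefault(server, {})[key] = value

    def mcp_add(self, server):
        # Activation exposes the server's tools to the agent immediately.
        self.active_tools.extend(CATALOG[server])

gw = Gateway()
server = gw.mcp_find("arxiv")[0]
gw.mcp_config_set(server, "storage_path", "/tmp/papers")
gw.mcp_add(server)
print(gw.active_tools)  # ['search_papers', 'download_paper']
```

The point of the sketch: each step is itself a tool call, so the agent never leaves the MCP protocol to grow its own toolset.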


πŸš€ NEW: Advanced Future Work - MCP/RAG System Evolution

(January 6, 2026)

ADVANCED_FUTURE_WORK — Strategic Development Roadmap for Q2 2026 Funding Cycle

This comprehensive roadmap outlines the next evolutionary phase of the MCP/RAG system: intelligent, self-improving AI assistance that learns from operational history, understands visual system representations, and provides truly graph-aware semantic reasoning.

Three Transformational Initiatives

| Initiative | Impact | Timeline |
|------------|--------|----------|
| Multi-Modal Visual Understanding | High | Q2 2026 |
| Self-Learning from CI/CD History | Very High | Q2-Q3 2026 |
| True GraphRAG Fusion | Transformational | Q3 2026 |

Proof of Concept: GFS v16 Flowchart Analysis

The document includes a demonstration of multi-modal AI comprehension applied to the GFS v16 Global Model Parallel Sequencing flowchart. Key insights extracted:

  • 42 job nodes identified across three swim lanes (GDAS, Hybrid EnKF, GFS)
  • ~60 DEPENDS_ON relationships mapped from visual arrows
  • Critical synchronization point at +06 cycle hour between GDAS and GFS
  • Cascade failure analysis: A query like "What happens if eupd fails?" could traverse the entire downstream dependency chain

This single diagram encodes more operational knowledge than 50 pages of text.
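
The cascade-failure idea can be sketched as a transitive traversal over inverted DEPENDS_ON edges (the edge list below is a tiny illustrative fragment, not the graph extracted from the flowchart):

```python
# job -> jobs that directly depend on it (inverted DEPENDS_ON edges).
DOWNSTREAM = {"eupd": ["ecen"], "ecen": ["efcs"], "efcs": ["epos"]}

def cascade(failed_job: str) -> list:
    """Collect every job transitively downstream of a failed job."""
    affected, stack = set(), [failed_job]
    while stack:
        for nxt in DOWNSTREAM.get(stack.pop(), []):
            if nxt not in affected:
                affected.add(nxt)
                stack.append(nxt)
    return sorted(affected)

print(cascade("eupd"))  # ['ecen', 'efcs', 'epos']
```

Answering "What happens if eupd fails?" then reduces to one graph traversal, which is exactly what a text-only index cannot do.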

Estimated team requirement: 3-4 FTEs + LLM fine-tuning expertise


VS Code CLI Tunnel Command Reference

VSCODE_CODE_CLI_TUNNEL_REFERENCE — Comprehensive guide to code tunnel and related remote server commands for VS Code CLI 1.107.1.

The VS Code tunnel feature enables secure remote development through vscode.dev from anywhere, which is critical for accessing HPC login nodes, cloud VMs, and CI/CD environments without traditional SSH port forwarding.

Key Capabilities:

  • πŸ” Remote Tunnels - Access any machine via browser at vscode.dev/tunnel/<name>
  • βš™οΈ System Service Mode - Persistent always-on connections with code tunnel service install
  • πŸ”‘ Authentication - GitHub/Microsoft login with token-based automation support
  • πŸ“¦ Extension Management - Pre-install extensions on remote servers
  • πŸ–§ Local Web Server - Run VS Code web UI locally with code serve-web

HPC Use Case Example:

# On Hera login node
code tunnel --name hera-login --no-sleep

# Access from anywhere: https://vscode.dev/tunnel/hera-login

Essential reference for remote development workflows on RDHPCS platforms.


Machine and System Conditionals Reference (January 2026)

Machine_System_Conditionals — Comprehensive guide to all platform-specific conditionals in the Global Workflow codebase

This reference documents every location where the codebase performs conditional operations based on the HPC system or machine where the code executes. Critical for platform portability, debugging system-specific issues, and onboarding new HPC platforms.

Quick Reference

| Category | Count |
|----------|-------|
| Supported Platforms | 11 (Hera, Ursa, Orion, Hercules, WCOSS2, Gaea C5/C6, AWS/Azure/Google PW, Container) |
| Shell Detection Scripts | 11+ files with MACHINE_ID conditionals |
| Python Detection | hosts.py with Host class |
| Host YAML Configs | 11 files in workflow/hosts/ |

Key Files:

  • ush/detect_machine.sh — Primary machine detection (hostname + path-based)
  • ush/module-setup.sh — Platform-specific module loading
  • dev/workflow/hosts.py — Python Host class with SUPPORTED_HOSTS

Essential for developers adding new platform support or debugging platform-specific issues.
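
The hostname-based detection pattern can be sketched in Python (the regex patterns below are illustrative approximations, not the actual expressions in ush/detect_machine.sh):

```python
import re

# Illustrative hostname patterns per platform.
PATTERNS = {
    r"^hfe\d+": "hera",
    r"^[Oo]rion-login": "orion",
    r"^hercules-login": "hercules",
}

def detect_machine(hostname: str) -> str:
    """Return the first platform whose pattern matches the hostname,
    mirroring the detect-then-branch style used across the codebase."""
    for pattern, machine in PATTERNS.items():
        if re.match(pattern, hostname):
            return machine
    return "UNKNOWN"

print(detect_machine("hfe07"))         # hera
print(detect_machine("laptop.local"))  # UNKNOWN
```

Once MACHINE_ID is resolved this way, all downstream conditionals branch on a single canonical value rather than re-probing the host.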


🐳 NEW: Docker MCP Gateway Multi-User Architecture Analysis (January 12, 2026)

Docker_MCP_Gateway_MultiUser_Architecture — Comprehensive analysis of Docker MCP Gateway v0.35.0 architecture options for multi-user RDHPCS deployments

This document addresses the challenge of container accumulation when multiple SME developers access the MCP/RAG system via VS Code Remote Tunnels. After deep investigation of the Docker MCP Gateway source code, we discovered that container spawning per session is the intended design, not a bug.

Four Architecture Options Evaluated

| Option | Approach | Effort | Memory per User |
|--------|----------|--------|-----------------|
| A: type: remote | Gateway proxies to HTTP server | 4-6 hrs | ~200MB |
| B: Default + Cleanup | Accept container spawning, add cron | 30 min | ~2GB |
| C: Direct stdio | Skip gateway for VS Code | 0 | ~200MB |
| D: Hybrid ⭐ | stdio for VS Code, gateway for external | 4-6 hrs | ~200MB |

Recommended: Hybrid Architecture (Option D)

VS Code Sessions → Direct stdio (mcp.json) → Node.js process (~200MB)
External Clients → Gateway (type:remote) → HTTP MCP Server (:3000)
Both paths share → ChromaDB + Neo4j databases

Key Insight: VS Code Copilot works excellently with lightweight Node.js stdio processes. Reserve the Docker MCP Gateway for external HTTP clients (n8n, Claude Desktop, API consumers) where container-mediated access adds security value.
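
As a point of reference, the direct-stdio path would be wired up in a VS Code mcp.json roughly like this sketch (the server name and entry-point path are hypothetical; consult the full analysis for the actual configuration):

```json
{
  "servers": {
    "global-workflow-rag": {
      "type": "stdio",
      "command": "node",
      "args": ["/path/to/mcp-server/index.js"]
    }
  }
}
```

With this in place, VS Code launches the Node.js process directly per workspace, and no gateway container is involved.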

Implementation: Three-phase rollout starting with immediate gateway disabling for VS Code, followed by HTTP transport implementation for external clients.

Full analysis with source code references and implementation plan.


📄 NEW: Docker MCP Gateway Technical Paper (December 15, 2025)

Docker MCP Gateway: Enabling MCP-as-a-Service for Enterprise AI Integration (PDF, 11 pages)

This comprehensive technical paper documents how Docker MCP Gateway transforms the Model Context Protocol from a single-client development tool into enterprise-ready multi-client infrastructure. The gateway bridges stdio and HTTP/SSE transports, enabling multiple AI clients (VS Code Copilot, Claude Desktop, LangFlow) to share common MCP tools simultaneously.

Key Topics Covered:

  • 🔄 MCP Transport Mechanisms - Why stdio limits single-client usage and how SSE enables network access
  • 🏗️ Gateway Architecture - Protocol bridging, session management, and container lifecycle
  • 🔒 Security Model - Network isolation, authentication, and resource limits
  • 🚀 NOAA Implementation - Production deployment with 32 tools, ChromaDB (14,854 docs), Neo4j (85,894 relationships)
  • 📋 Lessons Learned - Docker CE compatibility, label format requirements, network trade-offs

📥 Download PDF

Paper authored December 15, 2025 | NOAA EMC Global Workflow MCP Team


🧠 NEW: Phase 2 Semantic Annotation Architecture (December 4, 2025)

Breakthrough Achievement: 85% reduction in AI false positives through SME-driven semantic annotations embedded directly in technical standards documentation.

The Problem We Solved

AI-generated EE2 compliance recommendations suffered from systematic false positives: the AI was recommending patterns not actually required by NCEP operational standards (e.g., set -eu when only set -x is mandated). Traditional approaches required code changes for every correction.

The Solution: Semantic Annotations

Semantic annotations are machine-readable knowledge embedded in RST documentation that teach AI systems which patterns to recommend and which to avoid:

.. mcp:anti_pattern:: adding_set_e_or_set_eu
   :severity: must_not
   :context: operational_scripts
   :sme_justification: Not present in EE2 standards or examples
   :evidence: standards.rst lines 588-595

Why This Matters for NOAA:

| Before (Phase 1) | After (Phase 2) |
|------------------|-----------------|
| 328 false positive violations | 48 legitimate violations |
| Hard-coded rules in JavaScript | SME-maintained RST annotations |
| Changes required programming | Zero code changes to update rules |
| No evidence trail | Complete traceability to EE2 source |

Documentation Suite

  • PHASE_2_HYBRID_ARCHITECTURE_SPECIFICATION - Complete technical specification of the hybrid architecture that generates runtime configuration from semantic embeddings. Covers the 5-component pipeline (EE2 Standards β†’ Annotations β†’ ChromaDB β†’ JSON Config β†’ Scan Tool), validation results, and scalability analysis. Essential reading for understanding how semantic intelligence achieves runtime performance.

  • SME_Training_QuickStart - Practical 2-hour training guide for Subject Matter Experts on creating and reviewing semantic annotations. Includes linguistic framework (for translators/language experts), the 7 MCP directive types, and hands-on exercises. Enables domain experts to maintain compliance intelligence without programming.

  • SME Training QuickStart Guide (PDF) - Printable version of the training materials for offline use and in-person training sessions.

Architectural Innovation

The "hybrid" pattern combines the best of two worlds:

┌─────────────────────────────────────────────────────────────┐
│  BUILD TIME: Semantic Intelligence                          │
│  ChromaDB embeddings + Neo4j relationships → JSON Config    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  RUNTIME: Static Performance                                │
│  Load JSON once → O(1) lookup per file → Zero DB queries    │
└─────────────────────────────────────────────────────────────┘

Result: Semantic understanding WITHOUT runtime database queries. Scan 647 files in 12 seconds with full evidence traceability.
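
The build-time/runtime split can be condensed into a short Python sketch (the annotation content mirrors the RST directive shown earlier; the real pipeline is far richer):

```python
import json

# Build time: compile annotations into a static JSON config once.
annotations = [{"id": "adding_set_e_or_set_eu",
                "severity": "must_not",
                "context": "operational_scripts"}]
config_json = json.dumps({a["id"]: a for a in annotations})

# Runtime: load the config once, then every check is a dict lookup.
rules = json.loads(config_json)

def check(pattern_id: str) -> str:
    rule = rules.get(pattern_id)  # O(1), no database query at scan time
    return rule["severity"] if rule else "not_governed"

print(check("adding_set_e_or_set_eu"))  # must_not
```

The expensive semantic work happens once at build time; the scan tool only ever pays for hash lookups, which is what makes 647 files in 12 seconds plausible.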

Impact on AI-Assisted Development

This architecture enables a new paradigm for expert-in-the-loop AI development:

  1. AI generates compliance recommendations using RAG-enhanced search
  2. SMEs review and identify false positives
  3. Annotations capture corrections in machine-readable form
  4. Pipeline regenerates configuration automatically
  5. AI learns without code changes

This is institutional knowledge preservation: capturing what experts know in a form that makes AI smarter.


NEW: RAG Embedding Space Theory (December 19, 2025)

RAG_manifolds — Dimensional Conformality in Vector Databases: The Mathematical Foundation of RAG Embedding Spaces

A deep dive into why query and document embeddings must inhabit the same metric space for semantic search to work. On the surface, the cosine similarity formula is undergraduate linear algebra, but the 768-dimensional feature spaces encode recursive linguistic structures, emergent semantic geometry, and the holistic paradox of meaning encoded in vectors. Covers the mathematical foundations (SBERT, DPR, RAG papers), the superposition hypothesis, and the philosophical implications of meaning-as-geometry.
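
The cosine similarity at the heart of this, computed over two vectors drawn from the same embedding space, is indeed compact:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors. The score is
    only meaningful when queries and documents share one embedding
    space, which is the conformality requirement discussed above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Everything subtle lives not in this formula but in how the 768 coordinates are learned, which is the essay's subject.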

"The feature spaces are indeed a true enigma of recursive and holistic complexities — basic on the surface, infinitely deep upon reflection."


🚀 Previous Update: Embedding Model Upgrade (November 5, 2025)

Successfully upgraded the RAG system from all-MiniLM-L6-v2 (384 dimensions) to all-mpnet-base-v2 (768 dimensions), delivering a 50-100% improvement in semantic search quality for domain-specific queries. Empirical testing revealed the previous model achieved only 0.174-0.411 similarity scores on critical workflow terms (below the 0.5 acceptable threshold), prompting an immediate zero-cost upgrade. The new v4 collection has reached 73% completion (532/730 documents), enabling more accurate contextual AI assistance for global-workflow development and operations; A/B testing and a production cutover are planned to complete the migration. Full progress report.

Development followed the Empirical Accuracy Principle: all technical claims verified through measurement rather than assumption, ensuring trustworthy AI-assisted development practices.


🎯 MCP Tool Architecture: 21-Tool Agentic AI Platform

NEW: Comprehensive documentation of the Model Context Protocol (MCP) server architecture that transforms GitHub Copilot from code completion to autonomous development assistance.

MCP_TOOL_ARCHITECTURE - Deep dive into the 21 specialized tools organized into 5 functional categories:

  • WorkflowInfoTools (3) - Foundation layer with instant structural awareness
  • CodeAnalysisTools (4) - Graph-based relationship intelligence via Neo4j
  • SemanticSearchTools (7) - RAG-enhanced knowledge retrieval with ChromaDB
  • OperationalTools (3) - Deep domain intelligence for HPC operations
  • GitHubTools (4) - Repository and project collaboration intelligence

Why This Matters: This architecture represents a paradigm shift from "AI that writes code" to "AI that understands systems." By combining filesystem analysis, graph databases (Neo4j), vector embeddings (ChromaDB), and semantic search, the MCP platform enables:

  • Autonomous research across documentation, code, and issues
  • Impact analysis before making changes (dependency graphs)
  • Compliance verification (EE2 standards) during development, not after
  • Operational intelligence with HPC-specific guidance
  • Collaborative awareness of ongoing work and project history

Configuration Modes:

  • full - All 21 tools (complete development environment)
  • core - 7 tools (minimal, no databases required)
  • rag - 17 tools (RAG without GitHub integration)

The Result: A fully-functional integrated agentic software development platform that doesn't just generate code - it understands architecture, follows standards, prevents breaking changes, and collaborates effectively. This is the future of weather model development at NOAA.

Documentation Status: Version 3.0.0 | Week 2 Consolidated Architecture | November 4, 2025


🔒 EE2 Compliance & Operational Readiness

NCEP Central Operations Compliance Analysis

Comprehensive EE2 compliance audits conducted using the MCP (Model Context Protocol) RAG infrastructure with hybrid semantic search (ChromaDB) and graph-based code analysis (Neo4j). These AI-assisted analyses examined hundreds of job scripts, execution scripts, and utility libraries to identify critical compliance gaps and provide production-ready remediation plans.

Global Workflow EE2 Analysis

  • EE2_COMPLIANCE_ANALYSIS_GLOBAL_WORKFLOW - 40+ page comprehensive audit of the global-workflow repository identifying top 5 critical compliance issues blocking operational deployment. Analysis covers 255+ files (172 job scripts, 83 execution scripts, utilities) with detailed remediation plans, production-ready code examples, and 14-week phased implementation timeline.

Key Findings:

  • Issue #1 (CRITICAL): Python error handling - 42 scripts lack try-except blocks
  • Issue #2 (HIGH): Shell error exits - && true pattern defeats error detection
  • Issue #3 (HIGH): Environment variable validation - ${PDY:-} defaults to empty
  • Issue #4 (MEDIUM-HIGH): Weak utility error handling - envsubst failures silent
  • Issue #5 (MEDIUM): Inconsistent set -e and missing trap handlers

Provenance: Generated via static analysis and MCP RAG tools examining NOAA-EMC/global-workflow fork using ChromaDB semantic search (730 docs) and Neo4j graph analysis (8709 relationships). Analysis date: November 3, 2025.
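
Two of the remediation patterns above (Issue #1, missing try-except blocks; Issue #3, empty-string environment defaults) translate naturally into a short Python sketch (the helper name is illustrative, not an existing workflow utility):

```python
import os

def require_env(name: str) -> str:
    """Fail fast on unset or empty variables, instead of the silent
    empty-string default produced by patterns like ${PDY:-}."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"FATAL: required variable {name} is unset or empty")
    return value

os.environ["PDY"] = "20251103"
print(require_env("PDY"))            # 20251103
try:
    require_env("CDUMP_MISSING")     # unset -> explicit, actionable failure
except RuntimeError as err:
    print(err)
```

The point is the failure mode: an invalid path built from an empty variable fails far from its cause, while explicit validation fails at the source with context.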

RRFS Workflow EE2 Analysis

  • EE2_COMPLIANCE_ANALYSIS_RRFS - Comprehensive EE2 compliance analysis of the RRFS (Rapid Refresh Forecast System) workflow repository. Examined 142+ files (26 jobs, 27 scripts, 35+ utilities, 54 Python modules) using MCP RAG tools. Key discovery: RRFS has better baseline compliance than global-workflow (consistent set -xue, custom error functions) but shares critical gaps. 10-week implementation plan with priority remediation targets.

Key Findings:

  • Issue #1 (CRITICAL): Missing err_chk function - All 26 job scripts call undefined function
  • Issue #2 (HIGH): Python error handling - Better structure than global-workflow but incomplete
  • Issue #3 (HIGH): Environment variable validation - Empty string defaults risk invalid paths
  • Issue #4 (MEDIUM-HIGH): No trap handlers - Resource leaks on failures
  • Issue #5 (MEDIUM): Insufficient error context - Good foundation needs enhancement

RRFS Advantages: Uses set -xue consistently (vs. set -e workarounds), custom print_err_msg_exit with caller context, filesystem operations use *_vrfy wrappers.

Provenance: Generated via MCP RAG hybrid analysis (semantic + graph) of NOAA-EMC/rrfs-workflow repository using ChromaDB vector search and Neo4j dependency mapping. Analysis date: November 3, 2025.


🚀 Advanced RAG & Graph Intelligence Infrastructure

Strategic Architecture Documents

The Global Workflow development infrastructure has evolved to incorporate state-of-the-art RAG (Retrieval-Augmented Generation) and Graph Database technologies, enabling sophisticated agentic AI capabilities for GFS software management and error analysis.

Core Infrastructure Documentation

  • README_PROVISIONING_V3.1_COMPLETE - Complete provisioning guide for the MCP RAG persistent infrastructure on ParallelWorks cloud platform. Covers ChromaDB 1.1.1 deployment, Node.js MCP server setup, LangFlow integration, and systemd service configuration for production-grade persistent storage architecture.

  • ENHANCED_INGESTION_ARCHITECTURE - Comprehensive design for Context7-inspired multi-source RAG ingestion across 50+ GFS submodules (3-5M LOC). Details the hybrid triple-store architecture combining ChromaDB (semantic search), Neo4j (graph relationships), and PostgreSQL (temporal data) for intelligent error diagnosis and code understanding.

  • CHROMADB_MIGRATION_COMPLETE - Technical documentation of the ChromaDB 0.4.x to 1.1.1 migration, including API compatibility updates, integration of the ChromaDB Node.js client, and resolution of embedding dimension mismatches for production stability.

Why Graph RAG for GFS Complexity?

The Challenge: The Global Forecast System represents one of the most complex software ecosystems in scientific computing:

  • 50+ interconnected repositories (UFS, GDAS, GSI, GOCART, MOM6, CICE, WW3, etc.)
  • 3-5 million lines of code across Fortran, Python, C/C++, and CMake
  • Deep dependency chains spanning atmospheric dynamics → ocean coupling → data assimilation → post-processing
  • Multi-component interactions that traditional documentation cannot capture

The Solution: Hybrid Graph + Vector RAG Architecture

Traditional vector-based RAG (ChromaDB alone) excels at semantic similarity but cannot answer structural questions:

  • ❌ "What components are affected if I change FV3 dynamics?"
  • ❌ "What's the dependency chain causing this compilation error?"
  • ❌ "Which CMakeLists.txt needs to link the GSW library?"
  • ❌ "Show me the call graph from model initialization to MPI communication"

Graph RAG (Neo4j + ChromaDB) enables these capabilities:

Error Analysis Workflow:
├─ Semantic Search (ChromaDB): Find similar errors and solutions
├─ Structural Analysis (Neo4j): Trace dependency chains and call graphs
├─ Temporal Context (PostgreSQL): Recent commits and regression patterns
└─ LLM Synthesis: Root cause + Fix instructions + Prevention recommendations
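
A condensed sketch of the synthesis step, merging semantic hits with a structural dependency trace (all names and scores below are mocked for illustration):

```python
# Mocked outputs of the two retrieval legs.
semantic_hits = [{"doc": "issue_482.md", "score": 0.81}]
dependency_chain = ["exglobal_forecast.sh", "forecast_predet.sh", "FV3"]

def synthesize(hits, chain):
    """Combine both evidence types into one payload handed to the LLM,
    so the diagnosis cites similar past errors AND the suspect chain."""
    return {
        "similar_errors": [h["doc"] for h in hits if h["score"] > 0.5],
        "suspect_chain": " -> ".join(chain),
    }

report = synthesize(semantic_hits, dependency_chain)
print(report["suspect_chain"])  # exglobal_forecast.sh -> forecast_predet.sh -> FV3
```

Neither leg alone answers "why did this fail": similarity finds precedent, the graph locates the failure in the call structure, and the synthesis fuses both.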

Agentic AI for GFS Software Management

The MCP (Model Context Protocol) server provides LLM agents with:

  1. Deep Code Understanding: Not just text search, but comprehension of component interactions
  2. Error Diagnosis: 10x faster debugging by combining similar past errors with structural impact analysis
  3. Impact Prediction: "What breaks if I change X?" before making changes
  4. Knowledge Retention: Institutional expertise captured in graph relationships
  5. Cross-Component Reasoning: Trace errors through the UFS → GSI → GDAS → GFS pipeline

Result: Transform debugging from "search documentation and guess" to "query knowledge graph and know."

Implementation Status

  • ✅ ChromaDB 1.1.1: Production vector database operational
  • ✅ Node.js MCP Server: 17 tools for workflow management and RAG search
  • ✅ LangFlow UI: Visual workflow builder for RAG pipelines
  • 🚧 Neo4j Graph DB: Phase 0 POC approved, weekend implementation planned
  • 📋 Enhanced Ingestion: Multi-source ingestion pipeline designed for 50+ repos

Next Milestone: Neo4j proof-of-concept demonstrating dependency graph queries that ChromaDB cannot answer.


📖 NCEPLIBS-BUFR Error Catching Initiative

Core Documentation

  • PR673_Comprehensive_Analysis - Complete technical analysis of PR #673 which introduced error catching capability to NCEPLIBS-bufr. This 50+ page analysis covers the architectural design using setjmp/longjmp, implementation details across 51 files, code review insights, testing strategy, and operational impact for NOAA's weather forecasting infrastructure.

  • ERROR_CATCHING_IMPLEMENTATION_PLAN - Detailed 17-week implementation plan for extending error catching to 24 additional I/O routines following the PR #673 pattern. The plan divides work into 4 phases by complexity level, includes automated testing frameworks, CI/CD strategies, and comprehensive quality assurance checklists.

  • additional_io_routines_for_error_catching - Comprehensive inventory of 38 additional I/O routines organized into 7 complexity levels for systematic error catching implementation. This reference document provides technical details, implementation priorities, and success metrics for achieving complete API coverage in the BUFR library.


🧪 Background Information on Cases Used in the CTest Framework

The CTest framework provides self-contained test cases for validating individual workflow components. Each test creates an isolated environment with staged inputs from nightly stable baseline runs, enabling independent testing and validation.

C48 Fixed Atmosphere-Only Tests (ATM)

C48 Coupled System Tests (S2SW)

C48 Ensemble Tests (S2SW_gfs)

  • C48_S2SWA_gefs-gefs_fcst_mem001_seg0
  • GEFS ensemble member 001 coupled forecast (48-hour segment)
    • Implemented GEFS ensemble member 001 forecast test
    • 17 input files with unique two-cycle pattern:
      • 13 atmosphere ICs from current cycle (12Z)
      • 3 restart files from previous cycle (06Z)
      • 1 wave prep file from current cycle (12Z)
    • 24 output files (ensemble forecast outputs)
    • GEFS requires different source cycles for ICs vs restarts
    • Special handling for mem001/ subdirectory structure
  • C48_S2SWA_gefs-gefs_fcst_mem001_seg0.yaml

Framework Features:

  • Self-contained test environments with isolated EXPDIR
  • Input staging from STAGED_CTESTS (stable nightly runs)
  • Consistent naming convention: CASE-JOB.yaml
  • Comprehensive validation with input/output file verification

CI Error Analyses (MCP-RAG Assisted)

Detailed root-cause analyses of CI failures, performed using the EIB MCP-RAG GraphRAG toolset. Each report includes execution chain tracing, environment variable dependency mapping, and an MCP tool call scorecard.


🔧 CI/CD & DevOps

GitLab CI/CD Pipeline

Jenkins Integration

GitHub & Jenkins Integration


πŸ€– AI/ML & Intelligent Tools

Model Context Protocol (MCP)

AI Development Tools


🌊 Workflow Management Systems

Rocoto Workflow Engine

CROW & EcFlow

🌐 Weather Modeling & Configuration

GCAFS (Global Composition/Chemistry Aerosol Forecast System)

  • GCAFS-Overview - Comprehensive analysis of NOAA's next-generation aerosol and air quality forecasting system. Documents GCAFS architecture, its relationship to global-workflow, development timeline (4,040 commits since 2016), key contributors (Barry Baker, Li Pan, Cory Martin), and operational readiness status. GCAFS represents the fourth major forecasting capability alongside GFS, GEFS, and SFS, integrating the GOCART model for aerosol transport/chemistry. Analysis date: January 30, 2026

Model Configuration


💻 HPC System Administration

MPMD & MPI Runtime Infrastructure

  • MPMD_MPI_Runtime_Infrastructure - Comprehensive documentation of the Multiple-Program, Multiple-Data (MPMD) execution framework and MPI runtime configuration across all 11 supported HPC platforms. Covers the core run_mpmd.sh orchestration script, platform-specific launcher configurations (Slurm srun --multi-prog vs PBS mpiexec cfp), MPI tuning parameters (Intel MPI, Cray MPICH, PMI2), network fabric details (InfiniBand, Slingshot, EFA), and the three-level job resource configuration chain. Essential reference for HPC operations, platform portability, and debugging parallel execution issues. Analysis date: January 30, 2026

  • MPMD & MPI Runtime Infrastructure Technical Paper (PDF, 17 pages) - Detailed LaTeX technical specification with architecture diagrams, algorithm pseudocode, platform comparison tables, MPI tuning parameters, and complete environment file appendices.
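
For readers unfamiliar with Slurm's MPMD mode: a --multi-prog launch takes a configuration file mapping MPI rank ranges to executables. A minimal sketch (the executable names are illustrative, not the workflow's actual binaries):

```
# mpmd.conf: <rank range> <program> [args]
0     ./atm_model.x
1-4   ./ocean_model.x
5-6   ./wave_model.x
```

This would be launched as srun -n 7 --multi-prog mpmd.conf, with each rank range running its own program inside a single MPI job, which is the pattern run_mpmd.sh orchestrates per platform.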

Resource Configuration: CROW vs Global-Workflow

  • Resource Configuration Comparison Technical Paper (PDF, 44 pages) - Comprehensive technical analysis comparing the declarative CROW system (2016-2020) with the current imperative Global-Workflow approach (2020-present). Includes TikZ architecture diagrams, algorithm pseudocode, detailed code examples from the CROW YAML DSL and current shell-based configuration, MPMD runtime integration analysis, validation pipeline recommendations, and architectural guidance for next-generation workflow infrastructure. Covers the complete resource specification lifecycle from definition through validation to runtime execution. Analysis date: January 30, 2026

System Configuration


🔬 Research & Theory

Scientific Computing


πŸ› Development & Debugging

Bug Fixes & Solutions

Development Process


📚 Quick Reference

Most Viewed Topics:

  • CI/CD Pipeline Architecture
  • Rocoto Workflow Management
  • MCP/RAG Integration
  • Jenkins Configuration
  • HPC System Setup

Latest Updates:

  • Phase 2 Semantic Annotation Architecture (December 2025)
  • SME Training for Semantic Annotations (December 2025)
  • Hybrid Build-Time/Runtime Compliance Validation
  • MCP Server RAG Enhancement
  • AI-Assisted Development Tools

This wiki is actively maintained. Last organized: December 2025
