Home - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Global Workflow Wiki

Welcome to the NOAA Global Workflow technical wiki. This knowledge base documents solutions, configurations, and insights for operating and developing the Global Forecast System workflow.

πŸ” NEW: ICSDIR_ROOT Removal Impact Analysis

(March 16, 2026)

ICSDIR_ROOT-Removal-Impact-Analysis — GraphRAG-assisted analysis of safely removing ICSDIR_ROOT from CI platform configs and case YAMLs

MCP GraphRAG tools (find_env_dependencies, get_code_context, search_documentation) were used to trace ICSDIR_ROOT through the full dependency chain: 6 platform configs → 25 CI case YAMLs → create_experiment.py → setup_expt.py → config.stage_ic.j2 → parm/stage/*.yaml.j2. Analysis confirms the variable is redundant with BASE_IC (defined in dev/workflow/hosts/<platform>.yaml), as config.stage_ic.j2 already has a built-in fallback. Documents 19 safe-to-remove cases, 8 non-standard exceptions, and required CTest/unit test updates.
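
The fallback behavior described above can be sketched in a few lines of Python (a toy stand-in for the Jinja2 default in config.stage_ic.j2; the helper name is hypothetical):

```python
def resolve_ics_root(env: dict, base_ic: str) -> str:
    """Prefer ICSDIR_ROOT if set and non-empty, otherwise fall back to
    BASE_IC. Variable names come from the analysis above; this helper
    itself is illustrative, not workflow code."""
    value = env.get("ICSDIR_ROOT", "").strip()
    return value if value else base_ic

# With ICSDIR_ROOT unset, the BASE_IC fallback applies:
print(resolve_ics_root({}, "/scratch/BASE_IC"))   # /scratch/BASE_IC
# An explicit ICSDIR_ROOT still wins:
print(resolve_ics_root({"ICSDIR_ROOT": "/tmp/ics"}, "/scratch/BASE_IC"))  # /tmp/ics
```

This is why removal is safe for the standard cases: deleting the variable simply exercises the fallback branch that already exists.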


☁️ NEW: Parallel Works RDHPCS Platform Dashboard

(March 6, 2026)

Parallel-Works-RDHPCS-Platform-Dashboard — Comprehensive inventory of the NOAA RDHPCS Hybrid Cloud: 35 clusters, $945K budget, storage across AWS/Google/Azure

Live-queried dashboard covering all compute clusters (35 total, 7 owned by Terry), cost & budget analysis ($91.6K of $945K spent across 8 groups), storage inventory (33 resources: buckets, NFS, Lustre, disks), networking (4 VPCs, 1 static IP), active sessions, ML workspaces, and platform configuration. Data collected automatically via 26 PW MCP tool queries against the Parallel Works REST API.


🔌 NEW: PW MCP Toolset Documentation — 26 Tools for Cloud Infrastructure AI

(March 6, 2026)

PW-MCP-Toolset-Documentation — Complete reference for the Parallel Works MCP Server: 26 tools across 7 categories

Documents every tool in the parallel-works-mcp server including authentication, compute management, cost analysis, storage (6 tools), networking (3 tools), workflows, and ML workspaces. Covers the Phase 37 SDD expansion (19 → 26 tools, 548 LOC added), API endpoint coverage map (24 endpoints), architecture diagram, and configuration guide. All tools live-tested and verified against the PW v7.15.1 API.


🔧 NEW: Neo4j GraphRAG Ingestion Pipeline

(March 5, 2026)

Neo4j-GraphRAG-Ingestion-Pipeline — Complete guide to how the 41K-node, 589K-relationship knowledge graph gets built from source code

Documents all 15+ ingestion scripts across 4 pipeline stages: Fortran/Python/Shell code graph creation, vector embeddings, documentation crawling, and hierarchical community detection with LLM summarization. Includes execution order, node/relationship counts, and architecture diagrams.


NEW: MCP-RAG Platform 32-Day Achievement Synopsis

(March 3, 2026)

MCP-RAG-Platform-32-Day-Achievement-Synopsis — Five major breakthroughs from Jan 30 – Mar 3, 2026 (v7.10.0 → v7.25.1)

Sixteen releases spanning hierarchical GraphRAG communities (1,036 nodes, 828 LLM summaries), cross-language graph unification (Shell→Fortran→Python), 5 new agentic MCP tools with session state tracking, NCEPLIBS documentation ingestion, and instruction file architecture that reduced agent context window usage by 35%. Neo4j relationships grew from ~485K to 589K; ChromaDB documents from ~60K to 66.5K; 14 SDD sessions completed with 0 abandoned.


οΏ½πŸ“ NEW: SDD Workflow β€” Concept to Code via CLI /plan Handoff

(February 24, 2026)

Interactive SVG diagram of the full Spec-Driven Development pipeline


This visual reference documents the end-to-end development process used by the EIB MCP-RAG platform, where Spec-Driven Development (SDD) governs the lifecycle from conceptual design through autonomous code implementation and back to human review.

Two Execution Modalities, One Session Model

| Lane | Steps | Actor |
|------|-------|-------|
| Human + AI (IDE) | 1. Discover → 2. Spec Design → 3. Resolve Decisions → 4. Prep Handoff | Interactive (VS Code Copilot) |
| CLI Agent (--yolo) | 5. /plan Decompose → 6. Start Session → 7. Implement + Record | Autonomous (Copilot CLI) |
| Validation Gates | generate-tool-docs --check → npm test → health check → Docker rebuild → Gateway verify | Automated |
| Persistent State | active_session.json, history.jsonl, checkpoints/, workflows/ | Phase 31 SessionManager |

The diagram illustrates 8 numbered steps across swim lanes, with the CLI /plan handoff (Step 4 → 5) as the bridge between human-guided design and autonomous implementation. Both modalities share the same filesystem-based session state (Phase 31 model), enabling real-time monitoring from either side.

Includes embedded description with validation pipeline details, persistent state lifecycle, and modality-aware execution principles. If the inline preview is blocked by your browser, open the HTML file directly.
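
The shared session state can be illustrated with a minimal Python sketch, assuming a simplified record shape (the field names here are illustrative, not the Phase 31 SessionManager's actual schema):

```python
import json
import tempfile
from pathlib import Path

def record_step(state_dir: Path, step: str) -> dict:
    """Append a step to history.jsonl and refresh active_session.json,
    so either modality (IDE or CLI agent) can observe progress by
    reading the same files."""
    session = {"current_step": step}
    (state_dir / "active_session.json").write_text(json.dumps(session))
    with (state_dir / "history.jsonl").open("a") as fh:
        fh.write(json.dumps({"step": step}) + "\n")
    return session

state = Path(tempfile.mkdtemp())
record_step(state, "plan")
record_step(state, "implement")
print(json.loads((state / "active_session.json").read_text())["current_step"])  # implement
```

Because the state lives on the filesystem rather than in one process, a monitor on either side only needs read access to follow the session.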


πŸ—οΈ MILESTONE: GraphRAG Hierarchical Community Materialization

(February 24, 2026)

GraphRAG-Hierarchical-Community-Materialization — 4-Level Navigable Knowledge Graph of the Global Workflow

The NOAA Global Workflow now has a hierarchical community structure in Neo4j — 1,036 Community nodes across 4 levels, enabling multi-resolution understanding of how 40,000+ code entities organize into subsystems and how those subsystems interact.

Key Results

| Metric | Before | After |
|--------|--------|-------|
| Community nodes | 0 | 1,036 (L0: 694, L1: 175, L2: 86, L3: 81) |
| MEMBER_OF relationships | 0 | 21,559 |
| PARENT_OF hierarchy | 0 | 978 edges (valid acyclic tree) |
| INTERACTS_WITH edges | 0 | 1,297 cross-community links |
| Community summaries | 63 flat | 828 hierarchical (4 levels) |

What This Means

Ask "How does data assimilation interact with the forecast model?" and the system traverses L3 → L2 → L1 → L0 communities, returning subsystem boundaries, interaction strengths, and member-level detail, not just text similarity matches. This is Graph-Guided Semantic Retrieval operating on a structural map of one of the most complex computational workflows on Earth.
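
The multi-resolution drill-down can be sketched in plain Python over a toy two-level slice of the hierarchy (the community names and members below are invented for illustration; the real traversal runs against Neo4j):

```python
# Toy PARENT_OF / MEMBER_OF edges standing in for the materialized graph.
PARENT_OF = {"L1:modeling": ["L0:forecast", "L0:data_assim"]}
MEMBER_OF = {"L0:forecast": ["fv3_run.sh"], "L0:data_assim": ["gsi_main.f90"]}

def drill_down(community: str) -> list:
    """Walk PARENT_OF edges down to leaf communities, then return members."""
    children = PARENT_OF.get(community)
    if children is None:  # leaf community: return its member entities
        return MEMBER_OF.get(community, [])
    members = []
    for child in children:
        members.extend(drill_down(child))
    return members

print(drill_down("L1:modeling"))  # ['fv3_run.sh', 'gsi_main.f90']
```

A query can stop at any level, which is what makes the structure multi-resolution: coarse answers come from high-level communities, fine answers from leaf members.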

Phase 24E-5 — Spec-Driven Development, dual-agent execution + independent verification, 6-minute implementation, 6/6 tests passing.


🎯 BREAKTHROUGH: Dynamic MCP Server Self-Provisioning

(January 13, 2026)

Dynamic_MCP_Server_Self_Provisioning — LLM Agents That Expand Their Own Capabilities

We have achieved a paradigm shift in agentic AI: an AI assistant that can discover, configure, and activate new tool servers autonomously through the Docker MCP Gateway, without CLI commands, config files, or restarts.

Live Demonstration

When asked about coupled modeling research papers, the LLM autonomously:

| Step | MCP Tool Used | Result |
|------|---------------|--------|
| 1. Discover | mcp-find | Found arxiv-mcp-server in catalog |
| 2. Configure | mcp-config-set | Set storage path |
| 3. Activate | mcp-add | Added 4 new tools live |
| 4. Execute | mcp-exec | Searched arXiv, returned papers |

No CLI. No config files. No restarts. Pure MCP tool orchestration.

This transforms the agent from a static tool user into a dynamic capability builder, recognizing what it needs and acquiring those capabilities autonomously.

Gateway management tools: mcp-find, mcp-add, mcp-remove, mcp-config-set, mcp-exec, mcp-create-profile, code-mode
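
The discover → configure → activate flow can be modeled with a minimal in-memory registry (a Python stand-in for the gateway catalog; the class and catalog contents are illustrative, not the gateway's implementation):

```python
# Toy catalog mimicking what mcp-find would search.
CATALOG = {"arxiv-mcp-server": ["search_papers", "download_paper"]}

class Gateway:
    """Tiny model of the gateway's management surface."""
    def __init__(self):
        self.active_tools = []
        self.config = {}

    def mcp_find(self, query):
        return [name for name in CATALOG if query in name]

    def mcp_config_set(self, server, key, value):
        self.config.setdefault(server, {})[key] = value

    def mcp_add(self, server):
        # Activation exposes the server's tools to the agent immediately.
        self.active_tools.extend(CATALOG[server])

gw = Gateway()
server = gw.mcp_find("arxiv")[0]
gw.mcp_config_set(server, "storage_path", "/tmp/papers")
gw.mcp_add(server)
print(gw.active_tools)  # ['search_papers', 'download_paper']
```

The point of the sketch: each step is itself a tool call, so the agent never leaves the MCP protocol to grow its own toolset.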


πŸš€ NEW: Advanced Future Work - MCP/RAG System Evolution

(January 6, 2026)

ADVANCED_FUTURE_WORK — Strategic Development Roadmap for Q2 2026 Funding Cycle

This comprehensive roadmap outlines the next evolutionary phase of the MCP/RAG system: intelligent, self-improving AI assistance that learns from operational history, understands visual system representations, and provides truly graph-aware semantic reasoning.

Three Transformational Initiatives

| Initiative | Impact | Timeline |
|------------|--------|----------|
| Multi-Modal Visual Understanding | High | Q2 2026 |
| Self-Learning from CI/CD History | Very High | Q2-Q3 2026 |
| True GraphRAG Fusion | Transformational | Q3 2026 |

Proof of Concept: GFS v16 Flowchart Analysis

The document includes a demonstration of multi-modal AI comprehension applied to the GFS v16 Global Model Parallel Sequencing flowchart. Key insights extracted:

  • 42 job nodes identified across three swim lanes (GDAS, Hybrid EnKF, GFS)
  • ~60 DEPENDS_ON relationships mapped from visual arrows
  • Critical synchronization point at +06 cycle hour between GDAS and GFS
  • Cascade failure analysis: A query like "What happens if eupd fails?" could traverse the entire downstream dependency chain

This single diagram encodes more operational knowledge than 50 pages of text.
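
The cascade-failure idea can be sketched as a transitive traversal over inverted DEPENDS_ON edges (the edge list below is a tiny illustrative fragment, not the graph extracted from the flowchart):

```python
# job -> jobs that directly depend on it (inverted DEPENDS_ON edges).
DOWNSTREAM = {"eupd": ["ecen"], "ecen": ["efcs"], "efcs": ["epos"]}

def cascade(failed_job: str) -> list:
    """Collect every job transitively downstream of a failed job."""
    affected, stack = set(), [failed_job]
    while stack:
        for nxt in DOWNSTREAM.get(stack.pop(), []):
            if nxt not in affected:
                affected.add(nxt)
                stack.append(nxt)
    return sorted(affected)

print(cascade("eupd"))  # ['ecen', 'efcs', 'epos']
```

Answering "What happens if eupd fails?" then reduces to one graph traversal, which is exactly what a text-only index cannot do.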

Estimated team requirement: 3-4 FTEs + LLM fine-tuning expertise


VS Code CLI Tunnel Command Reference

VSCODE_CODE_CLI_TUNNEL_REFERENCE — Comprehensive guide to code tunnel and related remote server commands for VS Code CLI 1.107.1.

The VS Code tunnel feature enables secure remote development through vscode.dev from anywhere, which is critical for accessing HPC login nodes, cloud VMs, and CI/CD environments without traditional SSH port forwarding.

Key Capabilities:

  • πŸ” Remote Tunnels - Access any machine via browser at vscode.dev/tunnel/<name>
  • βš™οΈ System Service Mode - Persistent always-on connections with code tunnel service install
  • πŸ”‘ Authentication - GitHub/Microsoft login with token-based automation support
  • πŸ“¦ Extension Management - Pre-install extensions on remote servers
  • πŸ–§ Local Web Server - Run VS Code web UI locally with code serve-web

HPC Use Case Example:

# On Hera login node
code tunnel --name hera-login --no-sleep

# Access from anywhere: https://vscode.dev/tunnel/hera-login

Essential reference for remote development workflows on RDHPCS platforms.


Machine and System Conditionals Reference (January 2026)

Machine_System_Conditionals — Comprehensive guide to all platform-specific conditionals in the Global Workflow codebase

This reference documents every location where the codebase performs conditional operations based on the HPC system or machine where the code executes. Critical for platform portability, debugging system-specific issues, and onboarding new HPC platforms.

Quick Reference

| Category | Count |
|----------|-------|
| Supported Platforms | 11 (Hera, Ursa, Orion, Hercules, WCOSS2, Gaea C5/C6, AWS/Azure/Google PW, Container) |
| Shell Detection Scripts | 11+ files with MACHINE_ID conditionals |
| Python Detection | hosts.py with Host class |
| Host YAML Configs | 11 files in workflow/hosts/ |

Key Files:

  • ush/detect_machine.sh — Primary machine detection (hostname + path-based)
  • ush/module-setup.sh — Platform-specific module loading
  • dev/workflow/hosts.py — Python Host class with SUPPORTED_HOSTS

Essential for developers adding new platform support or debugging platform-specific issues.
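
The hostname-based detection pattern can be sketched in Python (the regex patterns below are illustrative approximations, not the actual expressions in ush/detect_machine.sh):

```python
import re

# Illustrative hostname patterns per platform.
PATTERNS = {
    r"^hfe\d+": "hera",
    r"^[Oo]rion-login": "orion",
    r"^hercules-login": "hercules",
}

def detect_machine(hostname: str) -> str:
    """Return the first platform whose pattern matches the hostname,
    mirroring the detect-then-branch style used across the codebase."""
    for pattern, machine in PATTERNS.items():
        if re.match(pattern, hostname):
            return machine
    return "UNKNOWN"

print(detect_machine("hfe07"))         # hera
print(detect_machine("laptop.local"))  # UNKNOWN
```

Once MACHINE_ID is resolved this way, all downstream conditionals branch on a single canonical value rather than re-probing the host.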


🐳 NEW: Docker MCP Gateway Multi-User Architecture Analysis (January 12, 2026)

Docker_MCP_Gateway_MultiUser_Architecture — Comprehensive analysis of Docker MCP Gateway v0.35.0 architecture options for multi-user RDHPCS deployments

This document addresses the challenge of container accumulation when multiple SME developers access the MCP/RAG system via VS Code Remote Tunnels. After deep investigation of the Docker MCP Gateway source code, we discovered that container spawning per session is the intended design, not a bug.

Four Architecture Options Evaluated

| Option | Approach | Effort | Memory per User |
|--------|----------|--------|-----------------|
| A: type: remote | Gateway proxies to HTTP server | 4-6 hrs | ~200MB |
| B: Default + Cleanup | Accept container spawning, add cron | 30 min | ~2GB |
| C: Direct stdio | Skip gateway for VS Code | 0 | ~200MB |
| D: Hybrid ⭐ | stdio for VS Code, gateway for external | 4-6 hrs | ~200MB |

Recommended: Hybrid Architecture (Option D)

VS Code Sessions → Direct stdio (mcp.json) → Node.js process (~200MB)
External Clients → Gateway (type:remote) → HTTP MCP Server (:3000)
Both paths share → ChromaDB + Neo4j databases

Key Insight: VS Code Copilot works excellently with lightweight Node.js stdio processes. Reserve the Docker MCP Gateway for external HTTP clients (n8n, Claude Desktop, API consumers) where container-mediated access adds security value.
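
As a point of reference, the direct-stdio path would be wired up in a VS Code mcp.json roughly like this sketch (the server name and entry-point path are hypothetical; consult the full analysis for the actual configuration):

```json
{
  "servers": {
    "global-workflow-rag": {
      "type": "stdio",
      "command": "node",
      "args": ["/path/to/mcp-server/index.js"]
    }
  }
}
```

With this in place, VS Code launches the Node.js process directly per workspace, and no gateway container is involved.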

Implementation: Three-phase rollout starting with immediate gateway disabling for VS Code, followed by HTTP transport implementation for external clients.

Full analysis with source code references and implementation plan.


📄 NEW: Docker MCP Gateway Technical Paper (December 15, 2025)

Docker MCP Gateway: Enabling MCP-as-a-Service for Enterprise AI Integration (PDF, 11 pages)

This comprehensive technical paper documents how Docker MCP Gateway transforms the Model Context Protocol from a single-client development tool into enterprise-ready multi-client infrastructure. The gateway bridges stdio and HTTP/SSE transports, enabling multiple AI clients (VS Code Copilot, Claude Desktop, LangFlow) to share common MCP tools simultaneously.

Key Topics Covered:

  • 🔄 MCP Transport Mechanisms - Why stdio limits single-client usage and how SSE enables network access
  • 🏗️ Gateway Architecture - Protocol bridging, session management, and container lifecycle
  • 🔒 Security Model - Network isolation, authentication, and resource limits
  • 🚀 NOAA Implementation - Production deployment with 32 tools, ChromaDB (14,854 docs), Neo4j (85,894 relationships)
  • 📋 Lessons Learned - Docker CE compatibility, label format requirements, network trade-offs

📥 Download PDF

Paper authored December 15, 2025 | NOAA EMC Global Workflow MCP Team


🧠 NEW: Phase 2 Semantic Annotation Architecture (December 4, 2025)

Breakthrough Achievement: 85% reduction in AI false positives through SME-driven semantic annotations embedded directly in technical standards documentation.

The Problem We Solved

AI-generated EE2 compliance recommendations suffered from systematic false positives: the AI was recommending patterns not actually required by NCEP operational standards (e.g., set -eu when only set -x is mandated). Traditional approaches required code changes for every correction.

The Solution: Semantic Annotations

Semantic annotations are machine-readable knowledge embedded in RST documentation that teach AI systems which patterns to recommend and which to avoid:

.. mcp:anti_pattern:: adding_set_e_or_set_eu
   :severity: must_not
   :context: operational_scripts
   :sme_justification: Not present in EE2 standards or examples
   :evidence: standards.rst lines 588-595

Why This Matters for NOAA:

| Before (Phase 1) | After (Phase 2) |
|------------------|-----------------|
| 328 false positive violations | 48 legitimate violations |
| Hard-coded rules in JavaScript | SME-maintained RST annotations |
| Changes required programming | Zero code changes to update rules |
| No evidence trail | Complete traceability to EE2 source |

Documentation Suite

  • PHASE_2_HYBRID_ARCHITECTURE_SPECIFICATION - Complete technical specification of the hybrid architecture that generates runtime configuration from semantic embeddings. Covers the 5-component pipeline (EE2 Standards β†’ Annotations β†’ ChromaDB β†’ JSON Config β†’ Scan Tool), validation results, and scalability analysis. Essential reading for understanding how semantic intelligence achieves runtime performance.

  • SME_Training_QuickStart - Practical 2-hour training guide for Subject Matter Experts on creating and reviewing semantic annotations. Includes linguistic framework (for translators/language experts), the 7 MCP directive types, and hands-on exercises. Enables domain experts to maintain compliance intelligence without programming.

  • SME Training QuickStart Guide (PDF) - Printable version of the training materials for offline use and in-person training sessions.

Architectural Innovation

The "hybrid" pattern combines the best of two worlds:

┌─────────────────────────────────────────────────────────────┐
│  BUILD TIME: Semantic Intelligence                          │
│  ChromaDB embeddings + Neo4j relationships → JSON Config    │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  RUNTIME: Static Performance                                │
│  Load JSON once → O(1) lookup per file → Zero DB queries    │
└─────────────────────────────────────────────────────────────┘

Result: Semantic understanding WITHOUT runtime database queries. Scan 647 files in 12 seconds with full evidence traceability.
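
The build-time/runtime split can be condensed into a short Python sketch (the annotation content mirrors the RST directive shown earlier; the real pipeline is far richer):

```python
import json

# Build time: compile annotations into a static JSON config once.
annotations = [{"id": "adding_set_e_or_set_eu",
                "severity": "must_not",
                "context": "operational_scripts"}]
config_json = json.dumps({a["id"]: a for a in annotations})

# Runtime: load the config once, then every check is a dict lookup.
rules = json.loads(config_json)

def check(pattern_id: str) -> str:
    rule = rules.get(pattern_id)  # O(1), no database query at scan time
    return rule["severity"] if rule else "not_governed"

print(check("adding_set_e_or_set_eu"))  # must_not
```

The expensive semantic work happens once at build time; the scan tool only ever pays for hash lookups, which is what makes 647 files in 12 seconds plausible.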

Impact on AI-Assisted Development

This architecture enables a new paradigm for expert-in-the-loop AI development:

  1. AI generates compliance recommendations using RAG-enhanced search
  2. SMEs review and identify false positives
  3. Annotations capture corrections in machine-readable form
  4. Pipeline regenerates configuration automatically
  5. AI learns without code changes

This is institutional knowledge preservation: capturing what experts know in a form that makes AI smarter.


NEW: RAG Embedding Space Theory (December 19, 2025)

RAG_manifolds — Dimensional Conformality in Vector Databases: The Mathematical Foundation of RAG Embedding Spaces

A deep dive into why query and document embeddings must inhabit the same metric space for semantic search to work. On the surface, the cosine similarity formula is undergraduate linear algebra, but the 768-dimensional feature spaces encode recursive linguistic structures, emergent semantic geometry, and the holistic paradox of meaning encoded in vectors. Covers the mathematical foundations (SBERT, DPR, RAG papers), the superposition hypothesis, and the philosophical implications of meaning-as-geometry.
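
The cosine similarity at the heart of this, computed over two vectors drawn from the same embedding space, is indeed compact:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors. The score is
    only meaningful when queries and documents share one embedding
    space, which is the conformality requirement discussed above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Everything subtle lives not in this formula but in how the 768 coordinates are learned, which is the essay's subject.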

"The feature spaces are indeed a true enigma of recursive and holistic complexities — basic on the surface, infinitely deep upon reflection."


🚀 Previous Update: Embedding Model Upgrade (November 5, 2025)

Successfully upgraded the RAG system from all-MiniLM-L6-v2 (384 dimensions) to all-mpnet-base-v2 (768 dimensions), delivering a 50-100% improvement in semantic search quality for domain-specific queries. Empirical testing revealed the previous model achieved only 0.174-0.411 similarity scores on critical workflow terms (below the 0.5 acceptable threshold), prompting an immediate zero-cost upgrade. The new v4 collection has reached 73% completion (532/730 documents), enabling more accurate contextual AI assistance for global-workflow development and operations; A/B testing and a production cutover are planned to complete the migration. Full progress report.

Development followed the Empirical Accuracy Principle: all technical claims verified through measurement rather than assumption, ensuring trustworthy AI-assisted development practices.


🎯 MCP Tool Architecture: 21-Tool Agentic AI Platform

NEW: Comprehensive documentation of the Model Context Protocol (MCP) server architecture that transforms GitHub Copilot from code completion to autonomous development assistance.

MCP_TOOL_ARCHITECTURE - Deep dive into the 21 specialized tools organized into 5 functional categories:

  • WorkflowInfoTools (3) - Foundation layer with instant structural awareness
  • CodeAnalysisTools (4) - Graph-based relationship intelligence via Neo4j
  • SemanticSearchTools (7) - RAG-enhanced knowledge retrieval with ChromaDB
  • OperationalTools (3) - Deep domain intelligence for HPC operations
  • GitHubTools (4) - Repository and project collaboration intelligence

Why This Matters: This architecture represents a paradigm shift from "AI that writes code" to "AI that understands systems." By combining filesystem analysis, graph databases (Neo4j), vector embeddings (ChromaDB), and semantic search, the MCP platform enables:

  • Autonomous research across documentation, code, and issues
  • Impact analysis before making changes (dependency graphs)
  • Compliance verification (EE2 standards) during development, not after
  • Operational intelligence with HPC-specific guidance
  • Collaborative awareness of ongoing work and project history

Configuration Modes:

  • full - All 21 tools (complete development environment)
  • core - 7 tools (minimal, no databases required)
  • rag - 17 tools (RAG without GitHub integration)

The Result: A fully-functional integrated agentic software development platform that doesn't just generate code - it understands architecture, follows standards, prevents breaking changes, and collaborates effectively. This is the future of weather model development at NOAA.

Documentation Status: Version 3.0.0 | Week 2 Consolidated Architecture | November 4, 2025


🔒 EE2 Compliance & Operational Readiness

NCEP Central Operations Compliance Analysis

Comprehensive EE2 compliance audits conducted using the MCP (Model Context Protocol) RAG infrastructure with hybrid semantic search (ChromaDB) and graph-based code analysis (Neo4j). These AI-assisted analyses examined hundreds of job scripts, execution scripts, and utility libraries to identify critical compliance gaps and provide production-ready remediation plans.

Global Workflow EE2 Analysis

  • EE2_COMPLIANCE_ANALYSIS_GLOBAL_WORKFLOW - 40+ page comprehensive audit of the global-workflow repository identifying top 5 critical compliance issues blocking operational deployment. Analysis covers 255+ files (172 job scripts, 83 execution scripts, utilities) with detailed remediation plans, production-ready code examples, and 14-week phased implementation timeline.

Key Findings:

  • Issue #1 (CRITICAL): Python error handling - 42 scripts lack try-except blocks
  • Issue #2 (HIGH): Shell error exits - && true pattern defeats error detection
  • Issue #3 (HIGH): Environment variable validation - ${PDY:-} defaults to empty
  • Issue #4 (MEDIUM-HIGH): Weak utility error handling - envsubst failures silent
  • Issue #5 (MEDIUM): Inconsistent set -e and missing trap handlers

Provenance: Generated via static analysis and MCP RAG tools examining NOAA-EMC/global-workflow fork using ChromaDB semantic search (730 docs) and Neo4j graph analysis (8709 relationships). Analysis date: November 3, 2025.
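
Two of the remediation patterns above (Issue #1, missing try-except blocks; Issue #3, empty-string environment defaults) translate naturally into a short Python sketch (the helper name is illustrative, not an existing workflow utility):

```python
import os

def require_env(name: str) -> str:
    """Fail fast on unset or empty variables, instead of the silent
    empty-string default produced by patterns like ${PDY:-}."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"FATAL: required variable {name} is unset or empty")
    return value

os.environ["PDY"] = "20251103"
print(require_env("PDY"))            # 20251103
try:
    require_env("CDUMP_MISSING")     # unset -> explicit, actionable failure
except RuntimeError as err:
    print(err)
```

The point is the failure mode: an invalid path built from an empty variable fails far from its cause, while explicit validation fails at the source with context.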

RRFS Workflow EE2 Analysis

  • EE2_COMPLIANCE_ANALYSIS_RRFS - Comprehensive EE2 compliance analysis of the RRFS (Rapid Refresh Forecast System) workflow repository. Examined 142+ files (26 jobs, 27 scripts, 35+ utilities, 54 Python modules) using MCP RAG tools. Key discovery: RRFS has better baseline compliance than global-workflow (consistent set -xue, custom error functions) but shares critical gaps. 10-week implementation plan with priority remediation targets.

Key Findings:

  • Issue #1 (CRITICAL): Missing err_chk function - All 26 job scripts call undefined function
  • Issue #2 (HIGH): Python error handling - Better structure than global-workflow but incomplete
  • Issue #3 (HIGH): Environment variable validation - Empty string defaults risk invalid paths
  • Issue #4 (MEDIUM-HIGH): No trap handlers - Resource leaks on failures
  • Issue #5 (MEDIUM): Insufficient error context - Good foundation needs enhancement

RRFS Advantages: Uses set -xue consistently (vs. set -e workarounds), custom print_err_msg_exit with caller context, filesystem operations use *_vrfy wrappers.

Provenance: Generated via MCP RAG hybrid analysis (semantic + graph) of NOAA-EMC/rrfs-workflow repository using ChromaDB vector search and Neo4j dependency mapping. Analysis date: November 3, 2025.


🚀 Advanced RAG & Graph Intelligence Infrastructure

Strategic Architecture Documents

The Global Workflow development infrastructure has evolved to incorporate state-of-the-art RAG (Retrieval-Augmented Generation) and Graph Database technologies, enabling sophisticated agentic AI capabilities for GFS software management and error analysis.

Core Infrastructure Documentation

  • README_PROVISIONING_V3.1_COMPLETE - Complete provisioning guide for the MCP RAG persistent infrastructure on ParallelWorks cloud platform. Covers ChromaDB 1.1.1 deployment, Node.js MCP server setup, LangFlow integration, and systemd service configuration for production-grade persistent storage architecture.

  • ENHANCED_INGESTION_ARCHITECTURE - Comprehensive design for Context7-inspired multi-source RAG ingestion across 50+ GFS submodules (3-5M LOC). Details the hybrid triple-store architecture combining ChromaDB (semantic search), Neo4j (graph relationships), and PostgreSQL (temporal data) for intelligent error diagnosis and code understanding.

  • CHROMADB_MIGRATION_COMPLETE - Technical documentation of the ChromaDB 0.4.x to 1.1.1 migration, including API compatibility updates, integration of the ChromaDB Node.js client, and resolution of embedding dimension mismatches for production stability.

Why Graph RAG for GFS Complexity?

The Challenge: The Global Forecast System represents one of the most complex software ecosystems in scientific computing:

  • 50+ interconnected repositories (UFS, GDAS, GSI, GOCART, MOM6, CICE, WW3, etc.)
  • 3-5 million lines of code across Fortran, Python, C/C++, and CMake
  • Deep dependency chains spanning atmospheric dynamics → ocean coupling → data assimilation → post-processing
  • Multi-component interactions that traditional documentation cannot capture

The Solution: Hybrid Graph + Vector RAG Architecture

Traditional vector-based RAG (ChromaDB alone) excels at semantic similarity but cannot answer structural questions:

  • ❌ "What components are affected if I change FV3 dynamics?"
  • ❌ "What's the dependency chain causing this compilation error?"
  • ❌ "Which CMakeLists.txt needs to link the GSW library?"
  • ❌ "Show me the call graph from model initialization to MPI communication"

Graph RAG (Neo4j + ChromaDB) enables these capabilities:

Error Analysis Workflow:
├─ Semantic Search (ChromaDB): Find similar errors and solutions
├─ Structural Analysis (Neo4j): Trace dependency chains and call graphs
├─ Temporal Context (PostgreSQL): Recent commits and regression patterns
└─ LLM Synthesis: Root cause + Fix instructions + Prevention recommendations
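
A condensed sketch of the synthesis step, merging semantic hits with a structural dependency trace (all names and scores below are mocked for illustration):

```python
# Mocked outputs of the two retrieval legs.
semantic_hits = [{"doc": "issue_482.md", "score": 0.81}]
dependency_chain = ["exglobal_forecast.sh", "forecast_predet.sh", "FV3"]

def synthesize(hits, chain):
    """Combine both evidence types into one payload handed to the LLM,
    so the diagnosis cites similar past errors AND the suspect chain."""
    return {
        "similar_errors": [h["doc"] for h in hits if h["score"] > 0.5],
        "suspect_chain": " -> ".join(chain),
    }

report = synthesize(semantic_hits, dependency_chain)
print(report["suspect_chain"])  # exglobal_forecast.sh -> forecast_predet.sh -> FV3
```

Neither leg alone answers "why did this fail": similarity finds precedent, the graph locates the failure in the call structure, and the synthesis fuses both.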

Agentic AI for GFS Software Management

The MCP (Model Context Protocol) server provides LLM agents with:

  1. Deep Code Understanding: Not just text search, but comprehension of component interactions
  2. Error Diagnosis: 10x faster debugging by combining similar past errors with structural impact analysis
  3. Impact Prediction: "What breaks if I change X?" before making changes
  4. Knowledge Retention: Institutional expertise captured in graph relationships
  5. Cross-Component Reasoning: Trace errors through the UFS → GSI → GDAS → GFS pipeline

Result: Transform debugging from "search documentation and guess" to "query knowledge graph and know."

Implementation Status

  • ✅ ChromaDB 1.1.1: Production vector database operational
  • ✅ Node.js MCP Server: 17 tools for workflow management and RAG search
  • ✅ LangFlow UI: Visual workflow builder for RAG pipelines
  • 🚧 Neo4j Graph DB: Phase 0 POC approved, weekend implementation planned
  • 📋 Enhanced Ingestion: Multi-source ingestion pipeline designed for 50+ repos

Next Milestone: Neo4j proof-of-concept demonstrating dependency graph queries that ChromaDB cannot answer.


📖 NCEPLIBS-BUFR Error Catching Initiative

Core Documentation

  • PR673_Comprehensive_Analysis - Complete technical analysis of PR #673 which introduced error catching capability to NCEPLIBS-bufr. This 50+ page analysis covers the architectural design using setjmp/longjmp, implementation details across 51 files, code review insights, testing strategy, and operational impact for NOAA's weather forecasting infrastructure.

  • ERROR_CATCHING_IMPLEMENTATION_PLAN - Detailed 17-week implementation plan for extending error catching to 24 additional I/O routines following the PR #673 pattern. The plan divides work into 4 phases by complexity level, includes automated testing frameworks, CI/CD strategies, and comprehensive quality assurance checklists.

  • additional_io_routines_for_error_catching - Comprehensive inventory of 38 additional I/O routines organized into 7 complexity levels for systematic error catching implementation. This reference document provides technical details, implementation priorities, and success metrics for achieving complete API coverage in the BUFR library.


🧪 Background Information on Cases Used in the CTest Framework

The CTest framework provides self-contained test cases for validating individual workflow components. Each test creates an isolated environment with staged inputs from nightly stable baseline runs, enabling independent testing and validation.

C48 Fixed Atmosphere-Only Tests (ATM)

C48 Coupled System Tests (S2SW)

C48 Ensemble Tests (S2SW_gfs)

  • C48_S2SWA_gefs-gefs_fcst_mem001_seg0
  • GEFS ensemble member 001 coupled forecast (48-hour segment)
    • Implemented GEFS ensemble member 001 forecast test
    • 17 input files with unique two-cycle pattern:
      • 13 atmosphere ICs from current cycle (12Z)
      • 3 restart files from previous cycle (06Z)
      • 1 wave prep file from current cycle (12Z)
    • 24 output files (ensemble forecast outputs)
    • GEFS requires different source cycles for ICs vs restarts
    • Special handling for mem001/ subdirectory structure
  • C48_S2SWA_gefs-gefs_fcst_mem001_seg0.yaml

Framework Features:

  • Self-contained test environments with isolated EXPDIR
  • Input staging from STAGED_CTESTS (stable nightly runs)
  • Consistent naming convention: CASE-JOB.yaml
  • Comprehensive validation with input/output file verification

CI Error Analyses (MCP-RAG Assisted)

Detailed root-cause analyses of CI failures, performed using the EIB MCP-RAG GraphRAG toolset. Each report includes execution chain tracing, environment variable dependency mapping, and an MCP tool call scorecard.


🔧 CI/CD & DevOps

GitLab CI/CD Pipeline

Jenkins Integration

GitHub & Jenkins Integration


πŸ€– AI/ML & Intelligent Tools

Model Context Protocol (MCP)

AI Development Tools


🌊 Workflow Management Systems

Rocoto Workflow Engine

CROW & EcFlow

🌐 Weather Modeling & Configuration

GCAFS (Global Composition/Chemistry Aerosol Forecast System)

  • GCAFS-Overview - Comprehensive analysis of NOAA's next-generation aerosol and air quality forecasting system. Documents GCAFS architecture, its relationship to global-workflow, development timeline (4,040 commits since 2016), key contributors (Barry Baker, Li Pan, Cory Martin), and operational readiness status. GCAFS represents the fourth major forecasting capability alongside GFS, GEFS, and SFS, integrating the GOCART model for aerosol transport/chemistry. Analysis date: January 30, 2026

Model Configuration


💻 HPC System Administration

MPMD & MPI Runtime Infrastructure

  • MPMD_MPI_Runtime_Infrastructure - Comprehensive documentation of the Multiple-Program, Multiple-Data (MPMD) execution framework and MPI runtime configuration across all 11 supported HPC platforms. Covers the core run_mpmd.sh orchestration script, platform-specific launcher configurations (Slurm srun --multi-prog vs PBS mpiexec cfp), MPI tuning parameters (Intel MPI, Cray MPICH, PMI2), network fabric details (InfiniBand, Slingshot, EFA), and the three-level job resource configuration chain. Essential reference for HPC operations, platform portability, and debugging parallel execution issues. Analysis date: January 30, 2026

  • MPMD & MPI Runtime Infrastructure Technical Paper (PDF, 17 pages) - Detailed LaTeX technical specification with architecture diagrams, algorithm pseudocode, platform comparison tables, MPI tuning parameters, and complete environment file appendices.
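
For readers unfamiliar with Slurm's MPMD mode: a --multi-prog launch takes a configuration file mapping MPI rank ranges to executables. A minimal sketch (the executable names are illustrative, not the workflow's actual binaries):

```
# mpmd.conf: <rank range> <program> [args]
0     ./atm_model.x
1-4   ./ocean_model.x
5-6   ./wave_model.x
```

This would be launched as srun -n 7 --multi-prog mpmd.conf, with each rank range running its own program inside a single MPI job, which is the pattern run_mpmd.sh orchestrates per platform.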

Resource Configuration: CROW vs Global-Workflow

  • Resource Configuration Comparison Technical Paper (PDF, 44 pages) - Comprehensive technical analysis comparing the declarative CROW system (2016-2020) with the current imperative Global-Workflow approach (2020-present). Includes TikZ architecture diagrams, algorithm pseudocode, detailed code examples from the CROW YAML DSL and current shell-based configuration, MPMD runtime integration analysis, validation pipeline recommendations, and architectural guidance for next-generation workflow infrastructure. Covers the complete resource specification lifecycle from definition through validation to runtime execution. Analysis date: January 30, 2026

System Configuration


🔬 Research & Theory

Scientific Computing


πŸ› Development & Debugging

Bug Fixes & Solutions

Development Process


📚 Quick Reference

Most Viewed Topics:

  • CI/CD Pipeline Architecture
  • Rocoto Workflow Management
  • MCP/RAG Integration
  • Jenkins Configuration
  • HPC System Setup

Latest Updates:

  • Phase 2 Semantic Annotation Architecture (December 2025)
  • SME Training for Semantic Annotations (December 2025)
  • Hybrid Build-Time/Runtime Compliance Validation
  • MCP Server RAG Enhancement
  • AI-Assisted Development Tools

This wiki is actively maintained. Last organized: December 2025
