GraphRAG Hierarchical Community Materialization

Milestone Achievement Report — Phase 24E-5
Date: February 24, 2026
Author: Terrence McGuinness, Enterprise Infrastructure Branch, NOAA EMC
AI Collaboration: Claude Opus 4.6 (Anthropic) — dual-agent execution + verification
Commit: 27ad4e5 on develop | Version: v7.20.0


Executive Summary

The NOAA Global Workflow is a highly complex computational system that couples atmospheric, ocean, sea ice, land surface, and chemistry models across multiple HPC platforms. It now has a navigable hierarchical knowledge graph. Phase 24E-5 materialized 1,036 community nodes across 4 hierarchical levels, creating a first-class graph structure that enables multi-resolution understanding of how the Global Forecast System's subsystems are organized and interact.

This closes the final structural gap in the Graph-Guided Semantic Retrieval (GGSR) system. The platform now supports drill-down queries: from a global question like "how does data assimilation interact with the forecast model?" the system can traverse L3 → L2 → L1 → L0 community hierarchies, showing subsystem boundaries, interaction strengths, and member-level detail.


The Problem

Before Phase 24E-5, community detection had been run against the graph but only produced flat property tags — a communityId integer written to 25,352 nodes. There were no navigable community structures:

  • No (:Community) nodes in Neo4j
  • No MEMBER_OF, PARENT_OF, or INTERACTS_WITH relationships
  • No hierarchical levels (Leiden algorithm was run single-level only)
  • 63 template-based summaries in ChromaDB with no parent-child structure

This meant the GGSR global query path (retrieveGlobal()) could only do flat text search across 63 summaries — no structural traversal, no drill-down, no inter-subsystem interaction analysis.

For a system modeling Earth's climate across coupled atmosphere-ocean-ice-land-chemistry domains, flat community membership is insufficient. Understanding how ~40,000 code entities organize into subsystems and how those subsystems interact requires hierarchy.


What Was Built

Architecture

                    ┌──────────────────────┐
                    │   Level 3 (81 nodes) │  ← Global subsystems
                    │   Top-level clusters │
                    └──────────┬───────────┘
                               │ PARENT_OF
                    ┌──────────┴───────────┐
                    │   Level 2 (86 nodes) │  ← Major components
                    │   Sub-subsystems     │
                    └──────────┬───────────┘
                               │ PARENT_OF
                    ┌──────────┴───────────┐
                    │  Level 1 (175 nodes) │  ← Functional modules
                    │  Module clusters     │
                    └──────────┬───────────┘
                               │ PARENT_OF
                    ┌──────────┴───────────┐
                    │  Level 0 (694 nodes) │  ← Leaf communities
                    │  Tightly-coupled     │
                    │  function groups     │
                    └──────────┬───────────┘
                               │ MEMBER_OF
                    ┌──────────┴───────────┐
                    │  Code Nodes (21,559) │  ← Fortran, Python,
                    │  Subroutines, funcs, │    Shell, modules
                    │  modules, programs   │
                    └──────────────────────┘

              ← INTERACTS_WITH edges at every level →
                 (1,297 cross-community links)
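
The drill-down implied by the diagram can be sketched in memory. This is a minimal illustration, assuming PARENT_OF edges are available as `[parent, child]` pairs. The community ids and the helper names (`buildChildIndex`, `drillDown`, `leaves`) are hypothetical and do not come from the project code.

```javascript
// Hypothetical PARENT_OF edges as [parent, child] pairs.
// Ids are illustrative only, not taken from the real graph.
const parentOf = [
  ["L3_a", "L2_a"],
  ["L3_a", "L2_b"],
  ["L2_a", "L1_a"],
  ["L2_b", "L1_b"],
  ["L1_a", "L0_a"],
  ["L1_a", "L0_b"],
  ["L1_b", "L0_c"],
];

// Index children by parent so each drill-down step is a single lookup.
function buildChildIndex(edges) {
  const index = new Map();
  for (const [parent, child] of edges) {
    if (!index.has(parent)) index.set(parent, []);
    index.get(parent).push(child);
  }
  return index;
}

// Depth-first drill-down from one community to all of its descendants,
// returned as a nested tree of the form { id, children: [...] }.
function drillDown(id, childIndex) {
  const children = (childIndex.get(id) || []).map((c) => drillDown(c, childIndex));
  return { id, children };
}

// Collect the communities at the bottom of the tree under a given root.
function leaves(tree) {
  if (tree.children.length === 0) return [tree.id];
  return tree.children.flatMap((child) => leaves(child));
}

const tree = drillDown("L3_a", buildChildIndex(parentOf));
console.log(leaves(tree));
```

Indexing the children once keeps each step of the L3 → L2 → L1 → L0 walk a constant-time lookup, which matters little at 978 edges but keeps the traversal code simple.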

Query Flow: Before vs After

Before (flat):

User: "How does data assimilation interact with the forecast model?"
  → search ChromaDB 'community-summaries' (63 flat docs)
  → return top-3 text matches
  → no structural context, no drill-down

After (hierarchical):

User: "How does data assimilation interact with the forecast model?"
  → search ChromaDB 'community-summaries' (828 level-tagged docs)
  → prefer Level 2-3 matches for global context
  → for each match, drill down via PARENT_OF → child summaries
  → include INTERACTS_WITH edges (strength, relationship types)
  → return structured multi-resolution answer
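
One way to read the hierarchical flow above is as a rank-then-expand step. The sketch below is not the project's `retrieveGlobal()`. It assumes search hits arrive as `{ id, level, score }` objects and that "prefer Level 2-3" is realized as a small score boost. The boost value, the data shapes, and the function names are all assumptions made for illustration.

```javascript
// Assumption: a small additive boost is how higher levels are preferred.
const LEVEL_BOOST = 0.1;

// Rank hits, nudging Level 2-3 communities ahead for global context.
function rankHits(hits) {
  return hits
    .map((h) => ({ ...h, ranked: h.score + (h.level >= 2 ? LEVEL_BOOST : 0) }))
    .sort((a, b) => b.ranked - a.ranked);
}

// For the top-k hits, attach child communities (PARENT_OF drill-down)
// and any INTERACTS_WITH edges that touch the matched community.
function assembleAnswer(hits, childrenOf, interactions, k = 3) {
  return rankHits(hits)
    .slice(0, k)
    .map((h) => ({
      community: h.id,
      level: h.level,
      children: childrenOf.get(h.id) || [],
      interactions: interactions.filter((e) => e.a === h.id || e.b === h.id),
    }));
}

// Illustrative data only.
const hits = [
  { id: "L0_x", level: 0, score: 0.8 },
  { id: "L2_da", level: 2, score: 0.75 },
  { id: "L3_atm", level: 3, score: 0.6 },
];
const childrenOf = new Map([["L2_da", ["L1_obs", "L1_solver"]]]);
const interactions = [{ a: "L2_da", b: "L2_fcst", strength: 12 }];

console.log(JSON.stringify(assembleAnswer(hits, childrenOf, interactions, 2), null, 2));
```

With these inputs the Level 2 hit overtakes the slightly better-scoring Level 0 hit, and it is returned together with its two children and its one interaction edge.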

Verified Results

All metrics verified independently against live Neo4j and ChromaDB instances (verification agent separate from implementation agent).

Graph Structure

| Metric | Before | After | Change |
|---|---|---|---|
| Total nodes | 40,319 | 41,355 | +1,036 |
| Total relationships | 565,562 | 589,396 | +23,834 |
| (:Community) nodes | 0 | 1,036 | |
| Hierarchy levels | 0 | 4 | |
| MEMBER_OF edges | 0 | 21,559 | |
| PARENT_OF edges | 0 | 978 | |
| INTERACTS_WITH edges | 0 | 1,297 | |
| ChromaDB summaries | 63 flat | 828 hierarchical | 13x |
| Nodes with communityLevels | 0 | 25,377 | |
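
The reported totals are internally consistent, which can be confirmed with a few lines of arithmetic. The figures below are copied from this report (the per-level counts come from the architecture diagram). The variable names are only for illustration.

```javascript
// Figures copied from this report.
const levels = { L0: 694, L1: 175, L2: 86, L3: 81 };
const newEdges = { MEMBER_OF: 21559, PARENT_OF: 978, INTERACTS_WITH: 1297 };
const before = { nodes: 40319, relationships: 565562 };
const after = { nodes: 41355, relationships: 589396 };

const sum = (obj) => Object.values(obj).reduce((a, b) => a + b, 0);

const communityNodes = sum(levels); // 694 + 175 + 86 + 81
const addedEdges = sum(newEdges);   // 21559 + 978 + 1297

// Every new node is a community node, and every new relationship is one
// of the three new edge types.
console.log(communityNodes === after.nodes - before.nodes);
console.log(addedEdges === after.relationships - before.relationships);
```

Both checks print `true`: the 1,036 added nodes are exactly the four levels of communities, and the 23,834 added relationships are exactly the MEMBER_OF, PARENT_OF, and INTERACTS_WITH edges.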

Community Distribution by Level

| Level | Communities | Role | Largest |
|---|---|---|---|
| L0 (leaf) | 694 | Tightly-coupled function groups | 3,292 members |
| L1 | 175 | Functional module clusters | 3,417 members |
| L2 | 86 | Major subsystem components | 3,478 members |
| L3 (root) | 81 | Global workflow subsystems | 3,426 members |

Code Coverage by Language

| Node Type | Nodes in Communities | Coverage |
|---|---|---|
| FortranSubroutine | 13,479 | 99.6% |
| PythonFunction | 3,198 | 97.9% |
| FortranFunction | 2,250 | 95.5% |
| FortranModule | 1,521 | 98.8% |
| PythonModule | 614 | 98.4% |
| PythonClass | 248 | 100% |
| FortranProgram | 165 | 97.6% |
| File | 84 | |
| Total | 21,559 | |

Interaction Hotspots (Strongest Cross-Community Links)

| Community A | Community B | Strength | Level |
|---|---|---|---|
| L0_6726 | L0_6935 | 3,270 | 0 |
| L1_0 | L1_1 | 2,484 | 1 |
| L3_762 | L3_763 | 2,484 | 3 |
| L2_2850 | L2_1875 | 2,484 | 2 |
| L0_19157 | L0_9444 | 2,443 | 0 |

These high-strength interactions mark the heaviest cross-subsystem communication patterns in the Global Workflow: the boundaries where atmospheric analysis calls into the dynamical core, where ensemble members share state, and where ocean-atmosphere coupling occurs.

Tree Integrity

| Check | Result |
|---|---|
| PARENT_OF cycles | 0 (acyclic) |
| Single-parent compliance | 938/955 (98.2%) |
| Multi-parent nodes | 17 (1.8%), inherent Leiden boundary behavior |
| Summaries in Neo4j | 828/1,036 (80%), singletons excluded by design |
| Summaries in ChromaDB | 828 (matches Neo4j) |
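
The two structural checks in this table (acyclicity and single-parent compliance) can be sketched over a plain edge list. The function below is an illustration, not the project's integration test. It assumes PARENT_OF edges are given as `[parent, child]` pairs, and `checkTree` is a hypothetical name. The denominator of 955 in the table matches the number of communities below the root level (694 + 175 + 86), which is what the sketch counts as "nodes with at least one parent".

```javascript
// edges: array of [parent, child] pairs. Ids below are illustrative.
function checkTree(edges) {
  const parents = new Map();  // child -> list of parents
  const children = new Map(); // parent -> list of children
  for (const [p, c] of edges) {
    if (!parents.has(c)) parents.set(c, []);
    parents.get(c).push(p);
    if (!children.has(p)) children.set(p, []);
    children.get(p).push(c);
  }

  // Single-parent compliance over nodes that have at least one parent.
  const withParent = [...parents.keys()];
  const multiParent = withParent.filter((c) => parents.get(c).length > 1);

  // Cycle detection by depth-first search:
  // undefined = unvisited, 1 = on the current path, 2 = fully explored.
  const state = new Map();
  const hasCycleFrom = (n) => {
    if (state.get(n) === 1) return true;
    if (state.get(n) === 2) return false;
    state.set(n, 1);
    for (const c of children.get(n) || []) {
      if (hasCycleFrom(c)) return true;
    }
    state.set(n, 2);
    return false;
  };
  const acyclic = ![...children.keys()].some((n) => hasCycleFrom(n));

  return {
    acyclic,
    total: withParent.length,
    singleParent: withParent.length - multiParent.length,
    multiParent,
  };
}

const report = checkTree([
  ["L2_a", "L1_a"],
  ["L2_a", "L1_b"],
  ["L1_a", "L0_a"],
  ["L1_a", "L0_b"],
  ["L1_b", "L0_b"], // L0_b has two parents
]);
console.log(report);
```

In the example, `L0_b` is reported as the only multi-parent node, so compliance is 3 of 4 and the structure is acyclic.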

Test Results

 ✓ Community nodes exist at 3+ hierarchical levels        (17ms)
 ✓ MEMBER_OF relationships link code nodes to L0           (6ms)
 ✓ PARENT_OF tree is valid (acyclic, single parent)        (7ms)
 ✓ INTERACTS_WITH edges capture cross-community comms      (5ms)
 ✓ Community nodes have summaries in Neo4j                 (4ms)
 ✓ Community nodes have metadata (languages, keyMembers)   (3ms)

 Test Files  1 passed (1)
      Tests  6 passed (6)
   Duration  402ms

Implementation

Method: Spec-Driven Development (SDD)

Phase 24E-5 was specified before implementation in sdd_framework/workflows/phase24e_hierarchical_communities.md (v1.1.0). The spec defined 10 execution steps, success criteria, and expected metrics. Implementation was performed by an AI agent via GitHub CLI; verification was performed independently by a separate AI agent thread.

Execution Timeline

| Step | Duration | Description |
|---|---|---|
| 1 | | Re-run Leiden with includeIntermediateCommunities: true |
| 2 | | Create (:Community) label nodes with uniqueness constraint |
| 3 | | Create MEMBER_OF relationships (code node → L0 community) |
| 4 | | Create PARENT_OF hierarchy (L0 → L1 → L2 → L3) |
| 5 | | Compute INTERACTS_WITH between communities at each level |
| 6 | | Enrich Community nodes with metadata (languages, key members) |
| 7 | | Generate bottom-up hierarchical summaries (L0 → L1 → L2 → L3) |
| 8 | | Update retrieveGlobal() with level-aware search + drill-down |
| 9 | | Add --materialize flag to pipeline script |
| 10 | | Create integration tests, validate, commit |
| Total | 6 min | SDD session session_2026-02-24_lluu6h |
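
Steps 3 and 4 can be illustrated with a small sketch. It assumes each code node carries an array of community ids, one per level, ordered from finest (Level 0) to coarsest. Both that ordering and the input shape are assumptions, as is the function name `deriveHierarchy`. The `L<level>_<id>` naming follows the community ids shown in this report.

```javascript
// Hypothetical per-node level arrays: [L0 id, L1 id, L2 id].
const nodeLevels = {
  sub_a: [10, 3, 1],
  sub_b: [10, 3, 1],
  sub_c: [11, 3, 1],
  sub_d: [12, 4, 1],
};

function deriveHierarchy(levelsByNode) {
  const memberOf = [];        // [codeNode, L0 community]
  const parentOf = new Set(); // "parent>child" strings, deduplicated
  for (const [node, ids] of Object.entries(levelsByNode)) {
    // Step 3: each code node is a member of its Level 0 community.
    memberOf.push([node, `L0_${ids[0]}`]);
    // Step 4: each adjacent pair of levels yields one parent-child edge.
    for (let lvl = 1; lvl < ids.length; lvl++) {
      parentOf.add(`L${lvl}_${ids[lvl]}>L${lvl - 1}_${ids[lvl - 1]}`);
    }
  }
  return {
    memberOf,
    parentOf: [...parentOf].map((s) => s.split(">")),
  };
}

const hierarchy = deriveHierarchy(nodeLevels);
console.log(hierarchy.memberOf.length, "MEMBER_OF edges");
console.log(hierarchy.parentOf);
```

Deduplicating through a set matters because many code nodes share the same chain of communities. Here four nodes produce four MEMBER_OF edges but only five distinct PARENT_OF edges.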

Files Changed

| File | Lines | Changes |
|---|---|---|
| src/graphrag/CommunityDetection.js | +291 | runHierarchicalLeiden(), materializeCommunityNodes(), createMemberOfRelationships(), createParentOfHierarchy(), computeInteractsWith(), enrichCommunityMetadata(), getCommunitiesAtLevel(), getChildCommunities(), getCommunityInteractions(), getMaxCommunityLevel() |
| src/graphrag/CommunitySummarizer.js | +146 | summarizeHierarchical(), generateParentSummary() |
| src/graphrag/GraphGuidedRetrieval.js | +65 | Level-aware retrieveGlobal() with PARENT_OF drill-down and INTERACTS_WITH in output |
| scripts/run_community_detection.js | +77 | --materialize flag for full hierarchical pipeline |
| src/__tests__/CommunityHierarchy.test.js | +109 | 6 integration tests |
| CHANGELOG.md | +26 | v7.20.0 entry |

Total: 8 files changed, 801 insertions(+), 31 deletions(-)


Significance for the Global Workflow

The NOAA Global Forecast System (GFS) runs operationally every 6 hours, producing weather forecasts that protect lives and property worldwide. The codebase spans:

  • 17,575 Fortran subroutines and functions across GFS, GEFS, UFS, MOM6, CICE6, WW3
  • 3,267 Python functions in workflow automation (pygfs, wxflow)
  • 264 Shell scripts orchestrating J-Jobs across WCOSS2, Hera, Orion, Gaea, Hercules
  • 2,489 environment variables controlling runtime behavior

Understanding how these components organize into subsystems — and critically, how those subsystems interact at their boundaries — is essential for the scientists and engineers evolving this infrastructure in a changing climate.

The hierarchical community structure captures this organization computationally:

  • L0 communities identify tightly-coupled function groups (e.g., a specific physics parameterization)
  • L1 communities cluster these into functional modules (e.g., atmospheric radiation package)
  • L2 communities represent major subsystem components (e.g., data assimilation pipeline)
  • L3 communities capture global workflow subsystems (e.g., atmosphere, ocean, coupling)
  • INTERACTS_WITH edges quantify where subsystems communicate, with strength proportional to the number of cross-boundary function calls, variable references, and module imports
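
The strength measure described in the last bullet can be sketched as a count of code-level edges that cross a community boundary. This is an illustration under stated assumptions, not the project's `computeInteractsWith()`: the relationship type names, the undirected pairing, and the function name `countCrossCommunityEdges` are all hypothetical.

```javascript
// Hypothetical inputs: code node -> community id at one level, and typed
// edges between code nodes. Type names are illustrative.
const communityOf = new Map([
  ["f1", "L0_1"],
  ["f2", "L0_1"],
  ["g1", "L0_2"],
  ["g2", "L0_2"],
]);
const codeEdges = [
  { from: "f1", to: "g1", type: "CALLS" },
  { from: "f2", to: "g1", type: "CALLS" },
  { from: "g2", to: "f1", type: "USES" },
  { from: "f1", to: "f2", type: "CALLS" }, // same community, not counted
];

function countCrossCommunityEdges(membership, edges) {
  const acc = new Map();
  for (const { from, to, type } of edges) {
    const a = membership.get(from);
    const b = membership.get(to);
    if (!a || !b || a === b) continue; // skip unknown nodes and internal edges
    // Assumption: interactions are undirected, so order the pair.
    const [lo, hi] = a < b ? [a, b] : [b, a];
    const key = `${lo}|${hi}`;
    const entry = acc.get(key) || { a: lo, b: hi, strength: 0, types: {} };
    entry.strength += 1;
    entry.types[type] = (entry.types[type] || 0) + 1;
    acc.set(key, entry);
  }
  // Strongest interactions first.
  return [...acc.values()].sort((x, y) => y.strength - x.strength);
}

const links = countCrossCommunityEdges(communityOf, codeEdges);
console.log(links);
```

The example yields a single link between `L0_1` and `L0_2` with strength 3 (two calls and one use). The intra-community call from `f1` to `f2` does not contribute.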

This is no longer a flat bag of code files searchable by text similarity. It is a structural map of one of the most complex computational workflows on Earth.


Current System State (Post-24E-5)

| Component | Metric |
|---|---|
| Neo4j | 41,355 nodes, 589,396 relationships, 28 label types, 23 edge types |
| ChromaDB | 5 collections, 63,837 documents |
| MCP Tools | 43 tools across 9 modules |
| Community Hierarchy | 1,036 nodes, 4 levels, 828 summaries |
| GGSR Pipeline | LOCAL + GLOBAL + TRACE + HYBRID query routing |
| Cross-Language | Shell → Fortran → Python traversal (65 EXECUTES bridges) |
| Docker Gateway | Port 18888, Streamable HTTP |

Remaining GraphRAG Work

| Phase | Status | Description |
|---|---|---|
| 24E-5 | COMPLETE | Hierarchical Community Materialization |
| 27H | NOT STARTED | search_documentation multi-collection routing |
| 24I | Planned | Learned Graph Embeddings |
| 24J | Planned | Subgraph Retrieval |

Reproducibility

The full hierarchical pipeline can be re-run after any graph change:

cd mcp_server_node

# Re-run with materialization (destroys + rebuilds community structure)
node scripts/run_community_detection.js --materialize

# Run validation tests
npx vitest run src/__tests__/CommunityHierarchy.test.js

"If it's not in the SDD, it doesn't get coded."
Phase 24E-5 was specified, executed, verified, and committed in a single session.