Neo4j GraphRAG Ingestion Pipeline - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Neo4j GraphRAG Ingestion Pipeline

How the Global Workflow Knowledge Graph Gets Built — From Source Code to Queryable Graph

This document describes the complete ingestion pipeline that transforms 40,000+ code entities across Fortran, Python, and Shell into a unified Neo4j knowledge graph with 589K+ relationships, plus ChromaDB vector embeddings for semantic search. All scripts live in mcp_server_node/scripts/.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Source Code (global-workflow)                │
│   sorc/ (Fortran)  │  ush/ (Python/Shell)  │  dev/jobs/ (J-Jobs)│
└────────┬────────────────────┬────────────────────────┬──────────┘
         │                    │                        │
    ┌────▼────┐          ┌────▼────┐              ┌────▼────┐
    │ Fortran │          │ Python  │              │  Shell  │
    │ Parser  │          │  AST    │              │  Regex  │
    │fparser2 │          │(stdlib) │              │ Parser  │
    └────┬────┘          └────┬────┘              └────┬────┘
         │                    │                        │
         └──────────┬─────────┴───────────┬────────────┘
                    │                     │
              ┌─────▼──────┐       ┌──────▼───────┐
              │   Neo4j    │       │  ChromaDB    │
              │ (Graph DB) │       │ (Vector DB)  │
              │ 41K nodes  │       │ 66K docs     │
              │ 589K rels  │       │ 5 collections│
              └─────┬──────┘       └──────┬───────┘
                    │                     │
              ┌─────▼─────────────────────▼───────┐
              │   MCP Tools (48 registered)       │
              │   GGSR = Graph-Guided Semantic    │
              │          Retrieval (hybrid)       │
              └───────────────────────────────────┘

Pipeline Stages

The ingestion runs in four sequential stages. Each stage builds on the previous one.

Stage 1: Code Graph Ingestion (Neo4j)

These scripts parse source code and create the structural graph — nodes for every code entity and edges for every call, import, and dependency.

1A. Fortran Graph — `ingest_fortran_graph.py`

Property	Value
Parser	fparser2 (via FortranFileReader — raw strings will fail)
Source Dirs	`sorc/`, `sorc/ufs_model.fd`, `sorc/gsi.fd`, `sorc/gfs_utils.fd`
Database	Neo4j only
Phase	10 (Milestone 2)

Node labels created:

Label	Description	Example
`FortranModule`	Fortran MODULE declarations	`module_radlw_main`
`FortranSubroutine`	SUBROUTINE definitions	`getcon`, `sfcdrv`
`FortranFunction`	FUNCTION definitions	`fpvs`, `plnev_a`
`FortranProgram`	Compiled executables (PROGRAM)	`gfs_atmos_cubed_sphere`

Relationship types created:

Relationship	Meaning
`(:caller)-[:CALLS {line}]->(:FortranSubroutine)`	Subroutine/function call
`(:code)-[:USES {only}]->(:FortranModule)`	Fortran USE statement
`(:module)-[:CONTAINS]->(:subroutine|function)`	Module contains definition

Run: python3 ingest_fortran_graph.py (or --test FILE for a single file, --dry-run for validation only)

1B. Python Graph — `ingest_python_graph.py`

Property	Value
Parser	Python stdlib `ast` module
Source Dirs	`ush/`, `dev/`, `sorc/wxflow`, `sorc/gdas.cd/ush`, `sorc/verif-global.fd/ush`, `sorc/nexus.fd/utils/python`
Database	Neo4j only
Phase	24F-0

Node labels created:

Label	Description	Example
`PythonModule`	Python modules/packages	`atm_diag`, `gfs_tasks`
`PythonClass`	Class definitions	`GFSForecast`, `AtmAnalysis`
`PythonFunction`	Functions and methods	`execute()`, `initialize()`

Relationship types created:

Relationship	Meaning
`(:module)-[:DEFINES]->(:function|class)`	Module defines entity
`(:class)-[:INHERITS]->(:PythonClass)`	Class inheritance
`(:caller)-[:CALLS {line}]->(:PythonFunction)`	Function call
`(:module)-[:IMPORTS {type, alias}]->(:PythonModule)`	Import statement

Run: python3 ingest_python_graph.py (or --test FILE, --dry-run, --sample for 50-file validation)
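
As a rough illustration of how the stdlib `ast` module can yield the node and edge types listed above, the sketch below walks a single module's syntax tree. It is not the code in `ingest_python_graph.py`: the function name `extract_python_graph`, the tuple layouts, and the decision to key calls by the enclosing function's name are assumptions made for this example, and nothing is written to Neo4j.

```python
import ast


def extract_python_graph(source, module_name):
    """Collect DEFINES / INHERITS / IMPORTS / CALLS tuples for one module.

    Hypothetical helper for illustration only; the real ingester may
    structure its output differently.
    """
    tree = ast.parse(source)
    graph = {"DEFINES": [], "INHERITS": [], "IMPORTS": [], "CALLS": []}

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph["DEFINES"].append((module_name, node.name))
            # Calls inside nested functions are also attributed to the
            # outer function here; a real ingester may want to avoid that.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    callee = getattr(func, "id", None) or getattr(func, "attr", None)
                    if callee:
                        graph["CALLS"].append((node.name, callee, inner.lineno))
        elif isinstance(node, ast.ClassDef):
            graph["DEFINES"].append((module_name, node.name))
            for base in node.bases:
                base_name = getattr(base, "id", None) or getattr(base, "attr", None)
                if base_name:
                    graph["INHERITS"].append((node.name, base_name))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                graph["IMPORTS"].append((module_name, alias.name, "import", alias.asname))
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                # Relative imports such as "from . import x" have module=None.
                graph["IMPORTS"].append((module_name, node.module or ".", "from", alias.asname))
    return graph
```

Each tuple maps naturally onto one relationship in the table above; for example, a `CALLS` tuple carries the line number that would populate the `{line}` property.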

1C. Shell Graph — `ingest_shell_graph_v8.py`

Property	Value
Parser	Regex-based (no AST for shell)
Source Dirs	`dev/jobs/` (J-Jobs), `dev/scripts/` (ex-scripts), `ush/`, `scripts/`
Database	Neo4j only
Phase	27B

Node labels created:

Label	Description	Example
`ShellScript`	Shell scripts with category metadata	`JGFS_ATMOS_ANALYSIS`
`EnvironmentVariable`	Environment variable declarations	`HOMEgfs`, `DATAROOT`
`ConfigFile`	Configuration file references	`config.base`, `config.fcst`

Relationship types created:

Relationship	Meaning
`(:script)-[:SOURCES]->(:other_script)`	`source` / `.` directives
`(:script)-[:INVOKES]->(:ex_script)`	Script invocation
`(:script)-[:READS_CONFIG]->(:config)`	Config file reads
`(:script)-[:EXPORTS]->(:env_var)`	`export VAR=value`
`(:script)-[:DEPENDS_ON_ENV]->(:env_var)`	`${VAR}` / `$VAR` usage

Script categories: j-job, ex-script, ush-script, legacy-script

Run: python3 ingest_shell_graph_v8.py
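
To make the regex-based approach concrete, here is a minimal sketch that finds `source` / `.` directives and sorts the targets into `SOURCES` and `READS_CONFIG` edges. The pattern, the helper name `classify_sourced_files`, and the rule that a basename starting with `config.` counts as a config read are all assumptions for this example; the patterns in `ingest_shell_graph_v8.py` may differ.

```python
import os
import re

# Matches "source path" or ". path" at the start of a line, with an
# optional opening quote around the path (assumed pattern, not the real one).
SOURCE_RE = re.compile(r"^[ \t]*(?:source|\.)[ \t]+[\"']?([^\s\"';]+)", re.MULTILINE)


def classify_sourced_files(script_text):
    """Return basenames of sourced files grouped by relationship type."""
    edges = {"SOURCES": [], "READS_CONFIG": []}
    for path in SOURCE_RE.findall(script_text):
        name = os.path.basename(path)
        # Assumption: files named config.* are treated as ConfigFile nodes.
        kind = "READS_CONFIG" if name.startswith("config.") else "SOURCES"
        edges[kind].append(name)
    return edges
```

A line such as `./run_thing.sh` is not matched, because the leading `.` must be followed by whitespace to count as a source directive.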

1D. Environment Variables — `ingest_env_variables.py`

Property	Value
Source Dirs	`dev/jobs`, `jobs`, `ush`, `scripts`
Database	Neo4j only
Phase	24 Gap 1

Supplements the shell graph with detailed environment variable tracking. Detects export VAR=value, plain VAR=value, and all ${VAR} / $VAR references.

Relationship types created:

Relationship	Meaning
`(:script)-[:EXPORTS]->(:EnvironmentVariable)`	export declaration
`(:script)-[:SETS]->(:EnvironmentVariable)`	assignment without export
`(:script)-[:DEPENDS_ON_ENV]->(:EnvironmentVariable)`	variable reference

Run: python3 ingest_env_variables.py (or --test FILE, --var VARNAME to query one variable)
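
The three detection cases described above (exported assignment, plain assignment, and reference) can be sketched with three regular expressions. This is an illustrative approximation, not the patterns used by `ingest_env_variables.py`; the names `EXPORT_RE`, `SET_RE`, `REF_RE`, and `classify_env_usage` are hypothetical, and the sketch does not try to skip comments or quoted heredocs.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
EXPORT_RE = re.compile(rf"^[ \t]*export[ \t]+({NAME})=", re.MULTILINE)
SET_RE = re.compile(rf"^[ \t]*({NAME})=", re.MULTILINE)
REF_RE = re.compile(rf"\$\{{?({NAME})")  # matches both $VAR and ${VAR...}


def classify_env_usage(script_text):
    """Group variable names by the relationship they would produce."""
    exports = set(EXPORT_RE.findall(script_text))
    # A plain assignment only counts as SETS if it is never exported.
    sets = set(SET_RE.findall(script_text)) - exports
    refs = set(REF_RE.findall(script_text))
    return {"EXPORTS": exports, "SETS": sets, "DEPENDS_ON_ENV": refs}
```

Positional and special parameters such as `$1` or `$?` are ignored because they do not start with a letter or underscore.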

1E. Cross-Language Bridges — `ingest_cross_language_bridges.py`

Property	Value
Database	Neo4j only
Phase	24F-2, 27I, 27J

This is the critical script that connects the three language graphs. It finds the places where shell scripts launch Fortran executables (for example, $EXECgfs/gfs_atmos.x) or invoke Python modules, and creates edges that let you trace execution across language boundaries.

Relationship types created:

Relationship	Meaning
`(:File)-[:EXECUTES {executable, line}]->(:FortranProgram)`	Shell → Fortran executable launch
`(:File)-[:INVOKES {script, line}]->(:PythonModule)`	Shell → Python invocation

Executable patterns detected:

$EXECgfs/name.x, ${HOMEgfs}/exec/name.x
Config-defined: $FCSTEXEC, $APRUNFC $exec_name
WW3 patterns: ${NET,,}_ww3_*.x
UFS model: gfs_model.x, gefs_model.x, sfs_model.x

Creates placeholder FortranProgram nodes for external packages (GSI, UFS_UTILS, Fit2Obs, WW3) when the actual Fortran source is not in the repo.

Run: python3 ingest_cross_language_bridges.py (or --dry-run, --verbose)
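
A minimal sketch of the first pattern family (`$EXECgfs/name.x` and `${HOMEgfs}/exec/name.x`) is shown below. It only illustrates how an executable name and line number could be captured for the `EXECUTES {executable, line}` properties; the regex and the helper name `find_executables` are assumptions, and the config-defined, WW3, and UFS patterns are not covered.

```python
import re

# Two alternatives: $EXEC<suffix>/<name>.x and $HOME<suffix>/exec/<name>.x,
# each with optional braces around the variable (assumed pattern).
EXEC_RE = re.compile(
    r"\$\{?EXEC\w*\}?/([\w.-]+\.x)\b"
    r"|\$\{?HOME\w*\}?/exec/([\w.-]+\.x)\b"
)


def find_executables(script_text):
    """Return (executable, line_number) pairs found in a shell script."""
    found = []
    for line_no, line in enumerate(script_text.splitlines(), start=1):
        for match in EXEC_RE.finditer(line):
            found.append((match.group(1) or match.group(2), line_no))
    return found
```

Executables reached through an indirection such as `$FCSTEXEC` would need a second pass that first resolves the variable from the config files, which this sketch leaves out.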

1F. Documentation–Code Links — `link_docs_to_code.py`

Property	Value
Database	Neo4j
Phase	Week 3 Plan Phase 3

Creates DOC_DESCRIBES relationships between documentation chunks and code entities they reference, using regex pattern matching on file names, function names, job names, and script paths.

Run: python3 link_docs_to_code.py

Stage 2: Code Vector Embeddings (ChromaDB)

These scripts create searchable vector embeddings of source code, enabling semantic code search ("find code that handles precipitation accumulation").

2A. Code Embeddings — `ingest_code_v8.py`

Property	Value
Embedding Model	`all-mpnet-base-v2` (768 dimensions)
Collection	`code-with-context-v8-0-0` (~58,761 documents)
Database	ChromaDB only

Chunks Python and Shell source code along AST-aware semantic boundaries (200–2000 characters per chunk) and attaches 3 lines of context before and after each chunk. The script reads from Neo4j to obtain graph enrichment metadata but does not write to Neo4j.
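
The chunking idea can be sketched for Python input as follows: each top-level definition becomes a chunk, oversized definitions are split on line boundaries, and three neighbouring lines are kept as context. This is an assumption-laden illustration rather than the logic of `ingest_code_v8.py`. The names `chunk_python_source` and `_split_by_size` are hypothetical, the 200-character lower bound is not enforced, and Shell chunking is not shown.

```python
import ast

MAX_CHARS = 2000     # upper chunk size quoted in the text
CONTEXT_LINES = 3    # lines of context kept before and after each chunk


def _split_by_size(lines, start, end):
    """Yield (start, end) line ranges that stay within MAX_CHARS where possible."""
    piece_start, size = start, 0
    for i in range(start, end):
        line_len = len(lines[i]) + 1  # +1 for the newline
        if size + line_len > MAX_CHARS and i > piece_start:
            yield piece_start, i
            piece_start, size = i, 0
        size += line_len
    yield piece_start, end


def chunk_python_source(source):
    """Chunk top-level functions and classes, attaching surrounding context."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        start, end = node.lineno - 1, node.end_lineno  # 0-based slice bounds
        for piece_start, piece_end in _split_by_size(lines, start, end):
            chunks.append({
                "name": node.name,
                "body": "\n".join(lines[piece_start:piece_end]),
                "context_before": lines[max(0, piece_start - CONTEXT_LINES):piece_start],
                "context_after": lines[piece_end:piece_end + CONTEXT_LINES],
            })
    return chunks
```

Keeping the context in separate fields, rather than concatenating it into the body, is one possible design: it lets the embedding step decide whether context should influence the vector or only be stored as metadata.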

2B. Graph-Enriched Code Embeddings — `ingest_code_graph_enriched_v6.py`

Property	Value
Collection	`code-graph-v6-0-0`
Database	Neo4j + ChromaDB (hybrid)

Creates embeddings whose metadata includes dependency context from the Neo4j graph, so that code search results can be weighted by structural importance.

Run: python3 ingest_code_graph_enriched_v6.py --directory PATH --collection NAME

Stage 3: Documentation & Standards Embeddings (ChromaDB)

3A. Documentation — `ingest_documentation_v8.py`

Property	Value
Sources	16 web-crawled sites (Global Workflow RTD, EE2, UFS, Rocoto, ecFlow, Spack, JEDI, etc.)
Collection	`global-workflow-docs-v8-0-0` (~5,409 documents)
Database	ChromaDB only
Embedding Model	`all-mpnet-base-v2` (768 dimensions)

Source URLs are defined in documentation_sources_config.py, the single point of truth (SPOT) for all documentation targets. The script tags each document with a priority tier: tier1_critical, tier2_important, or tier3_supplementary.

3B. EE2 Standards — `ingest_ee2_enhanced_v5.py`

Property	Value
Sources	RST files in `sdd_framework/phase2_annotations/`
Collection	`ee2-standards-v5-0-0-enhanced` (~34 documents)
Database	ChromaDB only

Parses custom RST directives (mcp:anti_pattern::, mcp:correct_pattern::, mcp:ai_guidance_rule::, etc.) into structured compliance rules with severity levels and false-positive rates.
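
For readers unfamiliar with RST directives, the sketch below shows one way a block such as `.. mcp:anti_pattern:: <argument>` followed by indented `:option: value` lines and body text could be turned into a structured rule. It follows generic reStructuredText directive layout only; the helper name `parse_mcp_directives`, the output dictionary shape, and the `severity` option used in the usage note are assumptions, not the schema used by `ingest_ee2_enhanced_v5.py`.

```python
import re

DIRECTIVE_RE = re.compile(r"^\.\.\s+(mcp:\w+)::\s*(.*)$")
OPTION_RE = re.compile(r"^\s+:(\w+):\s*(.*)$")


def parse_mcp_directives(rst_text):
    """Extract mcp:* directives with their options and indented content."""
    rules, current = [], None
    for line in rst_text.splitlines():
        match = DIRECTIVE_RE.match(line)
        if match:
            current = {
                "directive": match.group(1),
                "argument": match.group(2).strip(),
                "options": {},
                "content": [],
            }
            rules.append(current)
            continue
        if current is None:
            continue
        if line and not line[0].isspace():
            current = None  # unindented text ends the directive block
            continue
        option = OPTION_RE.match(line)
        if option and not current["content"]:
            # Options are only recognised before the body text starts.
            current["options"][option.group(1)] = option.group(2).strip()
        elif line.strip():
            current["content"].append(line.strip())
    return rules
```

Given a directive with a hypothetical `:severity: high` option and one line of body text, the function returns a single rule whose `options` dictionary holds the severity and whose `content` list holds the body.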

3C. J-Jobs — `ingest_jjobs_v8.py`

Property	Value
Sources	J-Job scripts in `dev/jobs/`
Collection	`jjobs-v8-0-0` (~700 documents)
Database	ChromaDB only

Creates vector embeddings of J-Job scripts with structured metadata (inputs, outputs, calls, configs, env_vars).

3D. Master Re-Ingestion — `reingest_all_with_phase2.sh`

Orchestration wrapper that runs the full documentation + EE2 pipeline in order:

1. ingest_documentation_week3.py (standard web crawl)
2. ingest_ee2_enhanced_v5.py (Phase 2 annotations)
3. generatePhase2Config.js (generates phase2_anti_patterns.json)
4. Validation checks

Run: ./reingest_all_with_phase2.sh <collection_name>

Use this only for full system refreshes, such as a change of embedding model. It is not intended for routine updates.

Stage 4: Community Detection & Summarization (Neo4j + ChromaDB)

This is the most advanced stage — it discovers emergent structure in the code graph using graph algorithms, then generates natural-language summaries of each community using LLMs.

Step 1: Community Detection — `run_community_detection.js`

Property	Value
Algorithm	Leiden (via Neo4j Graph Data Science plugin)
Database	Neo4j
Phase	24E

Runs the GDS Leiden algorithm on the code graph to discover communities of tightly-coupled code entities. With --materialize, creates a 4-level hierarchy:

Level	Count	Granularity
L0	694	Fine-grained function clusters
L1	175	Module-level groupings
L2	86	Subsystem groupings
L3	81	Major system components

Node labels created:

Label	Properties
`Community`	`communityId`, `level`, `memberCount`, `name`, `languages`, `keyMembers`, `summary`, `summarySource`, `summaryModel`

Relationship types created:

Relationship	Meaning
`(:node)-[:MEMBER_OF]->(:Community)`	Entity belongs to community
`(:Community)-[:PARENT_OF]->(:Community)`	Hierarchy (child→parent)
`(:Community)-[:INTERACTS_WITH {strength}]->(:Community)`	Cross-community coupling

Run: node run_community_detection.js --materialize (full hierarchy)

Step 2: Export Contexts — `export_community_contexts.js`

Extracts community metadata from Neo4j into data/community_contexts.json for LLM processing. Each community context includes members, internal/external relationships, child summaries, and language distribution.

Run: node export_community_contexts.js

Step 3: Generate LLM Summaries — `generate_llm_summaries.js`

Sends each community context to the GitHub Models API for natural-language summarization. Communities are processed bottom-up (L0 first) so that parent communities can reference their children's summaries. Requests rotate through a pool of models (Ministral-3B, Cohere, Llama-3.3-70B, Codestral, Mistral-small, Phi-4) and are rate-limited to 12 requests per minute.

Output: data/llm_summaries.json

Run: node generate_llm_summaries.js (resume-safe — skips already-summarized communities)
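
The control flow of this step (bottom-up ordering, skipping communities that already have a summary, and spacing requests to stay under 12 per minute) can be sketched as below. The actual script is Node.js; this Python version only mirrors the logic. The function name `summarize_communities`, the `children` field, and the injected `summarize` and `sleep` callables are assumptions for the example, and model rotation is omitted.

```python
import time

REQUESTS_PER_MINUTE = 12
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE  # seconds between requests


def summarize_communities(contexts, existing, summarize, sleep=time.sleep):
    """Summarize communities level by level, reusing existing summaries.

    contexts:  list of dicts with 'communityId', 'level', optional 'children'
               (a hypothetical shape for this sketch)
    existing:  dict of communityId -> summary from a previous run
    summarize: callable(context, child_summaries) -> summary text
    """
    summaries = dict(existing)
    for ctx in sorted(contexts, key=lambda c: c["level"]):  # L0 first
        cid = ctx["communityId"]
        if cid in summaries:
            continue  # resume-safe: already summarized in an earlier run
        child_summaries = [
            summaries[child] for child in ctx.get("children", []) if child in summaries
        ]
        summaries[cid] = summarize(ctx, child_summaries)
        sleep(MIN_INTERVAL)  # simple fixed spacing to respect the rate limit
    return summaries
```

Passing `sleep` as a parameter keeps the function easy to exercise without waiting; a production version would more likely track timestamps so that slow responses are not penalized with a full extra delay.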

Step 4: Import Summaries — `import_llm_summaries.js`

Imports the generated summaries back into Neo4j (updating Community.summary property) and into ChromaDB (community-summaries collection, ~1,648 documents).

Run: node import_llm_summaries.js (or --skip-chromadb, --skip-neo4j to target one database)

Node.js Ingestion Modules

In addition to the standalone scripts described above, src/ingestion/neo4j/ contains Node.js modules used by the MCP server for on-demand graph operations:

Module	Purpose
`Neo4jClient.js`	Connection pooling, transactions, query/write interface
`GraphSchema.js`	Complete node/relationship schema definitions
`CodeStructureIngester.js`	Batch code parsing (Python AST, Shell/Fortran regex)
`CMakeGraphIngester.js`	Build system dependency graph from CMakeLists.txt
`SubmoduleGraphIngester.js`	Git submodule dependency mapping
`GitHubGraphIngester.js`	GitHub metadata (issues, PRs, contributors)

Data Access Layer

The unified data access layer connects Neo4j and ChromaDB to the MCP tools:

Module	Location	Purpose
`GraphDatabase.js`	`src/data/`	Neo4j query interface (`query()` for reads, `write()` for mutations)
`VectorDatabase.js`	`src/data/`	ChromaDB v2 API wrapper (6+ collections)
`UnifiedDataAccess.js`	`src/data/`	GGSR — Graph-Guided Semantic Retrieval (hybrid Neo4j + ChromaDB fusion)

GGSR (Graph-Guided Semantic Retrieval) is the key innovation: when you search for code, GGSR queries ChromaDB for semantic matches, then traverses the Neo4j graph to find structurally related entities, and fuses the results with configurable weight blending.
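
The fusion step can be illustrated with a simple linear blend of two score maps. This is a minimal sketch of weighted score fusion in general, not the algorithm in `UnifiedDataAccess.js`: the function name `fuse_results`, the `semantic_weight` parameter and its default, and the assumption that both score sets are already normalized to the range 0 to 1 are all introduced for the example.

```python
def fuse_results(semantic_hits, graph_hits, semantic_weight=0.6):
    """Blend semantic and graph scores into one ranked list.

    semantic_hits: dict of entity id -> similarity score from the vector search
    graph_hits:    dict of entity id -> relevance score from graph traversal
    Entities missing from one source contribute 0.0 for that source.
    """
    graph_weight = 1.0 - semantic_weight
    ids = set(semantic_hits) | set(graph_hits)
    fused = {
        entity: semantic_weight * semantic_hits.get(entity, 0.0)
        + graph_weight * graph_hits.get(entity, 0.0)
        for entity in ids
    }
    # Highest score first; ties broken by id for a stable ordering.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

With equal weights, an entity found by both searches outranks one that scores highly in only a single source, which is the behaviour a hybrid retriever usually wants.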

Current Graph Statistics (March 2026)

Neo4j Node Counts

Label	Count	Source Script
FortranSubroutine	~13,537	`ingest_fortran_graph.py`
File	~11,016	Multiple ingestion scripts
PythonFunction	~3,267	`ingest_python_graph.py`
EnvironmentVariable	~2,730	`ingest_env_variables.py` + `ingest_shell_graph_v8.py`
Community	~1,036	`run_community_detection.js`
ShellScript	~1,000+	`ingest_shell_graph_v8.py`
PythonModule	~624	`ingest_python_graph.py`
FortranFunction	~500	`ingest_fortran_graph.py`
PythonClass	~248	`ingest_python_graph.py`
FortranProgram	~200+	`ingest_fortran_graph.py`
FortranModule	~150	`ingest_fortran_graph.py`
Total	~41,355

Neo4j Relationship Counts

Type	Count	Meaning
CALLS	~439,919	Function/subroutine calls
USES	~91,285	Fortran USE statements
IMPORTS	~50,000+	Python import statements
MEMBER_OF	~21,559	Community membership
DEFINES	~15,000+	Module defines entity
DEPENDS_ON_ENV	~6,007	Script uses env var
INTERACTS_WITH	~5,000+	Cross-community coupling
EXPORTS	~1,669	Script exports env var
PARENT_OF	~978	Community hierarchy
EXECUTES	~16	Shell → Fortran launch
Total	~589,396

ChromaDB Collections

Collection	Documents	Source Script
`code-with-context-v8-0-0`	58,761	`ingest_code_v8.py`
`global-workflow-docs-v8-0-0`	5,409	`ingest_documentation_v8.py`
`community-summaries`	1,648	`import_llm_summaries.js`
`jjobs-v8-0-0`	700	`ingest_jjobs_v8.py`
`ee2-standards-v5-0-0-enhanced`	34	`ingest_ee2_enhanced_v5.py`
Total	~66,552

Embedding Model: all-mpnet-base-v2 (768 dimensions) — all collections must use the same model

Execution Order (Full Rebuild)

For a complete graph rebuild from scratch:

cd mcp_server_node/scripts

# Stage 1: Code Graph (Neo4j) — run in order
python3 ingest_fortran_graph.py          # ~10–15 min
python3 ingest_python_graph.py           # ~5 min
python3 ingest_shell_graph_v8.py         # ~3 min
python3 ingest_env_variables.py          # ~2 min
python3 ingest_cross_language_bridges.py # ~1 min
python3 link_docs_to_code.py             # ~2 min

# Stage 2: Code Embeddings (ChromaDB)
python3 ingest_code_v8.py                # ~15–25 min

# Stage 3: Documentation & Standards (ChromaDB)
python3 ingest_documentation_v8.py       # ~10–30 min (web crawl)
python3 ingest_ee2_enhanced_v5.py        # ~1 min
python3 ingest_jjobs_v8.py              # ~3 min

# Stage 4: Community Detection (Neo4j + ChromaDB)
node run_community_detection.js --materialize   # ~2 min
node export_community_contexts.js               # ~1 min
node generate_llm_summaries.js                  # ~30–60 min (LLM rate-limited)
node import_llm_summaries.js                    # ~1 min

Prerequisites:

Neo4j 5.x with GDS plugin running on bolt://localhost:7687
ChromaDB running on http://localhost:8080 (v2 API)
Spack modules loaded (module load python/3.11 py-neo4j py-pip)
sentence-transformers installed (pip install --user sentence-transformers)

MCP Tools That Consume the Graph

Tool Module	Key Tools	Uses
CodeAnalysisTools	`find_dependencies`, `find_callers_callees`, `trace_execution_path`, `trace_full_execution_chain`	Neo4j graph traversal
GraphRAGTools	`get_code_context`, `search_architecture`, `find_similar_code`, `trace_data_flow`, `get_change_impact`	GGSR (Neo4j + ChromaDB)
SemanticSearchTools	`search_documentation`, `explain_with_context`, `find_related_files`	ChromaDB + Neo4j enrichment
EE2ComplianceTools	`analyze_ee2_compliance`, `scan_repository_compliance`	ChromaDB (EE2 collection)
OperationalTools	`get_operational_guidance`, `explain_workflow_component`, `get_job_details`	ChromaDB (docs + jjobs)
SDDWorkflowTools	`validate_sdd_compliance`	Filesystem (SDD framework)

Related Resources

GraphRAG-Hierarchical-Community-Materialization — Detailed milestone report on the community detection pipeline
PHASE_2_HYBRID_ARCHITECTURE_SPECIFICATION — Architecture of the hybrid Neo4j + ChromaDB system
GitHub-MCP-Tools-installed-for-global‐workflow-software-development-and-how-they-work — MCP platform overview
MCP-RAG-Platform-32-Day-Achievement-Synopsis — 32-day achievement summary (v7.10.0 → v7.25.1)
GraphRAG Dashboard — Interactive Neo4j query dashboard (HTML)
