Neo4j GraphRAG Ingestion Pipeline - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Neo4j GraphRAG Ingestion Pipeline

How the Global Workflow Knowledge Graph Gets Built: From Source Code to Queryable Graph

This document describes the complete ingestion pipeline that transforms 40,000+ code entities across Fortran, Python, and Shell into a unified Neo4j knowledge graph with 589K+ relationships, plus ChromaDB vector embeddings for semantic search. All scripts live in mcp_server_node/scripts/.


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                  Source Code (global-workflow)                  │
│   sorc/ (Fortran)  │  ush/ (Python/Shell)  │  dev/jobs/ (J-Jobs)│
└────────┬────────────────────┬────────────────────────┬──────────┘
         │                    │                        │
    ┌────▼────┐          ┌────▼────┐              ┌────▼────┐
    │ Fortran │          │ Python  │              │  Shell  │
    │ Parser  │          │  AST    │              │  Regex  │
    │(fparser2)│         │(stdlib) │              │ Parser  │
    └────┬────┘          └────┬────┘              └────┬────┘
         │                    │                        │
         └──────────┬─────────┴───────────┬────────────┘
                    │                     │
              ┌─────▼──────┐       ┌──────▼──────┐
              │   Neo4j    │       │  ChromaDB   │
              │ (Graph DB) │       │ (Vector DB) │
              │ 41K nodes  │       │ 66K docs    │
              │ 589K rels  │       │5 collections│
              └─────┬──────┘       └──────┬──────┘
                    │                     │
              ┌─────▼─────────────────────▼──────┐
              │   MCP Tools (48 registered)      │
              │   GGSR = Graph-Guided Semantic   │
              │          Retrieval (hybrid)      │
              └──────────────────────────────────┘

Pipeline Stages

The ingestion runs in four sequential stages. Each stage builds on the previous one.

Stage 1: Code Graph Ingestion (Neo4j)

These scripts parse source code and create the structural graph: nodes for every code entity and edges for every call, import, and dependency.

1A. Fortran Graph: ingest_fortran_graph.py

| Property | Value |
| --- | --- |
| Parser | fparser2 (via FortranFileReader; raw strings will fail) |
| Source Dirs | sorc/, sorc/ufs_model.fd, sorc/gsi.fd, sorc/gfs_utils.fd |
| Database | Neo4j only |
| Phase | 10 (Milestone 2) |

Node labels created:

| Label | Description | Example |
| --- | --- | --- |
| FortranModule | Fortran MODULE declarations | module_radlw_main |
| FortranSubroutine | SUBROUTINE definitions | getcon, sfcdrv |
| FortranFunction | FUNCTION definitions | fpvs, plnev_a |
| FortranProgram | Compiled executables (PROGRAM) | gfs_atmos_cubed_sphere |

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:caller)-[:CALLS {line}]->(:FortranSubroutine) | Subroutine/function call |
| (:code)-[:USES {only}]->(:FortranModule) | Fortran USE statement |
| (:module)-[:CONTAINS]->(:subroutine\|function) | Module contains definition |

Run: python3 ingest_fortran_graph.py (or --test FILE for single file, --dry-run for validation)


1B. Python Graph: ingest_python_graph.py

| Property | Value |
| --- | --- |
| Parser | Python stdlib ast module |
| Source Dirs | ush/, dev/, sorc/wxflow, sorc/gdas.cd/ush, sorc/verif-global.fd/ush, sorc/nexus.fd/utils/python |
| Database | Neo4j only |
| Phase | 24F-0 |

Node labels created:

| Label | Description | Example |
| --- | --- | --- |
| PythonModule | Python modules/packages | atm_diag, gfs_tasks |
| PythonClass | Class definitions | GFSForecast, AtmAnalysis |
| PythonFunction | Functions and methods | execute(), initialize() |

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:module)-[:DEFINES]->(:function\|class) | Module defines entity |
| (:class)-[:INHERITS]->(:PythonClass) | Class inheritance |
| (:caller)-[:CALLS {line}]->(:PythonFunction) | Function call |
| (:module)-[:IMPORTS {type, alias}]->(:PythonModule) | Import statement |

Run: python3 ingest_python_graph.py (or --test FILE, --dry-run, --sample for 50-file validation)
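
As a rough illustration of what an ast-based pass can extract, the sketch below walks one source string and collects the raw material for DEFINES, INHERITS, CALLS, and IMPORTS edges. The function name extract_python_entities, the returned dictionary layout, and the sample source are hypothetical; they are not taken from ingest_python_graph.py, and writing the results to Neo4j is left out.

```python
import ast

def extract_python_entities(source: str, module_name: str) -> dict:
    """Collect definitions, base classes, calls, and imports from one module."""
    tree = ast.parse(source)
    found = {"module": module_name, "defines": [], "inherits": [],
             "calls": [], "imports": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["defines"].append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            found["defines"].append(("class", node.name, node.lineno))
            for base in node.bases:
                if isinstance(base, ast.Name):  # simple `class A(B)` bases only
                    found["inherits"].append((node.name, base.id))
        elif isinstance(node, ast.Call):
            func = node.func
            # f(...) is an ast.Name; obj.f(...) is an ast.Attribute
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name:
                found["calls"].append((name, node.lineno))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                found["imports"].append(("import", alias.name, alias.asname))
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(("from", node.module, None))
    return found

# Made-up sample module, used only to exercise the function.
sample = '''
import os
from wxflow import Task

class AtmAnalysis(Task):
    def initialize(self):
        os.getcwd()
'''
entities = extract_python_entities(sample, "atm_analysis")
```

Because the source is only parsed and never imported, a pass like this can run over files whose dependencies are not installed.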


1C. Shell Graph: ingest_shell_graph_v8.py

| Property | Value |
| --- | --- |
| Parser | Regex-based (no AST for shell) |
| Source Dirs | dev/jobs/ (J-Jobs), dev/scripts/ (ex-scripts), ush/, scripts/ |
| Database | Neo4j only |
| Phase | 27B |

Node labels created:

| Label | Description | Example |
| --- | --- | --- |
| ShellScript | Shell scripts with category metadata | JGFS_ATMOS_ANALYSIS |
| EnvironmentVariable | Environment variable declarations | HOMEgfs, DATAROOT |
| ConfigFile | Configuration file references | config.base, config.fcst |

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:script)-[:SOURCES]->(:other_script) | source / . directives |
| (:script)-[:INVOKES]->(:ex_script) | Script invocation |
| (:script)-[:READS_CONFIG]->(:config) | Config file reads |
| (:script)-[:EXPORTS]->(:env_var) | export VAR=value |
| (:script)-[:DEPENDS_ON_ENV]->(:env_var) | ${VAR} / $VAR usage |

Script categories: j-job, ex-script, ush-script, legacy-script

Run: python3 ingest_shell_graph_v8.py
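
A minimal sketch of the regex approach, assuming that script categories follow from the source directory and that SOURCES edges come from lines starting with source or a lone dot. The prefix-to-category mapping, both function names, and the sample script are assumptions made for illustration; the actual rules in ingest_shell_graph_v8.py may differ.

```python
import re

# Matches `source FILE` or `. FILE` at the start of a line.
SOURCE_RE = re.compile(r"^\s*(?:source|\.)\s+(\S+)", re.MULTILINE)

# Assumed mapping from directory prefix to script category.
CATEGORY_BY_PREFIX = [
    ("dev/jobs/", "j-job"),
    ("dev/scripts/", "ex-script"),
    ("ush/", "ush-script"),
    ("scripts/", "legacy-script"),
]

def categorize(path: str):
    """Return the category for a repo-relative path, or None if unknown."""
    for prefix, category in CATEGORY_BY_PREFIX:
        if path.startswith(prefix):
            return category
    return None

def find_sourced_files(script_text: str) -> list:
    """List the targets of source / . directives, with quotes stripped."""
    return [target.strip('"') for target in SOURCE_RE.findall(script_text)]

# Made-up sample script; the file names are placeholders.
sample_script = '''source "${HOMEgfs}/ush/example_header.sh"
. ./config.base
./run_something.sh
'''
sourced = find_sourced_files(sample_script)
```

The third sample line is a plain invocation rather than a source directive, so it is not reported here; it would belong to the INVOKES pass instead.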


1D. Environment Variables: ingest_env_variables.py

| Property | Value |
| --- | --- |
| Source Dirs | dev/jobs, jobs, ush, scripts |
| Database | Neo4j only |
| Phase | 24 Gap 1 |

Supplements the shell graph with detailed environment variable tracking. Detects export VAR=value, plain VAR=value, and all ${VAR} / $VAR references.

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:script)-[:EXPORTS]->(:EnvironmentVariable) | export declaration |
| (:script)-[:SETS]->(:EnvironmentVariable) | assignment without export |
| (:script)-[:DEPENDS_ON_ENV]->(:EnvironmentVariable) | variable reference |

Run: python3 ingest_env_variables.py (or --test FILE, --var VARNAME to query one variable)
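
The three detection cases can be sketched with three regular expressions. This is an illustrative approximation, not the script's actual patterns: the names EXPORT_RE, ASSIGN_RE, REF_RE, and classify_env_usage are hypothetical, and the sketch does not skip comments or quoted text.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
EXPORT_RE = re.compile(r"^\s*export\s+(" + NAME + ")", re.MULTILINE)
ASSIGN_RE = re.compile(r"^\s*(" + NAME + ")=", re.MULTILINE)
REF_RE = re.compile(r"\$\{?(" + NAME + ")")

def classify_env_usage(script_text: str) -> dict:
    """Group variable names by the relationship type they would produce."""
    exports = set(EXPORT_RE.findall(script_text))
    # Plain assignments that are not also exported become SETS edges.
    sets = set(ASSIGN_RE.findall(script_text)) - exports
    refs = set(REF_RE.findall(script_text))
    return {"EXPORTS": exports, "SETS": sets, "DEPENDS_ON_ENV": refs}

# Made-up sample script.
sample_script = '''export HOMEgfs=/opt/gfs
DATA=${DATAROOT}/run
cd $DATA
echo ${HOMEgfs}
'''
usage = classify_env_usage(sample_script)
```

Positional parameters such as $1 are ignored because the name pattern requires a leading letter or underscore.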


1E. Cross-Language Bridges: ingest_cross_language_bridges.py

| Property | Value |
| --- | --- |
| Database | Neo4j only |
| Phase | 24F-2, 27I, 27J |

This is the critical script that connects the three language graphs together. It finds where shell scripts launch Fortran executables ($EXECgfs/gfs_atmos.x) and invoke Python modules, creating edges that let you trace execution across language boundaries.

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:File)-[:EXECUTES {executable, line}]->(:FortranProgram) | Shell → Fortran executable launch |
| (:File)-[:INVOKES {script, line}]->(:PythonModule) | Shell → Python invocation |

Executable patterns detected:

  • $EXECgfs/name.x, ${HOMEgfs}/exec/name.x
  • Config-defined: $FCSTEXEC, $APRUNFC $exec_name
  • WW3 patterns: ${NET,,}_ww3_*.x
  • UFS model: gfs_model.x, gefs_model.x, sfs_model.x

Creates placeholder FortranProgram nodes for external packages (GSI, UFS_UTILS, Fit2Obs, WW3) when the actual Fortran source is not in the repo.

Run: python3 ingest_cross_language_bridges.py (or --dry-run, --verbose)
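
To make the first pattern family concrete, here is a small sketch that scans a shell script for $EXECgfs/name.x and ${HOMEgfs}/exec/name.x launches and records the executable name and line number, the two properties carried by the EXECUTES edge. It covers only those two patterns; config-defined variables, WW3 names, and UFS model names are not handled. The function name, the pattern list, and demo_tool.x are hypothetical.

```python
import re

# Only the two literal path patterns; other families are out of scope here.
EXEC_PATTERNS = [
    re.compile(r"\$\{?EXECgfs\}?/([A-Za-z0-9_]+\.x)"),
    re.compile(r"\$\{?HOMEgfs\}?/exec/([A-Za-z0-9_]+\.x)"),
]

def find_executions(script_text: str) -> list:
    """Return one {executable, line} record per detected launch."""
    edges = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comment lines (and the shebang)
        for pattern in EXEC_PATTERNS:
            for match in pattern.finditer(line):
                edges.append({"executable": match.group(1), "line": lineno})
    return edges

# Made-up sample script; demo_tool.x is a placeholder name.
sample_script = '''#!/bin/bash
# run $EXECgfs/ignored.x
${APRUN} $EXECgfs/gfs_atmos.x
"${HOMEgfs}/exec/demo_tool.x" < input.nml
'''
edges = find_executions(sample_script)
```

Each record maps onto the properties of one EXECUTES relationship; resolving the name to a FortranProgram node, or creating a placeholder, would be a separate step.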


1F. Documentation–Code Links: link_docs_to_code.py

| Property | Value |
| --- | --- |
| Database | Neo4j |
| Phase | Week 3 Plan Phase 3 |

Creates DOC_DESCRIBES relationships between documentation chunks and code entities they reference, using regex pattern matching on file names, function names, job names, and script paths.

Run: python3 link_docs_to_code.py
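
One way such matching could work is to test each known entity name against a documentation chunk with word-boundary guards, so that getcon does not match inside a longer identifier. The sketch below assumes the entity names are already available as a list; find_doc_links and the sample sentence are hypothetical and do not reproduce the script's actual patterns.

```python
import re

def find_doc_links(chunk_text: str, known_entities: list) -> list:
    """Return the entity names mentioned as whole tokens in a doc chunk."""
    links = []
    for name in known_entities:
        # Guard both sides so partial identifiers are not matched.
        pattern = r"(?<![\w.])" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, chunk_text):
            links.append(name)
    return links

# Made-up documentation sentence using names from the tables above.
chunk = "The JGFS_ATMOS_ANALYSIS job reads config.base before calling getcon."
candidates = ["JGFS_ATMOS_ANALYSIS", "config.base", "getcon", "fpvs", "config.fcst"]
links = find_doc_links(chunk, candidates)
```

Each returned name would correspond to one DOC_DESCRIBES relationship from the chunk to the matching code entity.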


Stage 2: Code Vector Embeddings (ChromaDB)

These scripts create searchable vector embeddings of source code, enabling semantic code search ("find code that handles precipitation accumulation").

2A. Code Embeddings: ingest_code_v8.py

| Property | Value |
| --- | --- |
| Embedding Model | all-mpnet-base-v2 (768 dimensions) |
| Collection | code-with-context-v8-0-0 (~58,761 documents) |
| Database | ChromaDB only |

Chunks Python and Shell source code along AST-aware semantic boundaries (200–2000 characters per chunk), with 3 lines of context before and after each chunk. The script reads graph-enrichment metadata from Neo4j but does not write to Neo4j.
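
A minimal sketch of boundary-aware chunking for Python sources, assuming each top-level function or class becomes one chunk. How the real script treats chunks outside the 200–2000 character range is not stated, so this sketch only flags them with a size_ok field. The function name, the constants, and the output layout are hypothetical.

```python
import ast

MIN_CHARS, MAX_CHARS, CONTEXT_LINES = 200, 2000, 3

def chunk_python_source(source: str) -> list:
    """Create one chunk per top-level def/class, with surrounding context."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        start, end = node.lineno - 1, node.end_lineno  # 0-based slice bounds
        text = "\n".join(lines[start:end])
        chunks.append({
            "name": node.name,
            "text": text,
            "context_before": lines[max(0, start - CONTEXT_LINES):start],
            "context_after": lines[end:end + CONTEXT_LINES],
            "size_ok": MIN_CHARS <= len(text) <= MAX_CHARS,  # flag only
        })
    return chunks

# Made-up sample source.
sample = "\n".join([
    "import os", "", "# helper", "X = 1",
    "def small():", "    return X",
    "", "Y = 2", "Z = 3", "W = 4", "V = 5",
])
chunks = chunk_python_source(sample)
```

Keeping the context lines separate from the chunk text leaves open whether they are embedded with the chunk or stored only as metadata.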

2B. Graph-Enriched Code Embeddings: ingest_code_graph_enriched_v6.py

| Property | Value |
| --- | --- |
| Collection | code-graph-v6-0-0 |
| Database | Neo4j + ChromaDB (hybrid) |

Creates enriched embeddings that include dependency context from the Neo4j graph in the embedding metadata, so that code searches return results weighted by structural importance.

Run: python3 ingest_code_graph_enriched_v6.py --directory PATH --collection NAME


Stage 3: Documentation & Standards Embeddings (ChromaDB)

3A. Documentation: ingest_documentation_v8.py

| Property | Value |
| --- | --- |
| Sources | 16 web-crawled sites (Global Workflow RTD, EE2, UFS, Rocoto, ecFlow, Spack, JEDI, etc.) |
| Collection | global-workflow-docs-v8-0-0 (~5,409 documents) |
| Database | ChromaDB only |
| Embedding Model | all-mpnet-base-v2 (768 dimensions) |

Source URLs are defined in documentation_sources_config.py, the single point of truth (SPOT) for all documentation targets. The script tags documents with priority tiers: tier1_critical, tier2_important, tier3_supplementary.

3B. EE2 Standards: ingest_ee2_enhanced_v5.py

| Property | Value |
| --- | --- |
| Sources | RST files in sdd_framework/phase2_annotations/ |
| Collection | ee2-standards-v5-0-0-enhanced (~34 documents) |
| Database | ChromaDB only |

Parses custom RST directives (mcp:anti_pattern::, mcp:correct_pattern::, mcp:ai_guidance_rule::, etc.) into structured compliance rules with severity levels and false-positive rates.
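
The exact layout of these directives is not shown here, so the following is only a sketch under stated assumptions: each rule is written as a standard RST directive line (`.. mcp:anti_pattern:: argument`) followed by indented `:field: value` options. The option names severity and false_positive_rate, the rule identifiers, and the function name are all hypothetical.

```python
import re

DIRECTIVE_RE = re.compile(r"^\.\. (mcp:[a-z_]+)::\s*(.*)$")
OPTION_RE = re.compile(r"^\s+:([a-z_]+):\s*(.*)$")

def parse_mcp_directives(rst_text: str) -> list:
    """Collect mcp:* directives and their indented field options."""
    rules, current = [], None
    for line in rst_text.splitlines():
        directive = DIRECTIVE_RE.match(line)
        if directive:
            current = {"directive": directive.group(1),
                       "argument": directive.group(2), "options": {}}
            rules.append(current)
            continue
        option = OPTION_RE.match(line)
        if option and current is not None:
            current["options"][option.group(1)] = option.group(2)
        elif line and not line[0].isspace():
            current = None  # unindented text ends the directive block
    return rules

# Made-up annotation file; rule names and option fields are placeholders.
sample_rst = '''Some heading
============

.. mcp:anti_pattern:: hardcoded-path
   :severity: high
   :false_positive_rate: 0.05

   Body text.

.. mcp:correct_pattern:: use-homegfs
   :severity: info
'''
rules = parse_mcp_directives(sample_rst)
```

Option values are kept as strings; converting a rate to a float or validating severity levels would happen in a later step.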

3C. J-Jobs: ingest_jjobs_v8.py

| Property | Value |
| --- | --- |
| Sources | J-Job scripts in dev/jobs/ |
| Collection | jjobs-v8-0-0 (~700 documents) |
| Database | ChromaDB only |

Creates vector embeddings of J-Job scripts with structured metadata (inputs, outputs, calls, configs, env_vars).

3D. Master Re-Ingestion: reingest_all_with_phase2.sh

Orchestration wrapper that runs the full documentation + EE2 pipeline in order:

  1. ingest_documentation_week3.py (standard web crawl)
  2. ingest_ee2_enhanced_v5.py (Phase 2 annotations)
  3. generatePhase2Config.js (generates phase2_anti_patterns.json)
  4. Validation checks

Run: ./reingest_all_with_phase2.sh <collection_name>

Use this only for full system refreshes, such as after an embedding model change. It is not intended for routine updates.


Stage 4: Community Detection & Summarization (Neo4j + ChromaDB)

This is the most advanced stage: it discovers emergent structure in the code graph using graph algorithms, then generates natural-language summaries of each community using LLMs.

Step 1: Community Detection: run_community_detection.js

| Property | Value |
| --- | --- |
| Algorithm | Leiden (via Neo4j Graph Data Science plugin) |
| Database | Neo4j |
| Phase | 24E |

Runs the GDS Leiden algorithm on the code graph to discover communities of tightly-coupled code entities. With --materialize, creates a 4-level hierarchy:

| Level | Count | Granularity |
| --- | --- | --- |
| L0 | 694 | Fine-grained function clusters |
| L1 | 175 | Module-level groupings |
| L2 | 86 | Subsystem groupings |
| L3 | 81 | Major system components |

Node labels created:

| Label | Properties |
| --- | --- |
| Community | communityId, level, memberCount, name, languages, keyMembers, summary, summarySource, summaryModel |

Relationship types created:

| Relationship | Meaning |
| --- | --- |
| (:node)-[:MEMBER_OF]->(:Community) | Entity belongs to community |
| (:Community)-[:PARENT_OF]->(:Community) | Hierarchy (child→parent) |
| (:Community)-[:INTERACTS_WITH {strength}]->(:Community) | Cross-community coupling |

Run: node run_community_detection.js --materialize (full hierarchy)
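
The materialization step can be pictured as turning per-node community assignments into membership and hierarchy edges. The sketch below assumes each node already carries a list of community ids ordered from finest (L0) to coarsest (L3); the GDS Leiden procedures can return such a list when includeIntermediateCommunities is enabled, though whether run_community_detection.js uses that option is an assumption. The function name and the tuple layout are hypothetical, and hierarchy pairs are stored child first, following the direction given in the relationship table.

```python
def build_hierarchy(assignments: dict) -> tuple:
    """assignments maps node id -> [L0 id, L1 id, ...] (finest level first).

    Returns (member_of, hierarchy), where communities are (level, id) pairs.
    """
    member_of = set()   # (node, community) at the finest level only
    hierarchy = set()   # (child community, parent community)
    for node, levels in assignments.items():
        member_of.add((node, (0, levels[0])))
        for level in range(len(levels) - 1):
            child = (level, levels[level])
            parent = (level + 1, levels[level + 1])
            hierarchy.add((child, parent))
    return member_of, hierarchy

# Made-up assignments for three Fortran routines across four levels.
assignments = {
    "getcon": [3, 1, 0, 0],
    "sfcdrv": [3, 1, 0, 0],
    "fpvs":   [7, 1, 0, 0],
}
member_of, hierarchy = build_hierarchy(assignments)
```

Using sets deduplicates the hierarchy edges that many nodes share, so each community pair would be written to Neo4j once.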

Step 2: Export Contexts: export_community_contexts.js

Extracts community metadata from Neo4j into data/community_contexts.json for LLM processing. Each community context includes members, internal/external relationships, child summaries, and language distribution.

Run: node export_community_contexts.js

Step 3: Generate LLM Summaries: generate_llm_summaries.js

Sends each community context to GitHub Models API for natural-language summarization. Processes bottom-up (L0 first) so parent communities can reference child summaries. Uses a rotating pool of models (Ministral-3B, Cohere, Llama-3.3-70B, Codestral, Mistral-small, Phi-4). Rate-limited to 12 req/min.

Output: data/llm_summaries.json

Run: node generate_llm_summaries.js (resume-safe: skips already-summarized communities)
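
The scheduling logic described in this step (bottom-up order, a rotating model pool, and a 12 requests/minute cap) can be sketched without any network calls. The real script is written in Node.js; this Python sketch only plans the requests, and plan_requests, the send_at field, and the sample community ids are hypothetical. The model names are the display names listed above, not API identifiers.

```python
from itertools import cycle

MODELS = ["Ministral-3B", "Cohere", "Llama-3.3-70B",
          "Codestral", "Mistral-small", "Phi-4"]
REQUESTS_PER_MINUTE = 12

def plan_requests(communities: list) -> list:
    """Order communities bottom-up and assign a model and earliest send time."""
    interval = 60.0 / REQUESTS_PER_MINUTE  # 5 seconds between requests
    pool = cycle(MODELS)
    # sorted() is stable, so communities keep their order within a level.
    ordered = sorted(communities, key=lambda c: c["level"])
    return [
        {"communityId": c["communityId"], "level": c["level"],
         "model": next(pool), "send_at": i * interval}
        for i, c in enumerate(ordered)
    ]

# Made-up communities, deliberately listed out of order.
communities = [
    {"communityId": "L1-0", "level": 1},
    {"communityId": "L0-0", "level": 0},
    {"communityId": "L0-1", "level": 0},
]
plan = plan_requests(communities)
```

Processing L0 first means that by the time a parent community is summarized, its children's summaries already exist and can be included in the prompt.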

Step 4: Import Summaries: import_llm_summaries.js

Imports the generated summaries back into Neo4j (updating Community.summary property) and into ChromaDB (community-summaries collection, ~1,648 documents).

Run: node import_llm_summaries.js (or --skip-chromadb, --skip-neo4j to target one database)


Node.js Ingestion Modules

In addition to the standalone Python scripts, src/ingestion/neo4j/ contains Node.js modules used by the MCP server for on-demand graph operations:

| Module | Purpose |
| --- | --- |
| Neo4jClient.js | Connection pooling, transactions, query/write interface |
| GraphSchema.js | Complete node/relationship schema definitions |
| CodeStructureIngester.js | Batch code parsing (Python AST, Shell/Fortran regex) |
| CMakeGraphIngester.js | Build system dependency graph from CMakeLists.txt |
| SubmoduleGraphIngester.js | Git submodule dependency mapping |
| GitHubGraphIngester.js | GitHub metadata (issues, PRs, contributors) |

Data Access Layer

The unified data access layer connects Neo4j and ChromaDB to the MCP tools:

| Module | Location | Purpose |
| --- | --- | --- |
| GraphDatabase.js | src/data/ | Neo4j query interface (query() for reads, write() for mutations) |
| VectorDatabase.js | src/data/ | ChromaDB v2 API wrapper (6+ collections) |
| UnifiedDataAccess.js | src/data/ | GGSR: Graph-Guided Semantic Retrieval (hybrid Neo4j + ChromaDB fusion) |

GGSR (Graph-Guided Semantic Retrieval) is the key innovation: when you search for code, GGSR queries ChromaDB for semantic matches, then traverses the Neo4j graph to find structurally related entities, and fuses the results with configurable weight blending.
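
The fusion step can be illustrated as a weighted blend of two score maps. This is a sketch of the general idea, not the UnifiedDataAccess.js implementation: it assumes both stages produce scores normalized to the range 0 to 1, and the function name, the graph_weight parameter, and its default of 0.4 are hypothetical.

```python
def fuse_results(semantic: dict, graph: dict, graph_weight: float = 0.4) -> list:
    """Blend two score maps (entity -> score in [0, 1]) into one ranking."""
    combined = {}
    for entity in set(semantic) | set(graph):
        s = semantic.get(entity, 0.0)  # missing from vector search
        g = graph.get(entity, 0.0)     # missing from graph traversal
        combined[entity] = (1.0 - graph_weight) * s + graph_weight * g
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Made-up scores: sfcdrv is the best semantic match, getcon is graph-central.
semantic_scores = {"sfcdrv": 0.9, "getcon": 0.5}
graph_scores = {"getcon": 1.0, "fpvs": 0.5}
ranked = fuse_results(semantic_scores, graph_scores)
```

With these inputs getcon overtakes sfcdrv because it scores in both stages, which is the behavior a hybrid ranking is meant to produce; setting graph_weight to 0 reduces the result to pure semantic search.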


Current Graph Statistics (March 2026)

Neo4j Node Counts

| Label | Count | Source Script |
| --- | --- | --- |
| FortranSubroutine | ~13,537 | ingest_fortran_graph.py |
| File | ~11,016 | Multiple ingestion scripts |
| PythonFunction | ~3,267 | ingest_python_graph.py |
| EnvironmentVariable | ~2,730 | ingest_env_variables.py + ingest_shell_graph_v8.py |
| Community | ~1,036 | run_community_detection.js |
| ShellScript | ~1,000+ | ingest_shell_graph_v8.py |
| PythonModule | ~624 | ingest_python_graph.py |
| FortranFunction | ~500 | ingest_fortran_graph.py |
| PythonClass | ~248 | ingest_python_graph.py |
| FortranProgram | ~200+ | ingest_fortran_graph.py |
| FortranModule | ~150 | ingest_fortran_graph.py |
| Total | ~41,355 | |

Neo4j Relationship Counts

| Type | Count | Meaning |
| --- | --- | --- |
| CALLS | ~439,919 | Function/subroutine calls |
| USES | ~91,285 | Fortran USE statements |
| IMPORTS | ~50,000+ | Python import statements |
| MEMBER_OF | ~21,559 | Community membership |
| DEFINES | ~15,000+ | Module defines entity |
| DEPENDS_ON_ENV | ~6,007 | Script uses env var |
| INTERACTS_WITH | ~5,000+ | Cross-community coupling |
| EXPORTS | ~1,669 | Script exports env var |
| PARENT_OF | ~978 | Community hierarchy |
| EXECUTES | ~16 | Shell → Fortran launch |
| Total | ~589,396 | |

ChromaDB Collections

| Collection | Documents | Source Script |
| --- | --- | --- |
| code-with-context-v8-0-0 | 58,761 | ingest_code_v8.py |
| global-workflow-docs-v8-0-0 | 5,409 | ingest_documentation_v8.py |
| community-summaries | 1,648 | import_llm_summaries.js |
| jjobs-v8-0-0 | 700 | ingest_jjobs_v8.py |
| ee2-standards-v5-0-0-enhanced | 34 | ingest_ee2_enhanced_v5.py |
| Total | ~66,552 | |

Embedding Model: all-mpnet-base-v2 (768 dimensions). All collections must use the same model.


Execution Order (Full Rebuild)

For a complete graph rebuild from scratch:

cd mcp_server_node/scripts

# Stage 1: Code Graph (Neo4j), run in order
python3 ingest_fortran_graph.py          # ~10–15 min
python3 ingest_python_graph.py           # ~5 min
python3 ingest_shell_graph_v8.py         # ~3 min
python3 ingest_env_variables.py          # ~2 min
python3 ingest_cross_language_bridges.py # ~1 min
python3 link_docs_to_code.py             # ~2 min

# Stage 2: Code Embeddings (ChromaDB)
python3 ingest_code_v8.py                # ~15–25 min

# Stage 3: Documentation & Standards (ChromaDB)
python3 ingest_documentation_v8.py       # ~10–30 min (web crawl)
python3 ingest_ee2_enhanced_v5.py        # ~1 min
python3 ingest_jjobs_v8.py              # ~3 min

# Stage 4: Community Detection (Neo4j + ChromaDB)
node run_community_detection.js --materialize   # ~2 min
node export_community_contexts.js               # ~1 min
node generate_llm_summaries.js                  # ~30–60 min (LLM rate-limited)
node import_llm_summaries.js                    # ~1 min

Prerequisites:

  • Neo4j 5.x with GDS plugin running on bolt://localhost:7687
  • ChromaDB running on http://localhost:8080 (v2 API)
  • Spack modules loaded (module load python/3.11 py-neo4j py-pip)
  • sentence-transformers installed (pip install --user sentence-transformers)

MCP Tools That Consume the Graph

| Tool Module | Key Tools | Uses |
| --- | --- | --- |
| CodeAnalysisTools | find_dependencies, find_callers_callees, trace_execution_path, trace_full_execution_chain | Neo4j graph traversal |
| GraphRAGTools | get_code_context, search_architecture, find_similar_code, trace_data_flow, get_change_impact | GGSR (Neo4j + ChromaDB) |
| SemanticSearchTools | search_documentation, explain_with_context, find_related_files | ChromaDB + Neo4j enrichment |
| EE2ComplianceTools | analyze_ee2_compliance, scan_repository_compliance | ChromaDB (EE2 collection) |
| OperationalTools | get_operational_guidance, explain_workflow_component, get_job_details | ChromaDB (docs + jjobs) |
| SDDWorkflowTools | validate_sdd_compliance | Filesystem (SDD framework) |

Related Resources