Neo4j GraphRAG Ingestion Pipeline - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Neo4j GraphRAG Ingestion Pipeline
How the Global Workflow Knowledge Graph Gets Built: From Source Code to Queryable Graph
This document describes the complete ingestion pipeline that transforms 40,000+ code entities across Fortran, Python, and Shell into a unified Neo4j knowledge graph with 589K+ relationships, plus ChromaDB vector embeddings for semantic search. All scripts live in
mcp_server_node/scripts/.
Architecture Overview
+------------------------------------------------------------------+
|                  Source Code (global-workflow)                   |
|  sorc/ (Fortran)  |  ush/ (Python/Shell)  |  dev/jobs/ (J-Jobs)  |
+---------+---------------------+-----------------------+----------+
          |                     |                       |
    +-----v-----+         +-----v-----+           +-----v-----+
    |  Fortran  |         |  Python   |           |   Shell   |
    |  Parser   |         |    AST    |           |   Regex   |
    | (fparser2)|         | (stdlib)  |           |  Parser   |
    +-----+-----+         +-----+-----+           +-----+-----+
          |                     |                       |
          +-----------+---------+-------------+---------+
                      |                       |
              +-------v-------+       +-------v-------+
              |     Neo4j     |       |   ChromaDB    |
              |  (Graph DB)   |       |  (Vector DB)  |
              |   41K nodes   |       |   66K docs    |
              |   589K rels   |       | 5 collections |
              +-------+-------+       +-------+-------+
                      |                       |
              +-------v-----------------------v-------+
              |       MCP Tools (48 registered)       |
              |     GGSR = Graph-Guided Semantic      |
              |          Retrieval (hybrid)           |
              +---------------------------------------+
Pipeline Stages
The ingestion runs in four sequential stages. Each stage builds on the previous one.
Stage 1: Code Graph Ingestion (Neo4j)
These scripts parse source code and create the structural graph: nodes for every code entity and edges for every call, import, and dependency.
1A. Fortran Graph: ingest_fortran_graph.py
| Property | Value |
|---|---|
| Parser | fparser2 (via FortranFileReader; raw strings will fail) |
| Source Dirs | sorc/, sorc/ufs_model.fd, sorc/gsi.fd, sorc/gfs_utils.fd |
| Database | Neo4j only |
| Phase | 10 (Milestone 2) |
Node labels created:
| Label | Description | Example |
|---|---|---|
| FortranModule | Fortran MODULE declarations | module_radlw_main |
| FortranSubroutine | SUBROUTINE definitions | getcon, sfcdrv |
| FortranFunction | FUNCTION definitions | fpvs, plnev_a |
| FortranProgram | Compiled executables (PROGRAM) | gfs_atmos_cubed_sphere |
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:caller)-[:CALLS {line}]->(:FortranSubroutine) | Subroutine/function call |
| (:code)-[:USES {only}]->(:FortranModule) | Fortran USE statement |
| (:module)-[:CONTAINS]->(:subroutine\|function) | Module contains definition |
Run: python3 ingest_fortran_graph.py (or --test FILE for single file, --dry-run for validation)
1B. Python Graph: ingest_python_graph.py
| Property | Value |
|---|---|
| Parser | Python stdlib ast module |
| Source Dirs | ush/, dev/, sorc/wxflow, sorc/gdas.cd/ush, sorc/verif-global.fd/ush, sorc/nexus.fd/utils/python |
| Database | Neo4j only |
| Phase | 24F-0 |
Node labels created:
| Label | Description | Example |
|---|---|---|
| PythonModule | Python modules/packages | atm_diag, gfs_tasks |
| PythonClass | Class definitions | GFSForecast, AtmAnalysis |
| PythonFunction | Functions and methods | execute(), initialize() |
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:module)-[:DEFINES]->(:function\|class) | Module defines entity |
| (:class)-[:INHERITS]->(:PythonClass) | Class inheritance |
| (:caller)-[:CALLS {line}]->(:PythonFunction) | Function call |
| (:module)-[:IMPORTS {type, alias}]->(:PythonModule) | Import statement |
Run: python3 ingest_python_graph.py (or --test FILE, --dry-run, --sample for 50-file validation)
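The edge types listed above can be illustrated with a minimal sketch built only on the stdlib ast module. This is not the code in ingest_python_graph.py; the function names, the tuple layout, and the sample module name "atm" are assumptions made for illustration, and the real script writes to Neo4j rather than returning tuples.

```python
import ast

def _calls(func_node, caller):
    """Collect CALLS edges for every call expression inside one function."""
    edges = []
    for sub in ast.walk(func_node):
        if not isinstance(sub, ast.Call):
            continue
        target = sub.func
        if isinstance(target, ast.Name):         # e.g. helper(1)
            callee = target.id
        elif isinstance(target, ast.Attribute):  # e.g. os.path.join(...)
            callee = target.attr
        else:
            continue
        edges.append(("CALLS", caller, callee, {"line": sub.lineno}))
    return edges

def extract_edges(source, module_name):
    """Return (relationship, from, to, properties) tuples for one module."""
    edges = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.append(("IMPORTS", module_name, alias.name,
                              {"type": "import", "alias": alias.asname}))
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                edges.append(("IMPORTS", module_name, node.module or "",
                              {"type": "from", "alias": alias.asname}))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            edges.append(("DEFINES", module_name, node.name, {}))
            edges.extend(_calls(node, node.name))
        elif isinstance(node, ast.ClassDef):
            edges.append(("DEFINES", module_name, node.name, {}))
            for base in node.bases:
                if isinstance(base, ast.Name):
                    edges.append(("INHERITS", node.name, base.id, {}))
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    qualified = f"{node.name}.{item.name}"
                    edges.append(("DEFINES", module_name, qualified, {}))
                    edges.extend(_calls(item, qualified))
    return edges

# Hypothetical input used only to exercise the sketch.
SAMPLE = '''
import os
from wxflow import Task as BaseTask

class AtmAnalysis(BaseTask):
    def initialize(self):
        helper(1)

def helper(x):
    return os.path.join("a", str(x))
'''

edges = extract_edges(SAMPLE, "atm")
```

Callee names are recorded unresolved here (the attribute or bare name only); resolving them to specific PythonFunction nodes is a separate step that this sketch does not attempt.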
1C. Shell Graph: ingest_shell_graph_v8.py
| Property | Value |
|---|---|
| Parser | Regex-based (no AST for shell) |
| Source Dirs | dev/jobs/ (J-Jobs), dev/scripts/ (ex-scripts), ush/, scripts/ |
| Database | Neo4j only |
| Phase | 27B |
Node labels created:
| Label | Description | Example |
|---|---|---|
| ShellScript | Shell scripts with category metadata | JGFS_ATMOS_ANALYSIS |
| EnvironmentVariable | Environment variable declarations | HOMEgfs, DATAROOT |
| ConfigFile | Configuration file references | config.base, config.fcst |
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:script)-[:SOURCES]->(:other_script) | source / . directives |
| (:script)-[:INVOKES]->(:ex_script) | Script invocation |
| (:script)-[:READS_CONFIG]->(:config) | Config file reads |
| (:script)-[:EXPORTS]->(:env_var) | export VAR=value |
| (:script)-[:DEPENDS_ON_ENV]->(:env_var) | ${VAR} / $VAR usage |
Script categories: j-job, ex-script, ush-script, legacy-script
Run: python3 ingest_shell_graph_v8.py
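As a rough illustration of regex-based shell parsing, the sketch below extracts SOURCES, READS_CONFIG, and INVOKES edges and assigns a script category. The regular expressions, the function names, and the mapping of ush/ and scripts/ to categories are assumptions for this example, not the patterns used by ingest_shell_graph_v8.py.

```python
import re

# source / . directives at the start of a line (optionally quoted target).
SOURCE_RE = re.compile(r'^\s*(?:source|\.)\s+"?([^\s";]+)', re.MULTILINE)
# A line that starts with $VAR/...sh or ${VAR}/...sh is treated as an invocation.
INVOKE_RE = re.compile(r'^\s*"?(\$\{?\w+\}?/[\w./-]+\.sh)\b', re.MULTILINE)

# Assumed mapping from source directory to script category.
CATEGORY_BY_PREFIX = [
    ("dev/jobs/", "j-job"),
    ("dev/scripts/", "ex-script"),
    ("ush/", "ush-script"),
    ("scripts/", "legacy-script"),
]

def categorize(path):
    """Return the script category for a repo-relative path, or None."""
    for prefix, category in CATEGORY_BY_PREFIX:
        if path.startswith(prefix):
            return category
    return None

def shell_edges(path, text):
    """Return (relationship, script, target) tuples for one shell script."""
    name = path.rsplit("/", 1)[-1]
    edges = []
    for target in SOURCE_RE.findall(text):
        base = target.rsplit("/", 1)[-1]
        if base.startswith("config."):
            edges.append(("READS_CONFIG", name, base))  # sourced config file
        else:
            edges.append(("SOURCES", name, target))
    for target in INVOKE_RE.findall(text):
        edges.append(("INVOKES", name, target))
    return edges

# Hypothetical script body used only to exercise the sketch.
SAMPLE = '''
source "${HOMEgfs}/ush/preamble.sh"
source "${EXPDIR}/config.base"
"${SCRgfs}/exglobal_atmos_analysis.sh"
'''

edges = shell_edges("dev/jobs/JGFS_ATMOS_ANALYSIS", SAMPLE)
```

Because shell has no AST available here, patterns like these will miss dynamically constructed paths and will not skip text inside heredocs or comments; a production parser needs extra guards for those cases.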
1D. Environment Variables: ingest_env_variables.py
| Property | Value |
|---|---|
| Source Dirs | dev/jobs, jobs, ush, scripts |
| Database | Neo4j only |
| Phase | 24 Gap 1 |
Supplements the shell graph with detailed environment variable tracking. Detects export VAR=value, plain VAR=value, and all ${VAR} / $VAR references.
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:script)-[:EXPORTS]->(:EnvironmentVariable) | export declaration |
| (:script)-[:SETS]->(:EnvironmentVariable) | assignment without export |
| (:script)-[:DEPENDS_ON_ENV]->(:EnvironmentVariable) | variable reference |
Run: python3 ingest_env_variables.py (or --test FILE, --var VARNAME to query one variable)
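The three detection cases (export, plain assignment, reference) can be sketched with three regular expressions. The patterns and the classify_env_usage name below are illustrative assumptions, not the ones in ingest_env_variables.py, and the sketch does not try to skip comments or heredocs.

```python
import re

EXPORT_RE = re.compile(r'^\s*export\s+([A-Za-z_]\w*)', re.MULTILINE)
ASSIGN_RE = re.compile(r'^\s*([A-Za-z_]\w*)=', re.MULTILINE)
REF_RE = re.compile(r'\$\{?([A-Za-z_]\w*)')  # matches $VAR and ${VAR...}

def classify_env_usage(text):
    """Group variable names by the relationship they would produce."""
    exports = set(EXPORT_RE.findall(text))
    sets = set(ASSIGN_RE.findall(text)) - exports  # plain VAR=value only
    refs = set(REF_RE.findall(text))
    return {"EXPORTS": exports, "SETS": sets, "DEPENDS_ON_ENV": refs}

# Hypothetical script fragment used only to exercise the sketch.
SAMPLE = '''
export DATA=${DATAROOT}/${jobid}
cyc=${cyc:-00}
echo "Running $PDY$cyc"
'''

usage = classify_env_usage(SAMPLE)
```

Positional parameters and command substitutions such as $1 or $(date) are ignored because the reference pattern requires a letter or underscore after the dollar sign.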
1E. Cross-Language Bridges: ingest_cross_language_bridges.py
| Property | Value |
|---|---|
| Database | Neo4j only |
| Phase | 24F-2, 27I, 27J |
This is the critical script that connects the three language graphs together. It finds where shell scripts launch Fortran executables ($EXECgfs/gfs_atmos.x) and invoke Python modules, creating edges that let you trace execution across language boundaries.
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:File)-[:EXECUTES {executable, line}]->(:FortranProgram) | Shell → Fortran executable launch |
| (:File)-[:INVOKES {script, line}]->(:PythonModule) | Shell → Python invocation |
Executable patterns detected:
- $EXECgfs/name.x, ${HOMEgfs}/exec/name.x
- Config-defined: $FCSTEXEC, $APRUNFC $exec_name
- WW3 patterns: ${NET,,}_ww3_*.x
- UFS model: gfs_model.x, gefs_model.x, sfs_model.x
Creates placeholder FortranProgram nodes for external packages (GSI, UFS_UTILS, Fit2Obs, WW3) when the actual Fortran source is not in the repo.
Run: python3 ingest_cross_language_bridges.py (or --dry-run, --verbose)
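A minimal sketch of how the first pattern family ($EXECgfs/name.x and ${HOMEgfs}/exec/name.x) might be detected line by line is shown below. The regular expressions, the find_executes name, and the example_tool.x executable are assumptions for illustration; the config-defined, WW3, and UFS patterns are not covered.

```python
import re

EXEC_PATTERNS = [
    re.compile(r'\$\{?EXEC\w*\}?/([\w.-]+\.x)\b'),       # $EXECgfs/name.x
    re.compile(r'\$\{?HOME\w+\}?/exec/([\w.-]+\.x)\b'),  # ${HOMEgfs}/exec/name.x
]

def find_executes(script_path, text):
    """Return EXECUTES edge dicts carrying the executable name and line number."""
    edges = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comment lines (and the shebang)
        for pattern in EXEC_PATTERNS:
            for match in pattern.finditer(line):
                edges.append({
                    "from": script_path,
                    "rel": "EXECUTES",
                    "executable": match.group(1),
                    "line": lineno,
                })
    return edges

# Hypothetical script body used only to exercise the sketch.
SAMPLE = '''#!/usr/bin/env bash
# $EXECgfs/ignored.x is only mentioned in a comment
${APRUN} $EXECgfs/gfs_atmos.x > stdout
"${HOMEgfs}/exec/example_tool.x" input.nml
'''

edges = find_executes("scripts/example.sh", SAMPLE)
```

Each resulting dict maps onto the {executable, line} properties of the EXECUTES relationship; matching the executable name to a FortranProgram node (or creating a placeholder) would happen in a later step.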
1F. Documentation-to-Code Links: link_docs_to_code.py
| Property | Value |
|---|---|
| Database | Neo4j |
| Phase | Week 3 Plan Phase 3 |
Creates DOC_DESCRIBES relationships between documentation chunks and code entities they reference, using regex pattern matching on file names, function names, job names, and script paths.
Run: python3 link_docs_to_code.py
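The name-matching idea can be sketched as follows, assuming the set of known entity names has already been loaded from the graph. The link_chunk function, the boundary rules, and the sample names are hypothetical; the actual patterns in link_docs_to_code.py may differ.

```python
import re

def link_chunk(chunk_id, text, entity_names):
    """Return DOC_DESCRIBES triples for every known name found in a doc chunk."""
    links = []
    for name in sorted(entity_names):
        # Require that the name is not embedded in a longer identifier or path part.
        pattern = rf'(?<![\w.]){re.escape(name)}(?!\w)'
        if re.search(pattern, text):
            links.append((chunk_id, "DOC_DESCRIBES", name))
    return links

# Hypothetical documentation chunk and entity names.
TEXT = "The JGLOBAL_FORECAST job calls exglobal_forecast.sh, which reads config.fcst."
NAMES = {"JGLOBAL_FORECAST", "exglobal_forecast.sh", "config.fcst",
         "JGLOBAL_FORECAST_POST", "fcst"}

links = link_chunk("chunk-001", TEXT, NAMES)
```

The lookaround guards keep a short name such as fcst from matching inside config.fcst, which reduces spurious links at the cost of missing some abbreviated mentions.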
Stage 2: Code Vector Embeddings (ChromaDB)
These scripts create searchable vector embeddings of source code, enabling semantic code search ("find code that handles precipitation accumulation").
2A. Code Embeddings: ingest_code_v8.py
| Property | Value |
|---|---|
| Embedding Model | all-mpnet-base-v2 (768 dimensions) |
| Collection | code-with-context-v8-0-0 (~58,761 documents) |
| Database | ChromaDB only |
Chunks Python and Shell source code along AST-aware semantic boundaries (200-2000 characters per chunk), with 3 lines of context before and after each chunk. References Neo4j for graph enrichment metadata but does not write to Neo4j.
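For Python input, chunking on definition boundaries with surrounding context can be sketched with the stdlib ast module. How ingest_code_v8.py treats chunks outside the size limits is not stated here, so skipping undersized chunks and clipping oversized ones are assumptions of this example, as are the function and field names.

```python
import ast

def chunk_python(source, min_chars=200, max_chars=2000, context_lines=3):
    """Return one chunk per top-level function or class, with context lines."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        start, end = node.lineno - 1, node.end_lineno  # 0-based slice bounds
        body = "\n".join(lines[start:end])
        if len(body) < min_chars:
            continue  # assumption: very small definitions are skipped
        chunks.append({
            "name": node.name,
            "text": body[:max_chars],  # assumption: oversized chunks are clipped
            "context_before": lines[max(0, start - context_lines):start],
            "context_after": lines[end:end + context_lines],
        })
    return chunks

# Hypothetical source used only to exercise the sketch.
SRC = (
    "import os\n"
    "A = 1\n"
    "B = 2\n"
    "C = 3\n"
    "\n"
    "def run(path):\n"
    "    return os.path.basename(path)\n"
    "\n"
    "D = 4\n"
)

chunks = chunk_python(SRC, min_chars=10)
```

Keeping the context lines separate from the chunk text leaves it open whether they are embedded together with the chunk or stored only as metadata.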
2B. Graph-Enriched Code Embeddings: ingest_code_graph_enriched_v6.py
| Property | Value |
|---|---|
| Collection | code-graph-v6-0-0 |
| Database | Neo4j + ChromaDB (hybrid) |
Creates enriched embeddings whose metadata includes dependency context from the Neo4j graph, so that code search results can be weighted by structural importance.
Run: python3 ingest_code_graph_enriched_v6.py --directory PATH --collection NAME
Stage 3: Documentation & Standards Embeddings (ChromaDB)
3A. Documentation: ingest_documentation_v8.py
| Property | Value |
|---|---|
| Sources | 16 web-crawled sites (Global Workflow RTD, EE2, UFS, Rocoto, ecFlow, Spack, JEDI, etc.) |
| Collection | global-workflow-docs-v8-0-0 (~5,409 documents) |
| Database | ChromaDB only |
| Embedding Model | all-mpnet-base-v2 (768 dimensions) |
Source URLs are defined in documentation_sources_config.py, the single point of truth (SPOT) for all documentation targets. Documents are tagged with priority tiers: tier1_critical, tier2_important, tier3_supplementary.
3B. EE2 Standards: ingest_ee2_enhanced_v5.py
| Property | Value |
|---|---|
| Sources | RST files in sdd_framework/phase2_annotations/ |
| Collection | ee2-standards-v5-0-0-enhanced (~34 documents) |
| Database | ChromaDB only |
Parses custom RST directives (mcp:anti_pattern::, mcp:correct_pattern::, mcp:ai_guidance_rule::, etc.) into structured compliance rules with severity levels and false-positive rates.
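A line-oriented parser for directives of this shape might look like the sketch below. The directive names come from the list above, but the option names (:severity:, :false_positive_rate:), the rule identifiers, and the output layout are assumptions; the annotation files may use a different field syntax.

```python
import re

DIRECTIVE_RE = re.compile(r'^\.\. (mcp:\w+)::\s*(.*)$')
OPTION_RE = re.compile(r'^\s+:(\w+):\s*(.*)$')

def parse_mcp_directives(rst_text):
    """Return one dict per mcp: directive with its options and body lines."""
    rules, current = [], None
    for line in rst_text.splitlines():
        match = DIRECTIVE_RE.match(line)
        if match:
            current = {"directive": match.group(1),
                       "argument": match.group(2).strip(),
                       "options": {}, "body": []}
            rules.append(current)
            continue
        stripped = line.strip()
        if current is None or not stripped:
            continue
        if not line[0].isspace():
            current = None  # unindented text ends the directive block
            continue
        option = OPTION_RE.match(line)
        if option and not current["body"]:
            current["options"][option.group(1)] = option.group(2).strip()
        else:
            current["body"].append(stripped)
    return rules

# Hypothetical annotation text used only to exercise the sketch.
SAMPLE = """\
.. mcp:anti_pattern:: hardcoded-path
   :severity: high
   :false_positive_rate: 0.05

   Hard-coded absolute paths in job scripts.

Regular paragraph.

.. mcp:correct_pattern:: use-env-root
   :severity: info

   Build paths from exported root variables.
"""

rules = parse_mcp_directives(SAMPLE)
```

A docutils-based parser with registered directive classes would be more robust for nested content; the regex approach here only handles flat option lists followed by a plain-text body.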
3C. J-Jobs: ingest_jjobs_v8.py
| Property | Value |
|---|---|
| Sources | J-Job scripts in dev/jobs/ |
| Collection | jjobs-v8-0-0 (~700 documents) |
| Database | ChromaDB only |
Creates vector embeddings of J-Job scripts with structured metadata (inputs, outputs, calls, configs, env_vars).
3D. Master Re-Ingestion: reingest_all_with_phase2.sh
Orchestration wrapper that runs the full documentation + EE2 pipeline in order:
- ingest_documentation_week3.py (standard web crawl)
- ingest_ee2_enhanced_v5.py (Phase 2 annotations)
- generatePhase2Config.js (generates phase2_anti_patterns.json)
- Validation checks
Run: ./reingest_all_with_phase2.sh <collection_name>
Use this only for full system refreshes, such as an embedding model change, and not for routine updates.
Stage 4: Community Detection & Summarization (Neo4j + ChromaDB)
This is the most advanced stage: it discovers emergent structure in the code graph using graph algorithms, then generates natural-language summaries of each community using LLMs.
Step 1: Community Detection: run_community_detection.js
| Property | Value |
|---|---|
| Algorithm | Leiden (via Neo4j Graph Data Science plugin) |
| Database | Neo4j |
| Phase | 24E |
Runs the GDS Leiden algorithm on the code graph to discover communities of tightly-coupled code entities. With --materialize, creates a 4-level hierarchy:
| Level | Count | Granularity |
|---|---|---|
| L0 | 694 | Fine-grained function clusters |
| L1 | 175 | Module-level groupings |
| L2 | 86 | Subsystem groupings |
| L3 | 81 | Major system components |
Node labels created:
| Label | Properties |
|---|---|
| Community | communityId, level, memberCount, name, languages, keyMembers, summary, summarySource, summaryModel |
Relationship types created:
| Relationship | Meaning |
|---|---|
| (:node)-[:MEMBER_OF]->(:Community) | Entity belongs to community |
| (:Community)-[:PARENT_OF]->(:Community) | Hierarchy (child→parent) |
| (:Community)-[:INTERACTS_WITH {strength}]->(:Community) | Cross-community coupling |
Run: node run_community_detection.js --materialize (full hierarchy)
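How the strength property on INTERACTS_WITH is computed is not specified here. One plausible definition, used purely as an assumption in the sketch below, is the number of graph edges that cross from one community to another. The function name and sample data are hypothetical, and the Leiden step itself (run inside Neo4j GDS) is not reproduced.

```python
from collections import Counter

def interaction_strengths(membership, edges):
    """Count edges whose endpoints belong to two different communities.

    membership: dict mapping entity name -> community id
    edges: iterable of (source, target) pairs, e.g. CALLS or USES edges
    """
    strengths = Counter()
    for source, target in edges:
        a, b = membership.get(source), membership.get(target)
        if a is None or b is None or a == b:
            continue  # skip unknown entities and intra-community edges
        strengths[(a, b)] += 1
    return dict(strengths)

# Hypothetical membership and call edges.
membership = {"getcon": 1, "sfcdrv": 1, "fpvs": 2, "execute": 3}
edges = [
    ("getcon", "sfcdrv"),   # same community, ignored
    ("getcon", "fpvs"),
    ("sfcdrv", "fpvs"),
    ("execute", "fpvs"),
    ("execute", "unknown"), # target not in any community, ignored
]

strengths = interaction_strengths(membership, edges)
```

The counts are directional in this sketch; summing (a, b) and (b, a) would give an undirected coupling measure if that is what the materialized relationship represents.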
Step 2: Export Contexts: export_community_contexts.js
Extracts community metadata from Neo4j into data/community_contexts.json for LLM processing. Each community context includes members, internal/external relationships, child summaries, and language distribution.
Run: node export_community_contexts.js
Step 3: Generate LLM Summaries: generate_llm_summaries.js
Sends each community context to GitHub Models API for natural-language summarization. Processes bottom-up (L0 first) so parent communities can reference child summaries. Uses a rotating pool of models (Ministral-3B, Cohere, Llama-3.3-70B, Codestral, Mistral-small, Phi-4). Rate-limited to 12 req/min.
Output: data/llm_summaries.json
Run: node generate_llm_summaries.js (resume-safe: skips already-summarized communities)
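The control flow described above (bottom-up order, model rotation, a 12 requests/minute limit, and resume safety) can be sketched independently of any API client. The actual script is Node.js; the Python below is an illustrative sketch in which the summarize callback stands in for the model call, and all names and the context layout are assumptions.

```python
import time

def summarize_all(contexts, existing, summarize, models,
                  max_per_minute=12, sleep=time.sleep):
    """Summarize communities level by level, skipping ones already done."""
    interval = 60.0 / max_per_minute       # 12 req/min -> 5 s between calls
    summaries = dict(existing)             # previously saved results
    calls = 0
    for ctx in sorted(contexts, key=lambda c: c["level"]):  # L0 first
        cid = ctx["id"]
        if cid in summaries:
            continue                       # resume-safe: already summarized
        child_summaries = [summaries[c] for c in ctx.get("children", [])
                           if c in summaries]
        model = models[calls % len(models)]  # rotate through the model pool
        if calls:
            sleep(interval)                # throttle every call after the first
        summaries[cid] = summarize(ctx, child_summaries, model)
        calls += 1
    return summaries

# Stand-in for the real model call, used only to exercise the sketch.
def fake_summarize(ctx, child_summaries, model):
    return f"{model}:{ctx['id']}:{len(child_summaries)}"

contexts = [
    {"id": "p", "level": 1, "children": ["a", "b"]},
    {"id": "a", "level": 0, "children": []},
    {"id": "b", "level": 0, "children": []},
]
pauses = []
result = summarize_all(contexts, {"a": "cached"}, fake_summarize,
                       ["model-1", "model-2"], sleep=pauses.append)
```

Passing sleep as a parameter keeps the throttle testable; with the default time.sleep the same function would pause five seconds between real requests.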
Step 4: Import Summaries: import_llm_summaries.js
Imports the generated summaries back into Neo4j (updating the Community.summary property) and into ChromaDB (community-summaries collection, ~1,648 documents).
Run: node import_llm_summaries.js (or --skip-chromadb, --skip-neo4j to target one database)
Node.js Ingestion Modules
In addition to the standalone Python scripts, src/ingestion/neo4j/ contains Node.js modules used by the MCP server for on-demand graph operations:
| Module | Purpose |
|---|---|
| Neo4jClient.js | Connection pooling, transactions, query/write interface |
| GraphSchema.js | Complete node/relationship schema definitions |
| CodeStructureIngester.js | Batch code parsing (Python AST, Shell/Fortran regex) |
| CMakeGraphIngester.js | Build system dependency graph from CMakeLists.txt |
| SubmoduleGraphIngester.js | Git submodule dependency mapping |
| GitHubGraphIngester.js | GitHub metadata (issues, PRs, contributors) |
Data Access Layer
The unified data access layer connects Neo4j and ChromaDB to the MCP tools:
| Module | Location | Purpose |
|---|---|---|
| GraphDatabase.js | src/data/ | Neo4j query interface (query() for reads, write() for mutations) |
| VectorDatabase.js | src/data/ | ChromaDB v2 API wrapper (6+ collections) |
| UnifiedDataAccess.js | src/data/ | GGSR: Graph-Guided Semantic Retrieval (hybrid Neo4j + ChromaDB fusion) |
GGSR (Graph-Guided Semantic Retrieval) is the key innovation: when you search for code, GGSR queries ChromaDB for semantic matches, then traverses the Neo4j graph to find structurally related entities, and fuses the results with configurable weight blending.
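The fusion step can be illustrated with a simple linear blend of two score maps. The real implementation lives in UnifiedDataAccess.js and its scoring formula is not given here, so the weighting scheme, the fuse_results name, and the assumption that both scores are normalized to [0, 1] are all illustrative.

```python
def fuse_results(semantic_hits, graph_hits, semantic_weight=0.6):
    """Blend semantic and graph scores; return (id, score) sorted best first.

    semantic_hits: dict id -> similarity score from the vector search
    graph_hits:    dict id -> structural relevance score from graph traversal
    """
    graph_weight = 1.0 - semantic_weight
    fused = {}
    for doc_id in set(semantic_hits) | set(graph_hits):
        fused[doc_id] = (semantic_weight * semantic_hits.get(doc_id, 0.0)
                         + graph_weight * graph_hits.get(doc_id, 0.0))
    # Sort by descending score, breaking ties by id for stable output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical scores: "b" is found by both retrieval paths.
semantic = {"a": 0.9, "b": 0.5}
graph = {"b": 1.0, "c": 0.8}

ranked = fuse_results(semantic, graph, semantic_weight=0.5)
```

An entity found by both paths ("b" here) outranks one that scores highly on a single path, which is the behaviour a hybrid retriever is usually after; other fusion rules such as reciprocal rank fusion would be equally valid choices.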
Current Graph Statistics (March 2026)
Neo4j Node Counts
| Label | Count | Source Script |
|---|---|---|
| FortranSubroutine | ~13,537 | ingest_fortran_graph.py |
| File | ~11,016 | Multiple ingestion scripts |
| PythonFunction | ~3,267 | ingest_python_graph.py |
| EnvironmentVariable | ~2,730 | ingest_env_variables.py + ingest_shell_graph_v8.py |
| Community | ~1,036 | run_community_detection.js |
| ShellScript | ~1,000+ | ingest_shell_graph_v8.py |
| PythonModule | ~624 | ingest_python_graph.py |
| FortranFunction | ~500 | ingest_fortran_graph.py |
| PythonClass | ~248 | ingest_python_graph.py |
| FortranProgram | ~200+ | ingest_fortran_graph.py |
| FortranModule | ~150 | ingest_fortran_graph.py |
| Total | ~41,355 |
Neo4j Relationship Counts
| Type | Count | Meaning |
|---|---|---|
| CALLS | ~439,919 | Function/subroutine calls |
| USES | ~91,285 | Fortran USE statements |
| IMPORTS | ~50,000+ | Python import statements |
| MEMBER_OF | ~21,559 | Community membership |
| DEFINES | ~15,000+ | Module defines entity |
| DEPENDS_ON_ENV | ~6,007 | Script uses env var |
| INTERACTS_WITH | ~5,000+ | Cross-community coupling |
| EXPORTS | ~1,669 | Script exports env var |
| PARENT_OF | ~978 | Community hierarchy |
| EXECUTES | ~16 | Shell → Fortran launch |
| Total | ~589,396 |
ChromaDB Collections
| Collection | Documents | Source Script |
|---|---|---|
| code-with-context-v8-0-0 | 58,761 | ingest_code_v8.py |
| global-workflow-docs-v8-0-0 | 5,409 | ingest_documentation_v8.py |
| community-summaries | 1,648 | import_llm_summaries.js |
| jjobs-v8-0-0 | 700 | ingest_jjobs_v8.py |
| ee2-standards-v5-0-0-enhanced | 34 | ingest_ee2_enhanced_v5.py |
| Total | ~66,552 |
Embedding Model: all-mpnet-base-v2 (768 dimensions). All collections must use the same model.
Execution Order (Full Rebuild)
For a complete graph rebuild from scratch:
cd mcp_server_node/scripts
# Stage 1: Code Graph (Neo4j), run in order
python3 ingest_fortran_graph.py # ~10-15 min
python3 ingest_python_graph.py # ~5 min
python3 ingest_shell_graph_v8.py # ~3 min
python3 ingest_env_variables.py # ~2 min
python3 ingest_cross_language_bridges.py # ~1 min
python3 link_docs_to_code.py # ~2 min
# Stage 2: Code Embeddings (ChromaDB)
python3 ingest_code_v8.py # ~15-25 min
# Stage 3: Documentation & Standards (ChromaDB)
python3 ingest_documentation_v8.py # ~10-30 min (web crawl)
python3 ingest_ee2_enhanced_v5.py # ~1 min
python3 ingest_jjobs_v8.py # ~3 min
# Stage 4: Community Detection (Neo4j + ChromaDB)
node run_community_detection.js --materialize # ~2 min
node export_community_contexts.js # ~1 min
node generate_llm_summaries.js # ~30-60 min (LLM rate-limited)
node import_llm_summaries.js # ~1 min
Prerequisites:
- Neo4j 5.x with GDS plugin running on bolt://localhost:7687
- ChromaDB running on http://localhost:8080 (v2 API)
- Spack modules loaded (module load python/3.11 py-neo4j py-pip)
- sentence-transformers installed (pip install --user sentence-transformers)
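Because each stage builds on the previous one, a full rebuild is easiest to run through a small driver that stops at the first failure. The wrapper below is a hypothetical sketch, not a script in the repository; it assumes it is started from mcp_server_node/scripts with the prerequisites above already in place, and its dry_run flag only lists the planned commands without invoking the scripts.

```python
import subprocess

# Commands in the order given for a full rebuild.
PIPELINE = [
    ["python3", "ingest_fortran_graph.py"],
    ["python3", "ingest_python_graph.py"],
    ["python3", "ingest_shell_graph_v8.py"],
    ["python3", "ingest_env_variables.py"],
    ["python3", "ingest_cross_language_bridges.py"],
    ["python3", "link_docs_to_code.py"],
    ["python3", "ingest_code_v8.py"],
    ["python3", "ingest_documentation_v8.py"],
    ["python3", "ingest_ee2_enhanced_v5.py"],
    ["python3", "ingest_jjobs_v8.py"],
    ["node", "run_community_detection.js", "--materialize"],
    ["node", "export_community_contexts.js"],
    ["node", "generate_llm_summaries.js"],
    ["node", "import_llm_summaries.js"],
]

def run_pipeline(commands, run=subprocess.run, dry_run=False):
    """Run commands in order; raise on the first non-zero exit status."""
    completed = []
    for cmd in commands:
        if not dry_run:
            result = run(cmd, check=False)
            if result.returncode != 0:
                raise RuntimeError(
                    f"Stage failed: {' '.join(cmd)} (exit {result.returncode})")
        completed.append(cmd)
    return completed

# List the planned commands without executing anything.
planned = run_pipeline(PIPELINE, dry_run=True)
```

Calling run_pipeline(PIPELINE) without dry_run executes the stages for real; since generate_llm_summaries.js is resume-safe, rerunning after a failure in Stage 4 should not repeat completed summaries.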
MCP Tools That Consume the Graph
| Tool Module | Key Tools | Uses |
|---|---|---|
| CodeAnalysisTools | find_dependencies, find_callers_callees, trace_execution_path, trace_full_execution_chain | Neo4j graph traversal |
| GraphRAGTools | get_code_context, search_architecture, find_similar_code, trace_data_flow, get_change_impact | GGSR (Neo4j + ChromaDB) |
| SemanticSearchTools | search_documentation, explain_with_context, find_related_files | ChromaDB + Neo4j enrichment |
| EE2ComplianceTools | analyze_ee2_compliance, scan_repository_compliance | ChromaDB (EE2 collection) |
| OperationalTools | get_operational_guidance, explain_workflow_component, get_job_details | ChromaDB (docs + jjobs) |
| SDDWorkflowTools | validate_sdd_compliance | Filesystem (SDD framework) |
Related Resources
- GraphRAG-Hierarchical-Community-Materialization: Detailed milestone report on the community detection pipeline
- PHASE_2_HYBRID_ARCHITECTURE_SPECIFICATION: Architecture of the hybrid Neo4j + ChromaDB system
- GitHub-MCP-Tools-installed-for-global-workflow-software-development-and-how-they-work: MCP platform overview
- MCP-RAG-Platform-32-Day-Achievement-Synopsis: 32-day achievement summary (v7.10.0 → v7.25.1)
- GraphRAG Dashboard: Interactive Neo4j query dashboard (HTML)