ENHANCED_INGESTION_ARCHITECTURE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Enhanced Ingestion Architecture - Context7-Inspired Design

Date: 2025-10-10
Status: Design Phase
Purpose: Comprehensive RAG ingestion for v17 coupled modeling system with error analysis capability

Vision: Error Analysis System

Primary Use Case

Intelligent Error Diagnosis and Resolution

Error Log β†’ MCP Error Tool β†’ Multi-Source RAG Query β†’ ChromaDB
              ↓
    [Parallel Search Across:]
    β”œβ”€ Historical error logs (1+ year training data)
    β”œβ”€ Source code context (full submodule tree)
    β”œβ”€ Official documentation
    β”œβ”€ GitHub issues/PRs (past incidents & solutions)
    β”œβ”€ Build system knowledge
    β”œβ”€ Test results & regression patterns
    └─ Workflow dependencies
              ↓
    [LLM Analysis & Synthesis:]
    β”œβ”€ Root cause identification
    β”œβ”€ Similar past incidents (with solutions)
    β”œβ”€ Code location pinpointing
    β”œβ”€ Step-by-step fix instructions
    β”œβ”€ Preventive recommendations
    └─ Related component analysis

Why This Matters

  • Time Savings: Reduce debugging from hours to minutes
  • Knowledge Retention: Institutionalize solutions from experienced developers
  • Pattern Recognition: Detect recurring issues across the coupled system
  • Proactive Prevention: Identify potential issues before they manifest
  • Cross-Component Understanding: Trace errors through UFS β†’ GSI β†’ GDAS β†’ GFS pipeline

Submodule Ecosystem - v17 Coupled System

βœ… Cloned Submodules (Complete)

global-workflow (root)
β”œβ”€β”€ dev/ci/scripts/utils/Rocoto          # Workflow engine
β”œβ”€β”€ sorc/wxflow                          # Python workflow library
β”œβ”€β”€ sorc/gdas.cd                         # Global Data Assimilation
β”‚   β”œβ”€β”€ parm/jcb-algorithms              # JCB algorithms
β”‚   β”œβ”€β”€ parm/jcb-gdas                    # GDAS JCB configs
β”‚   β”œβ”€β”€ sorc/bufr-query                  # BUFR data tools
β”‚   β”œβ”€β”€ sorc/crtm                        # Radiative transfer
β”‚   β”œβ”€β”€ sorc/da-utils                    # DA utilities
β”‚   β”œβ”€β”€ sorc/fv3-jedi                    # FV3 JEDI
β”‚   β”œβ”€β”€ sorc/fv3-jedi-lm                 # FV3 linear model
β”‚   β”œβ”€β”€ sorc/gsibec                      # GSI background error
β”‚   β”œβ”€β”€ sorc/gsw                         # Seawater library
β”‚   β”œβ”€β”€ sorc/ioda                        # Observation data
β”‚   β”œβ”€β”€ sorc/iodaconv                    # IODA converters
β”‚   β”œβ”€β”€ sorc/jcb                         # JCB framework
β”‚   β”œβ”€β”€ sorc/jedicmake                   # JEDI CMake modules
β”‚   β”œβ”€β”€ sorc/land-imsproc                # Land IMS processing
β”‚   β”œβ”€β”€ sorc/land-jediincr               # Land JEDI increments
β”‚   β”œβ”€β”€ sorc/oops                        # OOPS framework
β”‚   β”œβ”€β”€ sorc/saber                       # SABER library
β”‚   β”œβ”€β”€ sorc/soca                        # Ocean analysis
β”‚   β”œβ”€β”€ sorc/spoc                        # SPOC utilities
β”‚   β”œβ”€β”€ sorc/ufo                         # Unified forward operator
β”‚   └── sorc/vader                       # Variable transforms
β”œβ”€β”€ sorc/gfs_utils.fd                    # GFS utilities
β”œβ”€β”€ sorc/gsi_enkf.fd                     # GSI/EnKF system
β”‚   β”œβ”€β”€ fix/                             # GSI fixed files
β”‚   └── fix/build_gsinfo/                # GSI info builder
β”œβ”€β”€ sorc/gsi_monitor.fd                  # GSI monitoring
β”œβ”€β”€ sorc/gsi_utils.fd                    # GSI utilities
β”œβ”€β”€ sorc/ufs_model.fd                    # UFS Weather Model
β”‚   β”œβ”€β”€ AQM/                             # Air Quality Model
β”‚   β”‚   └── src/model/CMAQ               # CMAQ integration
β”‚   β”œβ”€β”€ CDEPS-interface/CDEPS            # Data/Exchange Protocol
β”‚   β”œβ”€β”€ CICE-interface/CICE              # Sea ice model
β”‚   β”‚   └── icepack/                     # Ice column physics
β”‚   β”œβ”€β”€ CMEPS-interface/CMEPS            # Mediator
β”‚   β”œβ”€β”€ CMakeModules/                    # Build system
β”‚   β”œβ”€β”€ GOCART/                          # Aerosol model
β”‚   β”œβ”€β”€ HYCOM-interface/HYCOM            # Ocean model
β”‚   β”œβ”€β”€ LM4-driver/                      # Land model driver
β”‚   β”‚   └── LM4/                         # Land Model v4
β”‚   β”œβ”€β”€ MOM6-interface/MOM6              # Ocean model v6
β”‚   β”‚   β”œβ”€β”€ pkg/CVMix-src/               # Vertical mixing
β”‚   β”‚   └── pkg/GSW-Fortran/             # Seawater equations
β”‚   β”œβ”€β”€ NOAHMP-interface/noahmp          # Noah-MP land model
β”‚   β”œβ”€β”€ UFSATM/                          # UFS Atmosphere
β”‚   β”‚   β”œβ”€β”€ ccpp/framework/              # CCPP framework
β”‚   β”‚   β”œβ”€β”€ ccpp/physics/                # Physics schemes
β”‚   β”‚   β”‚   β”œβ”€β”€ physics/MP/TEMPO/TEMPO   # Microphysics
β”‚   β”‚   β”‚   └── physics/Radiation/RRTMGP # Radiation
β”‚   β”‚   β”œβ”€β”€ fv3/atmos_cubed_sphere/      # FV3 dycore
β”‚   β”‚   β”œβ”€β”€ mpas/MPAS-Model/             # MPAS integration
β”‚   β”‚   └── upp/                         # Unified Post Processor
β”‚   β”œβ”€β”€ WW3/                             # Wave model
β”‚   β”œβ”€β”€ fire_behavior/                   # Fire weather
β”‚   └── stochastic_physics/              # Stochastic schemes
β”œβ”€β”€ sorc/ufs_utils.fd                    # UFS preprocessing
β”‚   └── ccpp-physics/                    # CCPP physics
β”‚       β”œβ”€β”€ physics/MP/TEMPO/TEMPO       # Microphysics
β”‚       └── physics/Radiation/RRTMGP     # Radiation
└── sorc/verif-global.fd                 # Verification tools

Total Repositories: 50+ (root + submodules + nested submodules)
Lines of Code: ~3-5 million estimated
Languages: Fortran, Python, C/C++, Shell, CMake, YAML
Documentation: README.md files, Sphinx docs, Doxygen, inline comments

Context7-Inspired Capabilities

What Context7 Does Well

  1. Intelligent Code Chunking: Semantic-aware splitting (not just line counts)
  2. Relationship Mapping: Track dependencies and interactions
  3. Version Awareness: Monitor code changes over time
  4. Context Window Management: Smart retrieval for LLM context limits
  5. Multi-Repo Knowledge Graphs: Cross-repository understanding
  6. Semantic Code Search: Beyond text matching to intent understanding

Our Enhanced Implementation

1. Intelligent Chunking Strategy

// Context7-inspired chunking
class EnhancedChunker {
  chunkBySemanticBoundaries(code, language) {
    // Respect natural boundaries:
    // - Functions/subroutines
    // - Classes/modules
    // - Documentation blocks
    // - Logical code sections
    
    // Preserve context:
    // - Include function signatures
    // - Keep docstrings with code
    // - Maintain import/use statements
  }
  
  chunkDocumentation(doc) {
    // Semantic sections:
    // - Concept explanations
    // - API documentation
    // - Examples with context
    // - Cross-references preserved
  }
  
  chunkErrorLogs(log) {
    // Structured extraction:
    // - Error message + context lines
    // - Stack traces
    // - Timestamps and metadata
    // - Related log entries
  }
}

2. Multi-Dimensional Indexing

ChromaDB Collections Structure:

1. code_knowledge
   β”œβ”€ Metadata: {repo, path, language, function_name, line_range, commit_hash}
   β”œβ”€ Embeddings: Semantic code understanding
   └─ Content: Code with context (imports, docstrings)

2. documentation
   β”œβ”€ Metadata: {source, type, section, related_code}
   β”œβ”€ Embeddings: Concept understanding
   └─ Content: Documentation with examples

3. error_patterns
   β”œβ”€ Metadata: {timestamp, component, severity, error_type, resolution_status}
   β”œβ”€ Embeddings: Error signature + context
   └─ Content: Full error with stack trace

4. solutions_knowledge
   β”œβ”€ Metadata: {error_hash, fix_commit, author, success_rate}
   β”œβ”€ Embeddings: Solution approach
   └─ Content: Fix description + code changes

5. github_intelligence
   β”œβ”€ Metadata: {issue_number, pr_number, labels, resolution_time}
   β”œβ”€ Embeddings: Problem + solution
   └─ Content: Issue/PR discussion + resolution

6. workflow_dependencies
   β”œβ”€ Metadata: {component, dependency_type, version}
   β”œβ”€ Embeddings: Relationship understanding
   └─ Content: Dependency graph + interactions

7. build_system_knowledge
   β”œβ”€ Metadata: {build_target, compiler, platform}
   β”œβ”€ Embeddings: Build pattern recognition
   └─ Content: CMake configs + build logs

8. test_results
   β”œβ”€ Metadata: {test_name, status, platform, date}
   β”œβ”€ Embeddings: Test pattern + failures
   └─ Content: Test output + expectations
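To make the code_knowledge schema above concrete, the helper below assembles one chunk into the id/document/metadata shape a vector store's add call expects. The function and field values are illustrative, not part of the existing ingesters:

```javascript
// Hypothetical helper: build a store-ready record for one code chunk.
// Field names follow the code_knowledge metadata schema above.
function makeCodeChunkRecord({ repo, path, language, functionName,
                               lineStart, lineEnd, commitHash, source }) {
  return {
    // Stable ID: same chunk at the same commit always dedupes.
    id: `${repo}:${path}:${lineStart}-${lineEnd}@${commitHash}`,
    document: source,
    metadata: {
      repo,
      path,
      language,
      function_name: functionName,
      line_range: `${lineStart}-${lineEnd}`,
      commit_hash: commitHash,
    },
  };
}

// Example: one Fortran subroutine chunk (illustrative values).
const record = makeCodeChunkRecord({
  repo: 'ufs_model.fd',
  path: 'FV3/atmos_model.F90',
  language: 'fortran',
  functionName: 'atmos_model_init',
  lineStart: 120,
  lineEnd: 240,
  commitHash: 'abc1234',
  source: 'subroutine atmos_model_init(...)',
});
```

Keeping the commit hash inside the ID means re-ingesting after a commit produces new records instead of silently overwriting old ones.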

3. GitHub Integration with Authentication

// Enhanced GitHub tools with GH_TOKEN
// Assumes: const { Octokit } = require('@octokit/rest');
class GitHubIngester {
  constructor(token) {
    // Prefer an explicit token; fall back to the GH_TOKEN environment variable
    this.token = token || process.env.GH_TOKEN;
    this.octokit = new Octokit({ auth: this.token });
  }
  
  async ingestRepository(owner, repo) {
    // 1. Full code structure
    const tree = await this.getRepositoryTree(owner, repo);
    
    // 2. Issues (error reports)
    const issues = await this.getIssues(owner, repo, {
      state: 'all',
      labels: ['bug', 'error', 'compilation']
    });
    
    // 3. Pull Requests (solutions)
    const prs = await this.getPullRequests(owner, repo, {
      state: 'closed',
      merged: true
    });
    
    // 4. Commit history (changes over time)
    const commits = await this.getCommits(owner, repo);
    
    // 5. Documentation
    const docs = await this.getDocumentation(owner, repo);
    
    return { tree, issues, prs, commits, docs };
  }
  
  async findRelatedIssues(errorSignature) {
    // Semantic search across issues using embeddings
    const similar = await this.searchIssuesByEmbedding(errorSignature);
    return similar;
  }
}

4. Relationship Graph Builder

// Track relationships across codebase
class RelationshipMapper {
  buildDependencyGraph() {
    return {
      // Code-to-code
      calls: this.extractFunctionCalls(),
      imports: this.extractImports(),
      includes: this.extractIncludes(),
      
      // Code-to-docs
      documented_by: this.linkCodeToDocs(),
      examples_for: this.linkExamplesToCode(),
      
      // Code-to-errors
      fails_in: this.linkErrorsToCode(),
      fixed_by: this.linkFixesToErrors(),
      
      // Component relationships
      depends_on: this.extractComponentDeps(),
      used_by: this.extractUsagePatterns(),
      
      // Build relationships
      compiled_with: this.extractBuildDeps(),
      required_by: this.extractBuildRequirements()
    };
  }
  
  traverseRelationships(startNode, maxDepth = 3) {
    // Context7-style contextual retrieval
    // Given an error, find related code, docs, and past fixes
  }
}
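The empty traverseRelationships method above is essentially a depth-capped breadth-first walk. A standalone sketch over a plain adjacency map (an assumption about the eventual in-memory representation, not the final implementation):

```javascript
// Breadth-first traversal over an in-memory relationship map,
// returning every node reachable from startNode within maxDepth hops,
// along with its hop distance.
function traverseRelationships(graph, startNode, maxDepth = 3) {
  const visited = new Map([[startNode, 0]]);
  let frontier = [startNode];
  for (let depth = 1; depth <= maxDepth && frontier.length; depth++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] || []) {
        if (!visited.has(neighbor)) {
          visited.set(neighbor, depth);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return visited; // Map: node -> hop distance from startNode
}

// Toy slice of the coupled-system graph (component names illustrative).
const deps = {
  'gfs-error': ['gsi_enkf.fd'],
  'gsi_enkf.fd': ['crtm', 'ufs_model.fd'],
  'ufs_model.fd': ['MOM6', 'CICE'],
};
const reachable = traverseRelationships(deps, 'gfs-error', 2);
// Two hops reaches crtm and ufs_model.fd but not MOM6/CICE.
```

The depth cap is what keeps contextual retrieval bounded: given an error node, two or three hops is usually enough context without pulling in the whole graph.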

Ingestion Pipeline Architecture

Phase 1: Repository Structure Analysis (Fast)

# Discover all components
1. Scan .gitmodules recursively
2. Build repository map
3. Identify language distributions
4. Count lines of code per component
5. Generate metadata manifest

Output: repositories_manifest.json
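Step 1 (recursive .gitmodules scanning) is simple enough to sketch directly. This minimal parser, an illustration rather than the production ingester, pulls name/path/url triples out of git's INI-style format:

```javascript
// Parse a .gitmodules file into { name, path, url } entries.
// .gitmodules is INI-style: [submodule "name"] headers followed by
// indented key = value lines.
function parseGitmodules(text) {
  const entries = [];
  let current = null;
  for (const line of text.split('\n')) {
    const header = line.match(/^\[submodule "(.+)"\]/);
    if (header) {
      current = { name: header[1] };
      entries.push(current);
    } else if (current) {
      const kv = line.match(/^\s*(path|url)\s*=\s*(.+)$/);
      if (kv) current[kv[1]] = kv[2].trim();
    }
  }
  return entries;
}

const sample = [
  '[submodule "sorc/wxflow"]',
  '\tpath = sorc/wxflow',
  '\turl = https://github.com/NOAA-EMC/wxflow',
].join('\n');
const mods = parseGitmodules(sample);
```

Recursing is then a matter of re-running the parser on each submodule's own .gitmodules until no new entries appear.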

Phase 2: Documentation Ingestion (Medium Priority)

# Already partially complete via Claude CLI
1. Extract all README.md files
2. Find Sphinx/Doxygen documentation
3. Parse inline code documentation
4. Extract docstrings/comments
5. Identify examples and tutorials

Collections: documentation

Phase 3: Source Code Ingestion (High Priority)

# Semantic code understanding
1. Parse code by language (tree-sitter)
2. Extract functions/subroutines/classes
3. Preserve context (imports, dependencies)
4. Chunk semantically (not by line count)
5. Generate embeddings with code context

Collections: code_knowledge
Embedding Model: StarCoder or CodeBERT
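For Fortran, "chunk semantically" mostly means splitting at subroutine/function boundaries. The regex pass below is a rough stand-in for a proper tree-sitter parse, enough to show the idea (a real grammar handles interfaces, nested procedures, and continuation lines this misses):

```javascript
// Split Fortran source into one chunk per subroutine, keeping the
// full body so comments and declarations travel with the code.
function chunkFortranSubroutines(source) {
  const re = /^\s*subroutine\s+(\w+)[\s\S]*?^\s*end\s+subroutine\b.*$/gim;
  const chunks = [];
  let m;
  while ((m = re.exec(source)) !== null) {
    chunks.push({ name: m[1], text: m[0] });
  }
  return chunks;
}

const src = [
  'module demo',
  'contains',
  'subroutine init(n)',
  '  integer :: n',
  'end subroutine init',
  'subroutine advance(dt)',
  '  real :: dt',
  'end subroutine advance',
  'end module demo',
].join('\n');
const chunks = chunkFortranSubroutines(src);
// Two chunks: init and advance, each with its full body.
```

Each chunk then gets its enclosing module name and use statements prepended before embedding, per the "preserve context" rule above.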

Phase 4: GitHub Intelligence (Critical for Errors)

# Historical context from GitHub
1. Fetch all issues (especially bugs)
2. Fetch closed PRs (solutions)
3. Extract commit messages
4. Link issues to code changes
5. Build solution database

Collections: github_intelligence, solutions_knowledge
Rate Limit: Authenticated = 5000 req/hr

Phase 5: Error Log Training Data (Error Analysis Core)

# 1+ year of error logs
1. Collect all error logs by component
2. Parse error messages + stack traces
3. Extract error signatures
4. Link to code locations
5. Link to successful fixes
6. Build error taxonomy

Collections: error_patterns
Priority: HIGH (primary use case)
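Step 3 (error signature extraction) typically normalizes away run-specific noise, such as timestamps, addresses, and rank numbers, so recurring failures collapse to the same signature. A hedged sketch of that normalization:

```javascript
// Normalize an error line into a stable signature by stripping
// run-specific noise: timestamps, hex addresses, and bare numbers.
function extractErrorSignature(line) {
  return line
    .replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*/g, '<TS>')
    .replace(/0x[0-9a-fA-F]+/g, '<ADDR>')
    .replace(/\b\d+\b/g, '<N>')
    .trim();
}

// Two occurrences of the same failure from different runs (made-up lines):
const a = extractErrorSignature(
  '2025-10-10 03:15:22 FATAL: MPI_Abort at 0x7f3a2c rank 512'
);
const b = extractErrorSignature(
  '2025-10-11 14:02:09 FATAL: MPI_Abort at 0x55de10 rank 64'
);
// Both collapse to '<TS> FATAL: MPI_Abort at <ADDR> rank <N>'
```

Signatures like these become the grouping key for the error_patterns collection and the error_hash used in solutions_knowledge.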

Phase 6: Build System Knowledge

# Compilation and linking context
1. Parse CMakeLists.txt files
2. Extract build dependencies
3. Collect successful build logs
4. Collect failed build logs
5. Map build targets to source

Collections: build_system_knowledge

Phase 7: Test Results & Regression Data

# Test patterns and failures
1. Collect CTest results
2. Parse regression test outputs
3. Link failures to code changes
4. Track test history
5. Identify flaky tests

Collections: test_results

Phase 8: Relationship Mapping (Integration)

# Connect everything
1. Build code dependency graph
2. Map errors to code locations
3. Link docs to implementations
4. Connect issues to fixes
5. Build component interaction map

Collections: workflow_dependencies
Output: knowledge_graph.json

Enhanced Ingestion Scripts

Directory Structure

/mcp_rag_eib/mcp_server_node/src/ingestion/
β”œβ”€β”€ EnhancedIngester.js              # Main orchestrator
β”œβ”€β”€ ContentExtractor.js              # Existing (keep)
β”œβ”€β”€ DocumentationIngester.js         # Existing (keep)
β”œβ”€β”€ URLFetcher.js                    # Existing (keep)
β”œβ”€β”€ CodeChunker.js                   # NEW - Context7-inspired
β”œβ”€β”€ GitHubIngester.js                # NEW - With auth
β”œβ”€β”€ ErrorLogIngester.js              # NEW - Error analysis
β”œβ”€β”€ RelationshipMapper.js            # NEW - Dependency graphs
β”œβ”€β”€ SemanticChunker.js               # NEW - Smart chunking
└── IngestionOrchestrator.js         # NEW - Pipeline manager

Environment Requirements

# Add to mcp_env.sh
export GH_TOKEN="${GH_TOKEN:-}"                    # GitHub authentication
export INGESTION_BATCH_SIZE="100"                  # Batch processing
export EMBEDDING_MODEL="StarCoder"                 # Code embeddings
export MAX_CHUNK_SIZE="2000"                       # Token limit per chunk
export MIN_CHUNK_OVERLAP="200"                     # Context overlap
export ERROR_LOG_PATH="/path/to/error/logs"        # Error log location
export GITHUB_ORG="NOAA-EMC"                       # Primary org
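MAX_CHUNK_SIZE and MIN_CHUNK_OVERLAP feed a sliding-window fallback used when no semantic boundary is available. The behavior they control looks roughly like this, with token counting simplified to whitespace words for illustration:

```javascript
// Sliding-window chunking with overlap: the fallback when no semantic
// boundary exists. "Tokens" are approximated by whitespace words here;
// maxChunk must exceed overlap or the window never advances.
function slidingWindowChunks(text, maxChunk = 2000, overlap = 200) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const step = maxChunk - overlap;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxChunk).join(' '));
    if (start + maxChunk >= tokens.length) break;
  }
  return chunks;
}

// 10 words, window of 4, overlap of 2 -> windows start at 0, 2, 4, 6.
const words = 'a b c d e f g h i j';
const out = slidingWindowChunks(words, 4, 2);
// out[0] === 'a b c d', out[1] === 'c d e f'
```

The overlap is what keeps a retrieved chunk intelligible when the matching sentence sits at a window edge.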

Ingestion Workflow

Step-by-Step Execution

1. Initialize Environment

source /mcp_rag_eib/SETUP/mcp_env.sh
cd /mcp_rag_eib/mcp_server_node

2. Run Repository Analysis

node src/ingestion/IngestionOrchestrator.js analyze \
  --repo-root $GIT_REPO \
  --output repositories_manifest.json

3. Ingest Documentation (Fast Start)

node src/ingestion/IngestionOrchestrator.js ingest-docs \
  --manifest repositories_manifest.json \
  --collection documentation \
  --batch-size 100

4. Ingest Source Code (Parallel by Language)

# Fortran
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language fortran --parallel 4

# Python
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language python --parallel 4

# C/C++
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language c,cpp --parallel 4

5. Ingest GitHub Intelligence

# Requires GH_TOKEN
node src/ingestion/IngestionOrchestrator.js ingest-github \
  --org NOAA-EMC \
  --repos global-workflow,ufs-weather-model,GSI \
  --include issues,prs,commits

6. Ingest Error Logs (Critical!)

node src/ingestion/IngestionOrchestrator.js ingest-errors \
  --log-path /path/to/error/logs \
  --time-range "1y" \
  --extract-solutions

7. Build Relationships

node src/ingestion/IngestionOrchestrator.js build-graph \
  --output knowledge_graph.json

Performance Targets

Ingestion Speed

  • Documentation: ~1000 docs/minute (text-based, fast)
  • Source Code: ~100 files/minute (parsing required)
  • GitHub Data: ~50 issues/minute (API limited)
  • Error Logs: ~500 logs/minute (structured parsing)

Storage Estimates

  • Code Knowledge: ~10GB (3M LOC + embeddings)
  • Documentation: ~1GB
  • Error Patterns: ~5GB (1 year of logs)
  • GitHub Intelligence: ~2GB
  • Relationship Graph: ~500MB
  • Total: ~20GB of 25GB available

Query Performance

  • Single error lookup: <100ms
  • Related code search: <200ms
  • Cross-component analysis: <500ms
  • Full context assembly: <1s

Success Metrics

Error Analysis Use Case

  1. Diagnosis Accuracy: >90% correct root cause identification
  2. Solution Relevance: >80% actionable solutions found
  3. Time Savings: Reduce debugging from hours to <15 minutes
  4. Pattern Detection: Identify 100% of recurring issues
  5. Coverage: Handle errors across all 50+ components

System Health

  1. Ingestion Completeness: 100% of submodules indexed
  2. Update Frequency: Daily incremental updates
  3. Query Response: <1s for 95th percentile
  4. Uptime: 99.9% availability

Implementation Timeline

Week 1: Core Infrastructure (Current)

  • ChromaDB setup on persistent storage
  • LangFlow deployment
  • Submodules cloned
  • Enhanced ingestion scripts created
  • GitHub authentication configured

Week 2: Initial Ingestion

  • Documentation ingestion complete
  • Source code ingestion (Python, Shell)
  • GitHub issues/PRs indexed
  • Basic error log ingestion

Week 3: Advanced Features

  • Fortran/C++ code ingestion
  • Relationship graph building
  • Error analysis tool implementation
  • Cross-component search

Week 4: Refinement & Testing

  • Performance optimization
  • Error analysis validation
  • LLM integration testing
  • Production deployment

Next Immediate Steps

  1. Update bootstrap.sh - Add GH_TOKEN export and verification
  2. Create enhanced ingestion scripts - All new modules above
  3. Test documentation ingestion - Validate ChromaDB integration
  4. Configure error log collection - Set up ERROR_LOG_PATH
  5. Begin code ingestion - Start with Python (fastest)

Neo4j Graph Database Integration

Strategic Value Assessment

Decision: βœ… APPROVED - Proceed with Phased Implementation

Rationale:

  • GFS system complexity (50+ repositories, 3-5M LOC) requires graph-based relationship understanding
  • Vector embeddings (ChromaDB) excel at semantic search but cannot answer structural queries:
    • "What components are affected by this change?"
    • "What's the dependency chain causing this error?"
    • "Which CMakeLists.txt needs to link this library?"
  • Emerging best practice in AI-powered developer tools (GitHub Copilot, Sourcegraph, GraphCodeBERT)
  • Manageable complexity with clear ROI (10x faster debugging = 120 hours/year saved)

Hybrid Triple-Store Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MCP RAG Intelligence Layer                      β”‚
β”‚  Orchestrates context assembly from multiple sources         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚              β”‚                β”‚
             β–Ό              β–Ό                β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  ChromaDB   β”‚  β”‚   Neo4j     β”‚  β”‚ PostgreSQL   β”‚
    β”‚  (Vectors)  β”‚  β”‚  (Graph)    β”‚  β”‚ (Time-series)β”‚
     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
    β”‚ Semantic    β”‚  β”‚ Structural  β”‚  β”‚ Temporal     β”‚
    β”‚ - Doc searchβ”‚  β”‚ - Depends   β”‚  β”‚ - Build logs β”‚
    β”‚ - Code sim. β”‚  β”‚ - Calls     β”‚  β”‚ - Test runs  β”‚
    β”‚ - Error sig.β”‚  β”‚ - Imports   β”‚  β”‚ - Metrics    β”‚
    β”‚ - Solutions β”‚  β”‚ - Defines   β”‚  β”‚ - History    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         8-10 GB          5-7 GB           3-5 GB
      (Existing)         (NEW)          (Optional)

Query Strategy Workflow

// Example: Error Analysis Query
async function analyzeError(errorMessage) {
  // Phase 1: Semantic Search (ChromaDB)
  const semanticResults = await chromadb.query({
    collection: 'error_patterns',
    text: errorMessage,
    n_results: 10
  });
  // Returns: Similar errors, solutions, related docs
  
  // Phase 2: Structural Analysis (Neo4j)
  const graphResults = await neo4j.run(`
    MATCH (error:Error {signature: $sig})
          -[:OCCURS_IN]->(func:Function)
          -[:DEFINED_IN]->(file:File)
          -[:BELONGS_TO]->(component:Component)
    MATCH p = shortestPath((component)-[:DEPENDS_ON*1..3]->(deps:Component))
    RETURN component, deps, p
  `, { sig: extractSignature(errorMessage) });
  // Returns: Code location, dependency chain, affected components
  
  // Phase 3: Temporal Context (PostgreSQL - Optional)
  const temporalResults = await postgres.query(`
    SELECT commit_hash, author, timestamp, message
    FROM commits
    WHERE component_id = ANY($1)
      AND timestamp > NOW() - INTERVAL '7 days'
    ORDER BY timestamp DESC
  `, [graphResults.componentIds]);
  // Returns: Recent changes, potential culprits
  
  // Phase 4: LLM Synthesis
  return await llm.analyze({
    semanticContext: semanticResults,
    structuralContext: graphResults,
    temporalContext: temporalResults,
    query: `Diagnose this error and provide fix instructions`
  });
}

Neo4j Schema Design

Node Types

// Core Code Structure
(:Component {name, path, language, loc, description})
(:Module {name, file, language, exports})
(:Function {name, signature, file, line_start, line_end})
(:File {path, language, loc, last_modified})
(:Subroutine {name, module, parameters, file, line_start})

// Build System
(:CMakeTarget {name, type, output})
(:Library {name, path, version, linked_by})
(:Dependency {name, type, version, required_by})

// Development
(:Developer {name, email, expertise_areas})
(:Commit {hash, message, timestamp, author})
(:Issue {number, title, labels, status, resolution})
(:PullRequest {number, title, merged, files_changed})

// Error Intelligence
(:Error {signature, message, severity, frequency})
(:ErrorPattern {category, symptom, cause, solution})
(:Fix {commit_hash, success_rate, application_count})

// Runtime
(:BuildTarget {name, platform, compiler, flags})
(:TestCase {name, status, platform, runtime})

Relationship Types

// Code Relationships
(Function)-[:CALLS]->(Function)
(Module)-[:IMPORTS]->(Module)
(File)-[:INCLUDES]->(File)
(Function)-[:DEFINED_IN]->(File)
(File)-[:BELONGS_TO]->(Component)
(Component)-[:CONTAINS]->(Module)

// Dependency Relationships
(Component)-[:DEPENDS_ON {version, type}]->(Component)
(Library)-[:REQUIRED_BY]->(Component)
(CMakeTarget)-[:LINKS_TO]->(Library)
(BuildTarget)-[:BUILDS]->(Component)

// Error Relationships
(Error)-[:OCCURS_IN]->(Function)
(Error)-[:CAUSED_BY]->(Commit)
(Fix)-[:RESOLVES]->(Error)
(Error)-[:SIMILAR_TO {similarity}]->(Error)

// Development Relationships
(Developer)-[:CONTRIBUTED_TO {commits, lines}]->(Component)
(Commit)-[:MODIFIES]->(File)
(Issue)-[:REPORTS]->(Error)
(PullRequest)-[:FIXES]->(Issue)
(Commit)-[:INTRODUCES]->(Dependency)

// Build Relationships
(CMakeTarget)-[:DEPENDS_ON]->(CMakeTarget)
(TestCase)-[:TESTS]->(Function)
(TestCase)-[:FAILS_ON]->(Platform)
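Loading the DEPENDS_ON edges above reduces to parameterized MERGE statements. The helper below builds the query/params pairs a driver session would run; the function name is illustrative and the eventual RelationshipBuilder may differ:

```javascript
// Build parameterized Cypher MERGE statements for DEPENDS_ON edges.
// Parameterizing (rather than string-interpolating) lets Neo4j cache
// the query plan and avoids breakage from odd component names.
function buildDependsOnQueries(edges) {
  const query =
    'MERGE (a:Component {name: $from}) ' +
    'MERGE (b:Component {name: $to}) ' +
    'MERGE (a)-[:DEPENDS_ON {type: $type}]->(b)';
  return edges.map(({ from, to, type }) => ({
    query,
    params: { from, to, type: type || 'build' },
  }));
}

const queries = buildDependsOnQueries([
  { from: 'gdas.cd', to: 'oops', type: 'source' },
  { from: 'ufs_model.fd', to: 'MOM6' }, // type defaults to 'build'
]);
```

MERGE rather than CREATE makes re-ingestion idempotent: re-running the pipeline updates the graph instead of duplicating nodes and edges.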

Implementation Phases

Phase 0: Proof of Concept (2 Days - Weekend Project)

Goal: Demonstrate value before full commitment

# Deploy Neo4j
docker run -d \
  --name neo4j-gfs \
  -p 7474:7474 -p 7687:7687 \
  -v /mcp_rag_eib/data/neo4j:/data \
  -v /mcp_rag_eib/data/neo4j/logs:/logs \
  -e NEO4J_AUTH=neo4j/gfsworkflow2025 \
  -e NEO4J_PLUGINS='["apoc", "graph-data-science"]' \
  neo4j:latest

# Minimal ingestion script
node scripts/poc/ingest-submodules.js
# - Parse .gitmodules recursively
# - Create Component nodes (50+ repos)
# - Parse top-level CMakeLists.txt
# - Create DEPENDS_ON relationships

# Create demo queries
scripts/poc/demo-queries.cypher

Success Criteria:

  • Visualize full GFS component graph in Neo4j Browser (http://localhost:7474)
  • Answer 3 questions ChromaDB cannot:
    1. "Show dependency chain from FV3 to GSW library"
    2. "Which components would break if CRTM is removed?"
    3. "What's the build order for compiling UFS?"
  • Present results to team β†’ Decision: Continue or Abort

Effort: 16 hours (2 full days)
Risk: Low - Throwaway prototype
Exit Strategy: If unimpressive, delete Neo4j container, continue with ChromaDB only
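Demo question 3 ("What's the build order for compiling UFS?") is a topological sort over the DEPENDS_ON edges. Once the POC graph exists, the same answer can be checked offline with a few lines of Kahn's algorithm; component names below are simplified placeholders:

```javascript
// Kahn's algorithm: derive a valid build order from DEPENDS_ON edges.
// deps[x] lists what x depends on, so dependencies come out first.
function buildOrder(deps) {
  const inDegree = new Map();
  const dependents = new Map();
  for (const [comp, reqs] of Object.entries(deps)) {
    if (!inDegree.has(comp)) inDegree.set(comp, 0);
    for (const r of reqs) {
      if (!inDegree.has(r)) inDegree.set(r, 0);
      inDegree.set(comp, inDegree.get(comp) + 1);
      if (!dependents.has(r)) dependents.set(r, []);
      dependents.get(r).push(comp);
    }
  }
  const queue = [...inDegree].filter(([, d]) => d === 0).map(([c]) => c);
  const order = [];
  while (queue.length) {
    const c = queue.shift();
    order.push(c);
    for (const d of dependents.get(c) || []) {
      inDegree.set(d, inDegree.get(d) - 1);
      if (inDegree.get(d) === 0) queue.push(d);
    }
  }
  return order; // shorter than the node count if the graph has a cycle
}

const order = buildOrder({
  ufs_model: ['fms', 'esmf'],
  fms: ['esmf'],
});
// esmf first, then fms, then ufs_model.
```

The cycle check falls out for free: if the returned order is shorter than the node count, the dependency graph has a cycle worth flagging.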


Phase 1: Core Infrastructure (Week 1-2)

Parallel with existing ingestion work

# 1. Production Neo4j Setup
- Persistent storage configuration
- Backup strategy
- Security hardening (authentication, network isolation)
- Monitoring integration

# 2. Schema Implementation
/mcp_rag_eib/mcp_server_node/src/neo4j/
β”œβ”€β”€ schema.cypher              # Full schema definition
β”œβ”€β”€ constraints.cypher          # Indexes and uniqueness constraints
β”œβ”€β”€ Neo4jClient.js             # Connection and query wrapper
└── SchemaValidator.js         # Verify graph integrity

# 3. Basic Ingestion Pipeline
/mcp_rag_eib/mcp_server_node/src/ingestion/neo4j/
β”œβ”€β”€ SubmoduleIngester.js       # Git submodule structure
β”œβ”€β”€ FileTreeIngester.js        # Directory structure
β”œβ”€β”€ CMakeIngester.js           # Build dependencies
└── RelationshipBuilder.js     # Link nodes together

# 4. MCP Tool Integration
/mcp_rag_eib/mcp_server_node/src/tools/
β”œβ”€β”€ graph_query.js             # Query Neo4j from MCP
β”œβ”€β”€ dependency_analysis.js     # Dependency chain analysis
└── impact_analysis.js         # "What breaks if I change X?"

Deliverables:

  • Neo4j running in production configuration
  • All 50+ components in graph with DEPENDS_ON relationships
  • 3 new MCP tools using graph queries
  • Integration tests passing

Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Low - Well-defined scope


Phase 2: Code Structure (Week 3-4)

Deep code understanding

# Source Code Parsing
/mcp_rag_eib/mcp_server_node/src/parsers/
β”œβ”€β”€ FortranParser.js           # tree-sitter-fortran
β”œβ”€β”€ PythonParser.js            # tree-sitter-python
β”œβ”€β”€ CppParser.js               # tree-sitter-cpp
└── ASTtoGraph.js              # Convert AST β†’ Neo4j nodes

# Ingestion Jobs
- Parse all Python files β†’ Function/Class nodes
- Parse Fortran files β†’ Subroutine/Module nodes
- Parse C/C++ files β†’ Function/Class nodes
- Extract CALL relationships (Fortran)
- Extract import/include relationships
- Link Functions to Files to Components

# Query Capabilities Unlocked
- "Find all functions that call MPI_Send"
- "Show me the call graph for model_advance"
- "Which files import this module?"

Deliverables:

  • 50,000+ Function/Subroutine nodes
  • 200,000+ CALLS relationships
  • Call graph visualization
  • Cross-language dependency tracking

Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Medium - Parser complexity for Fortran


Phase 3: Error Intelligence (Week 5-6)

Connect errors to code

# Error Analysis Components
/mcp_rag_eib/mcp_server_node/src/error-analysis/
β”œβ”€β”€ ErrorSignatureExtractor.js   # Parse error messages
β”œβ”€β”€ StackTraceParser.js          # Link to code locations
β”œβ”€β”€ ErrorGraphBuilder.js         # Create Error nodes + relationships
└── SolutionMatcher.js           # Find similar errors + fixes

# Ingestion Pipeline
- Ingest historical error logs (1+ year)
- Extract error signatures and stack traces
- Link errors to Functions (from stack traces)
- Link errors to Commits (when fixes applied)
- Build similarity graph between errors

# Advanced Queries
- "Find all errors similar to this one"
- "What commits fixed errors like this?"
- "Show error frequency by component over time"
- "Which developers have expertise fixing this error type?"

Deliverables:

  • 10,000+ historical errors in graph
  • Error β†’ Function β†’ Component links
  • Error similarity network
  • Solution success rate tracking

Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Medium - Depends on error log quality
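The SIMILAR_TO edges can be seeded cheaply before any embedding model is involved, for example with Jaccard similarity over signature tokens. This is a first-pass heuristic to bootstrap the graph, not a replacement for the ChromaDB embeddings:

```javascript
// Jaccard similarity between two error signatures, used as a cheap
// first pass when seeding SIMILAR_TO edges in the error graph.
function signatureSimilarity(a, b) {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Made-up signatures sharing the failure mode but not the component:
const s = signatureSimilarity(
  'FATAL MPI_Abort in soca_ocean_solver',
  'FATAL MPI_Abort in ufo_radiance_tlad'
);
// 3 shared tokens of 5 total -> 0.6
```

Edges above a threshold (say 0.5) get written with the score as the similarity property on SIMILAR_TO, then refined later by embedding distance.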


Phase 4: Full Integration & Optimization (Week 7-8)

Production-ready hybrid system

# Hybrid Query Engine
/mcp_rag_eib/mcp_server_node/src/hybrid/
β”œβ”€β”€ HybridQueryOrchestrator.js   # Combines ChromaDB + Neo4j
β”œβ”€β”€ ContextAssembler.js          # Merges results for LLM
β”œβ”€β”€ QueryRouter.js               # Decides which DB to use
└── CacheManager.js              # Query result caching

# Performance Optimization
- Neo4j query optimization
- Index tuning (PROFILE queries)
- Parallel query execution
- Result pagination
- Cache frequent queries

# MCP Tool Enhancement
- Update all existing tools to use hybrid queries
- Add graph visualization endpoints
- Implement explain query results
- Add query performance metrics

# Testing & Validation
- Load testing (1000+ concurrent queries)
- Accuracy validation (does it help debug faster?)
- User acceptance testing

Deliverables:

  • Hybrid query system operational
  • All MCP tools using optimal DB for each query
  • Performance benchmarks met (<500ms P95)
  • Documentation and training materials

Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Low - Integration testing
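The QueryRouter's job can be approximated with a keyword heuristic before anything smarter exists: structural phrasing goes to Neo4j, temporal phrasing to PostgreSQL, everything else to ChromaDB. The keyword lists below are guesses meant to be tuned, not a final routing policy:

```javascript
// Route a natural-language query to the store best suited to answer it.
// Keyword lists are illustrative starting points.
const STRUCTURAL = ['depends', 'dependency', 'calls', 'call graph',
  'imports', 'build order', 'breaks if', 'affected by'];
const TEMPORAL = ['recent', 'last week', 'history', 'over time'];

function routeQuery(text) {
  const q = text.toLowerCase();
  if (STRUCTURAL.some((k) => q.includes(k))) return 'neo4j';
  if (TEMPORAL.some((k) => q.includes(k))) return 'postgresql';
  return 'chromadb'; // semantic search is the default
}

// routeQuery('What breaks if CRTM is removed?') -> 'neo4j'
// routeQuery('How do I configure SOCA?')        -> 'chromadb'
```

A router this simple is also a useful baseline: the hybrid orchestrator only earns its complexity if it measurably beats keyword routing.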


Resource Requirements

Storage

Neo4j Database:
- Nodes: ~100,000 (components, functions, errors)
- Relationships: ~500,000 (calls, depends, fixes)
- Estimated size: 5-7 GB
- Growth rate: +500 MB/month

Total with ChromaDB: 13-17 GB of 25 GB available βœ…

Compute

Neo4j Memory: 2-4 GB RAM recommended
Current system: 64 GB RAM βœ… Plenty headroom
CPU: Negligible (<5% typical usage)

Developer Time

Phase 0 (POC):        2 days
Phase 1 (Core):       2 weeks
Phase 2 (Code):       2 weeks  
Phase 3 (Errors):     1.5 weeks
Phase 4 (Optimize):   2 weeks
─────────────────────────────
Total:                8 weeks (1 developer, full-time)
       or             16 weeks (1 developer, 50% time)

Success Metrics

Technical Performance

  • Query latency: <500ms (P95)
  • Graph size: 100K nodes, 500K relationships
  • Update frequency: Daily incremental
  • Uptime: 99.9%

User Impact

  • Error diagnosis time: <15 minutes (from hours)
  • Answer accuracy: >90% correct root cause
  • Developer satisfaction: 8+/10 rating
  • Query coverage: Handle 95% of structural questions

Research Value

  • Novel methodology: Publishable in software engineering venues
  • Institutional knowledge: Captured and queryable
  • Onboarding: New developers productive in days, not months

Risk Mitigation

Risk                          Probability  Impact  Mitigation
Neo4j performance issues      Low          Medium  POC validates before commitment
Parser failures (Fortran)     Medium       Medium  Use tree-sitter, fall back to regex
Integration complexity        Low          High    Phased approach, test each phase
Developer time unavailable    Medium       High    Extend timeline, reduce scope
Graph becomes unmaintainable  Low          High    Automated updates, schema versioning

Exit Criteria

Abort if:

  • POC fails to demonstrate value (Phase 0)
  • Phase 1 takes >3 weeks (scope too large)
  • Query performance unacceptable after optimization
  • Maintenance burden exceeds 4 hours/week

Continue if:

  • POC impresses team (visualizations, query results)
  • Phases stay on schedule (Β±20%)
  • Developers report significant time savings
  • Questions previously impossible now answered

Decision Point: End of Phase 0

Review criteria:

  1. Can Neo4j answer 3+ structural questions ChromaDB cannot? YES/NO
  2. Is the dependency graph visually impressive and useful? YES/NO
  3. Do queries return results in <1 second? YES/NO
  4. Is the team excited to continue? YES/NO

Proceed to Phase 1 if: 3+ YES
Abort if: 2+ NO


Integration with Existing Work

How This Enhances Current Architecture

  Enhanced Ingestion Pipeline (Current)
  β”œβ”€β”€ Documentation β†’ ChromaDB βœ…
  β”œβ”€β”€ Source Code β†’ ChromaDB βœ…
  β”œβ”€β”€ Error Logs β†’ ChromaDB βœ…
+ β”œβ”€β”€ Source Code β†’ Neo4j (structure) πŸ†•
+ β”œβ”€β”€ Dependencies β†’ Neo4j (graph) πŸ†•
+ β”œβ”€β”€ Error Links β†’ Neo4j (relationships) πŸ†•
  └── GitHub Data β†’ ChromaDB βœ…
+ └── GitHub Data β†’ Neo4j (developer graph) πŸ†•

  MCP Tools (Enhanced)
  β”œβ”€β”€ search_documentation (ChromaDB) βœ…
  β”œβ”€β”€ search_code (ChromaDB) βœ…
  β”œβ”€β”€ analyze_error (ChromaDB + Neo4j) πŸ”„
+ β”œβ”€β”€ analyze_dependencies (Neo4j) πŸ†•
+ β”œβ”€β”€ impact_analysis (Neo4j) πŸ†•
+ β”œβ”€β”€ find_similar_code (ChromaDB + Neo4j) πŸ”„
+ └── trace_call_chain (Neo4j) πŸ†•

No Disruption to Existing Work

  • ChromaDB ingestion continues unchanged
  • LangFlow workflows remain functional
  • New Neo4j tools added alongside existing ones
  • Gradual migration to hybrid queries
  • Fallback to ChromaDB-only if Neo4j unavailable

Next Actions (Updated)

Immediate (This Week)

  1. βœ… Decision: Approve Neo4j integration concept
  2. πŸ”² Bootstrap: Add Neo4j to docker-compose.yml
  3. πŸ”² POC Weekend: Block 2 days for Phase 0 implementation
  4. πŸ”² Team Review: Present POC results, decide on Phase 1

Short-term (Weeks 1-2)

  1. πŸ”² Phase 1 Start: If POC approved
  2. πŸ”² Parallel Work: Continue ChromaDB ingestion (doesn't block)
  3. πŸ”² Documentation: Update architecture diagrams

Medium-term (Weeks 3-8)

  1. πŸ”² Phases 2-4: Execute according to plan
  2. πŸ”² Weekly Reviews: Track progress, adjust timeline
  3. πŸ”² Integration Testing: Validate hybrid queries

Status: Architecture Extended with Neo4j Graph Database Strategy
Decision: Proceed with Phase 0 Proof of Concept (2 days)
Priority: HIGH - Enables structural queries impossible with vectors alone
Updated: 2025-10-15 15:30 UTC
Next Milestone: Phase 0 POC completion and team review