ENHANCED_INGESTION_ARCHITECTURE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Enhanced Ingestion Architecture - Context7-Inspired Design

Date: 2025-10-10
Status: Design Phase
Purpose: Comprehensive RAG ingestion for v17 coupled modeling system with error analysis capability

Vision: Error Analysis System

Primary Use Case

Intelligent Error Diagnosis and Resolution

Error Log β†’ MCP Error Tool β†’ Multi-Source RAG Query β†’ ChromaDB
              ↓
    [Parallel Search Across:]
    β”œβ”€ Historical error logs (1+ year training data)
    β”œβ”€ Source code context (full submodule tree)
    β”œβ”€ Official documentation
    β”œβ”€ GitHub issues/PRs (past incidents & solutions)
    β”œβ”€ Build system knowledge
    β”œβ”€ Test results & regression patterns
    └─ Workflow dependencies
              ↓
    [LLM Analysis & Synthesis:]
    β”œβ”€ Root cause identification
    β”œβ”€ Similar past incidents (with solutions)
    β”œβ”€ Code location pinpointing
    β”œβ”€ Step-by-step fix instructions
    β”œβ”€ Preventive recommendations
    └─ Related component analysis

Why This Matters

  • Time Savings: Reduce debugging from hours to minutes
  • Knowledge Retention: Institutionalize solutions from experienced developers
  • Pattern Recognition: Detect recurring issues across the coupled system
  • Proactive Prevention: Identify potential issues before they manifest
  • Cross-Component Understanding: Trace errors through UFS β†’ GSI β†’ GDAS β†’ GFS pipeline

Submodule Ecosystem - v17 Coupled System

βœ… Cloned Submodules (Complete)

global-workflow (root)
β”œβ”€β”€ dev/ci/scripts/utils/Rocoto          # Workflow engine
β”œβ”€β”€ sorc/wxflow                          # Python workflow library
β”œβ”€β”€ sorc/gdas.cd                         # Global Data Assimilation
β”‚   β”œβ”€β”€ parm/jcb-algorithms              # JCB algorithms
β”‚   β”œβ”€β”€ parm/jcb-gdas                    # GDAS JCB configs
β”‚   β”œβ”€β”€ sorc/bufr-query                  # BUFR data tools
β”‚   β”œβ”€β”€ sorc/crtm                        # Radiative transfer
β”‚   β”œβ”€β”€ sorc/da-utils                    # DA utilities
β”‚   β”œβ”€β”€ sorc/fv3-jedi                    # FV3 JEDI
β”‚   β”œβ”€β”€ sorc/fv3-jedi-lm                 # FV3 linear model
β”‚   β”œβ”€β”€ sorc/gsibec                      # GSI background error
β”‚   β”œβ”€β”€ sorc/gsw                         # Seawater library
β”‚   β”œβ”€β”€ sorc/ioda                        # Observation data
β”‚   β”œβ”€β”€ sorc/iodaconv                    # IODA converters
β”‚   β”œβ”€β”€ sorc/jcb                         # JCB framework
β”‚   β”œβ”€β”€ sorc/jedicmake                   # JEDI CMake modules
β”‚   β”œβ”€β”€ sorc/land-imsproc                # Land IMS processing
β”‚   β”œβ”€β”€ sorc/land-jediincr               # Land JEDI increments
β”‚   β”œβ”€β”€ sorc/oops                        # OOPS framework
β”‚   β”œβ”€β”€ sorc/saber                       # SABER library
β”‚   β”œβ”€β”€ sorc/soca                        # Ocean analysis
β”‚   β”œβ”€β”€ sorc/spoc                        # SPOC utilities
β”‚   β”œβ”€β”€ sorc/ufo                         # Unified forward operator
β”‚   └── sorc/vader                       # Variable transforms
β”œβ”€β”€ sorc/gfs_utils.fd                    # GFS utilities
β”œβ”€β”€ sorc/gsi_enkf.fd                     # GSI/EnKF system
β”‚   β”œβ”€β”€ fix/                             # GSI fixed files
β”‚   └── fix/build_gsinfo/                # GSI info builder
β”œβ”€β”€ sorc/gsi_monitor.fd                  # GSI monitoring
β”œβ”€β”€ sorc/gsi_utils.fd                    # GSI utilities
β”œβ”€β”€ sorc/ufs_model.fd                    # UFS Weather Model
β”‚   β”œβ”€β”€ AQM/                             # Air Quality Model
β”‚   β”‚   └── src/model/CMAQ               # CMAQ integration
β”‚   β”œβ”€β”€ CDEPS-interface/CDEPS            # Data/Exchange Protocol
β”‚   β”œβ”€β”€ CICE-interface/CICE              # Sea ice model
β”‚   β”‚   └── icepack/                     # Ice column physics
β”‚   β”œβ”€β”€ CMEPS-interface/CMEPS            # Mediator
β”‚   β”œβ”€β”€ CMakeModules/                    # Build system
β”‚   β”œβ”€β”€ GOCART/                          # Aerosol model
β”‚   β”œβ”€β”€ HYCOM-interface/HYCOM            # Ocean model
β”‚   β”œβ”€β”€ LM4-driver/                      # Land model driver
β”‚   β”‚   └── LM4/                         # Land Model v4
β”‚   β”œβ”€β”€ MOM6-interface/MOM6              # Ocean model v6
β”‚   β”‚   β”œβ”€β”€ pkg/CVMix-src/               # Vertical mixing
β”‚   β”‚   └── pkg/GSW-Fortran/             # Seawater equations
β”‚   β”œβ”€β”€ NOAHMP-interface/noahmp          # Noah-MP land model
β”‚   β”œβ”€β”€ UFSATM/                          # UFS Atmosphere
β”‚   β”‚   β”œβ”€β”€ ccpp/framework/              # CCPP framework
β”‚   β”‚   β”œβ”€β”€ ccpp/physics/                # Physics schemes
β”‚   β”‚   β”‚   β”œβ”€β”€ physics/MP/TEMPO/TEMPO   # Microphysics
β”‚   β”‚   β”‚   └── physics/Radiation/RRTMGP # Radiation
β”‚   β”‚   β”œβ”€β”€ fv3/atmos_cubed_sphere/      # FV3 dycore
β”‚   β”‚   β”œβ”€β”€ mpas/MPAS-Model/             # MPAS integration
β”‚   β”‚   └── upp/                         # Unified Post Processor
β”‚   β”œβ”€β”€ WW3/                             # Wave model
β”‚   β”œβ”€β”€ fire_behavior/                   # Fire weather
β”‚   └── stochastic_physics/              # Stochastic schemes
β”œβ”€β”€ sorc/ufs_utils.fd                    # UFS preprocessing
β”‚   └── ccpp-physics/                    # CCPP physics
β”‚       β”œβ”€β”€ physics/MP/TEMPO/TEMPO       # Microphysics
β”‚       └── physics/Radiation/RRTMGP     # Radiation
└── sorc/verif-global.fd                 # Verification tools

Total Repositories: 50+ (root + submodules + nested submodules)
Lines of Code: ~3-5 million estimated
Languages: Fortran, Python, C/C++, Shell, CMake, YAML
Documentation: README.md files, Sphinx docs, Doxygen, inline comments

Context7-Inspired Capabilities

What Context7 Does Well

  1. Intelligent Code Chunking: Semantic-aware splitting (not just line counts)
  2. Relationship Mapping: Track dependencies and interactions
  3. Version Awareness: Monitor code changes over time
  4. Context Window Management: Smart retrieval for LLM context limits
  5. Multi-Repo Knowledge Graphs: Cross-repository understanding
  6. Semantic Code Search: Beyond text matching to intent understanding

Our Enhanced Implementation

1. Intelligent Chunking Strategy

// Context7-inspired chunking
class EnhancedChunker {
  chunkBySemanticBoundaries(code, language) {
    // Respect natural boundaries:
    // - Functions/subroutines
    // - Classes/modules
    // - Documentation blocks
    // - Logical code sections
    
    // Preserve context:
    // - Include function signatures
    // - Keep docstrings with code
    // - Maintain import/use statements
  }
  
  chunkDocumentation(doc) {
    // Semantic sections:
    // - Concept explanations
    // - API documentation
    // - Examples with context
    // - Cross-references preserved
  }
  
  chunkErrorLogs(log) {
    // Structured extraction:
    // - Error message + context lines
    // - Stack traces
    // - Timestamps and metadata
    // - Related log entries
  }
}

2. Multi-Dimensional Indexing

ChromaDB Collections Structure:

1. code_knowledge
   β”œβ”€ Metadata: {repo, path, language, function_name, line_range, commit_hash}
   β”œβ”€ Embeddings: Semantic code understanding
   └─ Content: Code with context (imports, docstrings)

2. documentation
   β”œβ”€ Metadata: {source, type, section, related_code}
   β”œβ”€ Embeddings: Concept understanding
   └─ Content: Documentation with examples

3. error_patterns
   β”œβ”€ Metadata: {timestamp, component, severity, error_type, resolution_status}
   β”œβ”€ Embeddings: Error signature + context
   └─ Content: Full error with stack trace

4. solutions_knowledge
   β”œβ”€ Metadata: {error_hash, fix_commit, author, success_rate}
   β”œβ”€ Embeddings: Solution approach
   └─ Content: Fix description + code changes

5. github_intelligence
   β”œβ”€ Metadata: {issue_number, pr_number, labels, resolution_time}
   β”œβ”€ Embeddings: Problem + solution
   └─ Content: Issue/PR discussion + resolution

6. workflow_dependencies
   β”œβ”€ Metadata: {component, dependency_type, version}
   β”œβ”€ Embeddings: Relationship understanding
   └─ Content: Dependency graph + interactions

7. build_system_knowledge
   β”œβ”€ Metadata: {build_target, compiler, platform}
   β”œβ”€ Embeddings: Build pattern recognition
   └─ Content: CMake configs + build logs

8. test_results
   β”œβ”€ Metadata: {test_name, status, platform, date}
   β”œβ”€ Embeddings: Test pattern + failures
   └─ Content: Test output + expectations
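To make the code_knowledge schema above concrete, the helper below assembles one chunk into the id/document/metadata shape a vector store's add call expects. The function and field values are illustrative, not part of the existing ingesters:

```javascript
// Hypothetical helper: build a store-ready record for one code chunk.
// Field names follow the code_knowledge metadata schema above.
function makeCodeChunkRecord({ repo, path, language, functionName,
                               lineStart, lineEnd, commitHash, source }) {
  return {
    // Stable ID: same chunk at the same commit always dedupes.
    id: `${repo}:${path}:${lineStart}-${lineEnd}@${commitHash}`,
    document: source,
    metadata: {
      repo,
      path,
      language,
      function_name: functionName,
      line_range: `${lineStart}-${lineEnd}`,
      commit_hash: commitHash,
    },
  };
}

// Example: one Fortran subroutine chunk (illustrative values).
const record = makeCodeChunkRecord({
  repo: 'ufs_model.fd',
  path: 'FV3/atmos_model.F90',
  language: 'fortran',
  functionName: 'atmos_model_init',
  lineStart: 120,
  lineEnd: 240,
  commitHash: 'abc1234',
  source: 'subroutine atmos_model_init(...)',
});
```

Keeping the commit hash inside the ID means re-ingesting after a commit produces new records instead of silently overwriting old ones.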

3. GitHub Integration with Authentication

// Enhanced GitHub tools with GH_TOKEN
// Assumes: const { Octokit } = require('@octokit/rest');
class GitHubIngester {
  constructor(token) {
    // Prefer an explicit token; fall back to the GH_TOKEN environment variable
    this.token = token || process.env.GH_TOKEN;
    this.octokit = new Octokit({ auth: this.token });
  }
  
  async ingestRepository(owner, repo) {
    // 1. Full code structure
    const tree = await this.getRepositoryTree(owner, repo);
    
    // 2. Issues (error reports)
    const issues = await this.getIssues(owner, repo, {
      state: 'all',
      labels: ['bug', 'error', 'compilation']
    });
    
    // 3. Pull Requests (solutions)
    const prs = await this.getPullRequests(owner, repo, {
      state: 'closed',
      merged: true
    });
    
    // 4. Commit history (changes over time)
    const commits = await this.getCommits(owner, repo);
    
    // 5. Documentation
    const docs = await this.getDocumentation(owner, repo);
    
    return { tree, issues, prs, commits, docs };
  }
  
  async findRelatedIssues(errorSignature) {
    // Semantic search across issues using embeddings
    const similar = await this.searchIssuesByEmbedding(errorSignature);
    return similar;
  }
}

4. Relationship Graph Builder

// Track relationships across codebase
class RelationshipMapper {
  buildDependencyGraph() {
    return {
      // Code-to-code
      calls: this.extractFunctionCalls(),
      imports: this.extractImports(),
      includes: this.extractIncludes(),
      
      // Code-to-docs
      documented_by: this.linkCodeToDocs(),
      examples_for: this.linkExamplesToCode(),
      
      // Code-to-errors
      fails_in: this.linkErrorsToCode(),
      fixed_by: this.linkFixesToErrors(),
      
      // Component relationships
      depends_on: this.extractComponentDeps(),
      used_by: this.extractUsagePatterns(),
      
      // Build relationships
      compiled_with: this.extractBuildDeps(),
      required_by: this.extractBuildRequirements()
    };
  }
  
  traverseRelationships(startNode, maxDepth = 3) {
    // Context7-style contextual retrieval
    // Given an error, find related code, docs, and past fixes
  }
}
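The empty traverseRelationships method above is essentially a depth-capped breadth-first walk. A standalone sketch over a plain adjacency map (an assumption about the eventual in-memory representation, not the final implementation):

```javascript
// Breadth-first traversal over an in-memory relationship map,
// returning every node reachable from startNode within maxDepth hops,
// along with its hop distance.
function traverseRelationships(graph, startNode, maxDepth = 3) {
  const visited = new Map([[startNode, 0]]);
  let frontier = [startNode];
  for (let depth = 1; depth <= maxDepth && frontier.length; depth++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] || []) {
        if (!visited.has(neighbor)) {
          visited.set(neighbor, depth);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return visited; // Map: node -> hop distance from startNode
}

// Toy slice of the coupled-system graph (component names illustrative).
const deps = {
  'gfs-error': ['gsi_enkf.fd'],
  'gsi_enkf.fd': ['crtm', 'ufs_model.fd'],
  'ufs_model.fd': ['MOM6', 'CICE'],
};
const reachable = traverseRelationships(deps, 'gfs-error', 2);
// Two hops reaches crtm and ufs_model.fd but not MOM6/CICE.
```

The depth cap is what keeps contextual retrieval bounded: given an error node, two or three hops is usually enough context without pulling in the whole graph.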

Ingestion Pipeline Architecture

Phase 1: Repository Structure Analysis (Fast)

# Discover all components
1. Scan .gitmodules recursively
2. Build repository map
3. Identify language distributions
4. Count lines of code per component
5. Generate metadata manifest

Output: repositories_manifest.json
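Step 1 (recursive .gitmodules scanning) is simple enough to sketch directly. This minimal parser, an illustration rather than the production ingester, pulls name/path/url triples out of git's INI-style format:

```javascript
// Parse a .gitmodules file into { name, path, url } entries.
// .gitmodules is INI-style: [submodule "name"] headers followed by
// indented key = value lines.
function parseGitmodules(text) {
  const entries = [];
  let current = null;
  for (const line of text.split('\n')) {
    const header = line.match(/^\[submodule "(.+)"\]/);
    if (header) {
      current = { name: header[1] };
      entries.push(current);
    } else if (current) {
      const kv = line.match(/^\s*(path|url)\s*=\s*(.+)$/);
      if (kv) current[kv[1]] = kv[2].trim();
    }
  }
  return entries;
}

const sample = [
  '[submodule "sorc/wxflow"]',
  '\tpath = sorc/wxflow',
  '\turl = https://github.com/NOAA-EMC/wxflow',
].join('\n');
const mods = parseGitmodules(sample);
```

Recursing is then a matter of re-running the parser on each submodule's own .gitmodules until no new entries appear.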

Phase 2: Documentation Ingestion (Medium Priority)

# Already partially complete via Claude CLI
1. Extract all README.md files
2. Find Sphinx/Doxygen documentation
3. Parse inline code documentation
4. Extract docstrings/comments
5. Identify examples and tutorials

Collections: documentation

Phase 3: Source Code Ingestion (High Priority)

# Semantic code understanding
1. Parse code by language (tree-sitter)
2. Extract functions/subroutines/classes
3. Preserve context (imports, dependencies)
4. Chunk semantically (not by line count)
5. Generate embeddings with code context

Collections: code_knowledge
Embedding Model: StarCoder or CodeBERT
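For Fortran, "chunk semantically" mostly means splitting at subroutine/function boundaries. The regex pass below is a rough stand-in for a proper tree-sitter parse, enough to show the idea (a real grammar handles interfaces, nested procedures, and continuation lines this misses):

```javascript
// Split Fortran source into one chunk per subroutine, keeping the
// full body so comments and declarations travel with the code.
function chunkFortranSubroutines(source) {
  const re = /^\s*subroutine\s+(\w+)[\s\S]*?^\s*end\s+subroutine\b.*$/gim;
  const chunks = [];
  let m;
  while ((m = re.exec(source)) !== null) {
    chunks.push({ name: m[1], text: m[0] });
  }
  return chunks;
}

const src = [
  'module demo',
  'contains',
  'subroutine init(n)',
  '  integer :: n',
  'end subroutine init',
  'subroutine advance(dt)',
  '  real :: dt',
  'end subroutine advance',
  'end module demo',
].join('\n');
const chunks = chunkFortranSubroutines(src);
// Two chunks: init and advance, each with its full body.
```

Each chunk then gets its enclosing module name and use statements prepended before embedding, per the "preserve context" rule above.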

Phase 4: GitHub Intelligence (Critical for Errors)

# Historical context from GitHub
1. Fetch all issues (especially bugs)
2. Fetch closed PRs (solutions)
3. Extract commit messages
4. Link issues to code changes
5. Build solution database

Collections: github_intelligence, solutions_knowledge
Rate Limit: Authenticated = 5000 req/hr

Phase 5: Error Log Training Data (Error Analysis Core)

# 1+ year of error logs
1. Collect all error logs by component
2. Parse error messages + stack traces
3. Extract error signatures
4. Link to code locations
5. Link to successful fixes
6. Build error taxonomy

Collections: error_patterns
Priority: HIGH (primary use case)
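Step 3 (error signature extraction) typically normalizes away run-specific noise, such as timestamps, addresses, and rank numbers, so recurring failures collapse to the same signature. A hedged sketch of that normalization:

```javascript
// Normalize an error line into a stable signature by stripping
// run-specific noise: timestamps, hex addresses, and bare numbers.
function extractErrorSignature(line) {
  return line
    .replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*/g, '<TS>')
    .replace(/0x[0-9a-fA-F]+/g, '<ADDR>')
    .replace(/\b\d+\b/g, '<N>')
    .trim();
}

// Two occurrences of the same failure from different runs (made-up lines):
const a = extractErrorSignature(
  '2025-10-10 03:15:22 FATAL: MPI_Abort at 0x7f3a2c rank 512'
);
const b = extractErrorSignature(
  '2025-10-11 14:02:09 FATAL: MPI_Abort at 0x55de10 rank 64'
);
// Both collapse to '<TS> FATAL: MPI_Abort at <ADDR> rank <N>'
```

Signatures like these become the grouping key for the error_patterns collection and the error_hash used in solutions_knowledge.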

Phase 6: Build System Knowledge

# Compilation and linking context
1. Parse CMakeLists.txt files
2. Extract build dependencies
3. Collect successful build logs
4. Collect failed build logs
5. Map build targets to source

Collections: build_system_knowledge

Phase 7: Test Results & Regression Data

# Test patterns and failures
1. Collect CTest results
2. Parse regression test outputs
3. Link failures to code changes
4. Track test history
5. Identify flaky tests

Collections: test_results

Phase 8: Relationship Mapping (Integration)

# Connect everything
1. Build code dependency graph
2. Map errors to code locations
3. Link docs to implementations
4. Connect issues to fixes
5. Build component interaction map

Collections: workflow_dependencies
Output: knowledge_graph.json

Enhanced Ingestion Scripts

Directory Structure

/mcp_rag_eib/mcp_server_node/src/ingestion/
β”œβ”€β”€ EnhancedIngester.js              # Main orchestrator
β”œβ”€β”€ ContentExtractor.js              # Existing (keep)
β”œβ”€β”€ DocumentationIngester.js         # Existing (keep)
β”œβ”€β”€ URLFetcher.js                    # Existing (keep)
β”œβ”€β”€ CodeChunker.js                   # NEW - Context7-inspired
β”œβ”€β”€ GitHubIngester.js                # NEW - With auth
β”œβ”€β”€ ErrorLogIngester.js              # NEW - Error analysis
β”œβ”€β”€ RelationshipMapper.js            # NEW - Dependency graphs
β”œβ”€β”€ SemanticChunker.js               # NEW - Smart chunking
└── IngestionOrchestrator.js         # NEW - Pipeline manager

Environment Requirements

# Add to mcp_env.sh
export GH_TOKEN="${GH_TOKEN:-}"                    # GitHub authentication
export INGESTION_BATCH_SIZE="100"                  # Batch processing
export EMBEDDING_MODEL="StarCoder"                 # Code embeddings
export MAX_CHUNK_SIZE="2000"                       # Token limit per chunk
export MIN_CHUNK_OVERLAP="200"                     # Context overlap
export ERROR_LOG_PATH="/path/to/error/logs"        # Error log location
export GITHUB_ORG="NOAA-EMC"                       # Primary org
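MAX_CHUNK_SIZE and MIN_CHUNK_OVERLAP feed a sliding-window fallback used when no semantic boundary is available. The behavior they control looks roughly like this, with token counting simplified to whitespace words for illustration:

```javascript
// Sliding-window chunking with overlap: the fallback when no semantic
// boundary exists. "Tokens" are approximated by whitespace words here;
// maxChunk must exceed overlap or the window never advances.
function slidingWindowChunks(text, maxChunk = 2000, overlap = 200) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const step = maxChunk - overlap;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxChunk).join(' '));
    if (start + maxChunk >= tokens.length) break;
  }
  return chunks;
}

// 10 words, window of 4, overlap of 2 -> windows start at 0, 2, 4, 6.
const words = 'a b c d e f g h i j';
const out = slidingWindowChunks(words, 4, 2);
// out[0] === 'a b c d', out[1] === 'c d e f'
```

The overlap is what keeps a retrieved chunk intelligible when the matching sentence sits at a window edge.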

Ingestion Workflow

Step-by-Step Execution

1. Initialize Environment

source /mcp_rag_eib/SETUP/mcp_env.sh
cd /mcp_rag_eib/mcp_server_node

2. Run Repository Analysis

node src/ingestion/IngestionOrchestrator.js analyze \
  --repo-root $GIT_REPO \
  --output repositories_manifest.json

3. Ingest Documentation (Fast Start)

node src/ingestion/IngestionOrchestrator.js ingest-docs \
  --manifest repositories_manifest.json \
  --collection documentation \
  --batch-size 100

4. Ingest Source Code (Parallel by Language)

# Fortran
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language fortran --parallel 4

# Python
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language python --parallel 4

# C/C++
node src/ingestion/IngestionOrchestrator.js ingest-code \
  --language c,cpp --parallel 4

5. Ingest GitHub Intelligence

# Requires GH_TOKEN
node src/ingestion/IngestionOrchestrator.js ingest-github \
  --org NOAA-EMC \
  --repos global-workflow,ufs-weather-model,GSI \
  --include issues,prs,commits

6. Ingest Error Logs (Critical!)

node src/ingestion/IngestionOrchestrator.js ingest-errors \
  --log-path /path/to/error/logs \
  --time-range "1y" \
  --extract-solutions

7. Build Relationships

node src/ingestion/IngestionOrchestrator.js build-graph \
  --output knowledge_graph.json

Performance Targets

Ingestion Speed

  • Documentation: ~1000 docs/minute (text-based, fast)
  • Source Code: ~100 files/minute (parsing required)
  • GitHub Data: ~50 issues/minute (API limited)
  • Error Logs: ~500 logs/minute (structured parsing)

Storage Estimates

  • Code Knowledge: ~10GB (3M LOC + embeddings)
  • Documentation: ~1GB
  • Error Patterns: ~5GB (1 year of logs)
  • GitHub Intelligence: ~2GB
  • Relationship Graph: ~500MB
  • Total: ~20GB of 25GB available

Query Performance

  • Single error lookup: <100ms
  • Related code search: <200ms
  • Cross-component analysis: <500ms
  • Full context assembly: <1s

Success Metrics

Error Analysis Use Case

  1. Diagnosis Accuracy: >90% correct root cause identification
  2. Solution Relevance: >80% actionable solutions found
  3. Time Savings: Reduce debugging from hours to <15 minutes
  4. Pattern Detection: Identify 100% of recurring issues
  5. Coverage: Handle errors across all 50+ components

System Health

  1. Ingestion Completeness: 100% of submodules indexed
  2. Update Frequency: Daily incremental updates
  3. Query Response: <1s for 95th percentile
  4. Uptime: 99.9% availability

Implementation Timeline

Week 1: Core Infrastructure (Current)

  • ChromaDB setup on persistent storage
  • LangFlow deployment
  • Submodules cloned
  • Enhanced ingestion scripts created
  • GitHub authentication configured

Week 2: Initial Ingestion

  • Documentation ingestion complete
  • Source code ingestion (Python, Shell)
  • GitHub issues/PRs indexed
  • Basic error log ingestion

Week 3: Advanced Features

  • Fortran/C++ code ingestion
  • Relationship graph building
  • Error analysis tool implementation
  • Cross-component search

Week 4: Refinement & Testing

  • Performance optimization
  • Error analysis validation
  • LLM integration testing
  • Production deployment

Next Immediate Steps

  1. Update bootstrap.sh - Add GH_TOKEN export and verification
  2. Create enhanced ingestion scripts - All new modules above
  3. Test documentation ingestion - Validate ChromaDB integration
  4. Configure error log collection - Set up ERROR_LOG_PATH
  5. Begin code ingestion - Start with Python (fastest)

Neo4j Graph Database Integration

Strategic Value Assessment

Decision: βœ… APPROVED - Proceed with Phased Implementation

Rationale:

  • GFS system complexity (50+ repositories, 3-5M LOC) requires graph-based relationship understanding
  • Vector embeddings (ChromaDB) excel at semantic search but cannot answer structural queries:
    • "What components are affected by this change?"
    • "What's the dependency chain causing this error?"
    • "Which CMakeLists.txt needs to link this library?"
  • Emerging best practice in AI-powered developer tools (GitHub Copilot, Sourcegraph, GraphCodeBERT)
  • Manageable complexity with clear ROI (10x faster debugging = 120 hours/year saved)

Hybrid Triple-Store Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MCP RAG Intelligence Layer                      β”‚
β”‚  Orchestrates context assembly from multiple sources         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚              β”‚                β”‚
             β–Ό              β–Ό                β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  ChromaDB   β”‚  β”‚   Neo4j     β”‚  β”‚ PostgreSQL   β”‚
    β”‚  (Vectors)  β”‚  β”‚  (Graph)    β”‚  β”‚ (Time-series)β”‚
     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
    β”‚ Semantic    β”‚  β”‚ Structural  β”‚  β”‚ Temporal     β”‚
    β”‚ - Doc searchβ”‚  β”‚ - Depends   β”‚  β”‚ - Build logs β”‚
    β”‚ - Code sim. β”‚  β”‚ - Calls     β”‚  β”‚ - Test runs  β”‚
    β”‚ - Error sig.β”‚  β”‚ - Imports   β”‚  β”‚ - Metrics    β”‚
    β”‚ - Solutions β”‚  β”‚ - Defines   β”‚  β”‚ - History    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         8-10 GB          5-7 GB           3-5 GB
      (Existing)         (NEW)          (Optional)

Query Strategy Workflow

// Example: Error Analysis Query
async function analyzeError(errorMessage) {
  // Phase 1: Semantic Search (ChromaDB)
  const semanticResults = await chromadb.query({
    collection: 'error_patterns',
    text: errorMessage,
    n_results: 10
  });
  // Returns: Similar errors, solutions, related docs
  
  // Phase 2: Structural Analysis (Neo4j)
  const graphResults = await neo4j.run(`
    MATCH (error:Error {signature: $sig})
          -[:OCCURS_IN]->(func:Function)
          -[:DEFINED_IN]->(file:File)
          -[:BELONGS_TO]->(component:Component)
    MATCH p = shortestPath((component)-[:DEPENDS_ON*1..3]->(deps:Component))
    RETURN component, deps, p
  `, { sig: extractSignature(errorMessage) });
  // Returns: Code location, dependency chain, affected components
  
  // Phase 3: Temporal Context (PostgreSQL - Optional)
  const temporalResults = await postgres.query(`
    SELECT commit_hash, author, timestamp, message
    FROM commits
    WHERE component_id = ANY($1)
      AND timestamp > NOW() - INTERVAL '7 days'
    ORDER BY timestamp DESC
  `, [graphResults.componentIds]);
  // Returns: Recent changes, potential culprits
  
  // Phase 4: LLM Synthesis
  return await llm.analyze({
    semanticContext: semanticResults,
    structuralContext: graphResults,
    temporalContext: temporalResults,
    query: `Diagnose this error and provide fix instructions`
  });
}

Neo4j Schema Design

Node Types

// Core Code Structure
(:Component {name, path, language, loc, description})
(:Module {name, file, language, exports})
(:Function {name, signature, file, line_start, line_end})
(:File {path, language, loc, last_modified})
(:Subroutine {name, module, parameters, file, line_start})

// Build System
(:CMakeTarget {name, type, output})
(:Library {name, path, version, linked_by})
(:Dependency {name, type, version, required_by})

// Development
(:Developer {name, email, expertise_areas})
(:Commit {hash, message, timestamp, author})
(:Issue {number, title, labels, status, resolution})
(:PullRequest {number, title, merged, files_changed})

// Error Intelligence
(:Error {signature, message, severity, frequency})
(:ErrorPattern {category, symptom, cause, solution})
(:Fix {commit_hash, success_rate, application_count})

// Runtime
(:BuildTarget {name, platform, compiler, flags})
(:TestCase {name, status, platform, runtime})

Relationship Types

// Code Relationships
(Function)-[:CALLS]->(Function)
(Module)-[:IMPORTS]->(Module)
(File)-[:INCLUDES]->(File)
(Function)-[:DEFINED_IN]->(File)
(File)-[:BELONGS_TO]->(Component)
(Component)-[:CONTAINS]->(Module)

// Dependency Relationships
(Component)-[:DEPENDS_ON {version, type}]->(Component)
(Library)-[:REQUIRED_BY]->(Component)
(CMakeTarget)-[:LINKS_TO]->(Library)
(BuildTarget)-[:BUILDS]->(Component)

// Error Relationships
(Error)-[:OCCURS_IN]->(Function)
(Error)-[:CAUSED_BY]->(Commit)
(Fix)-[:RESOLVES]->(Error)
(Error)-[:SIMILAR_TO {similarity}]->(Error)

// Development Relationships
(Developer)-[:CONTRIBUTED_TO {commits, lines}]->(Component)
(Commit)-[:MODIFIES]->(File)
(Issue)-[:REPORTS]->(Error)
(PullRequest)-[:FIXES]->(Issue)
(Commit)-[:INTRODUCES]->(Dependency)

// Build Relationships
(CMakeTarget)-[:DEPENDS_ON]->(CMakeTarget)
(TestCase)-[:TESTS]->(Function)
(TestCase)-[:FAILS_ON]->(Platform)
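Loading the DEPENDS_ON edges above reduces to parameterized MERGE statements. The helper below builds the query/params pairs a driver session would run; the function name is illustrative and the eventual RelationshipBuilder may differ:

```javascript
// Build parameterized Cypher MERGE statements for DEPENDS_ON edges.
// Parameterizing (rather than string-interpolating) lets Neo4j cache
// the query plan and avoids breakage from odd component names.
function buildDependsOnQueries(edges) {
  const query =
    'MERGE (a:Component {name: $from}) ' +
    'MERGE (b:Component {name: $to}) ' +
    'MERGE (a)-[:DEPENDS_ON {type: $type}]->(b)';
  return edges.map(({ from, to, type }) => ({
    query,
    params: { from, to, type: type || 'build' },
  }));
}

const queries = buildDependsOnQueries([
  { from: 'gdas.cd', to: 'oops', type: 'source' },
  { from: 'ufs_model.fd', to: 'MOM6' }, // type defaults to 'build'
]);
```

MERGE rather than CREATE makes re-ingestion idempotent: re-running the pipeline updates the graph instead of duplicating nodes and edges.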

Implementation Phases

Phase 0: Proof of Concept (2 Days - Weekend Project)

Goal: Demonstrate value before full commitment

# Deploy Neo4j
docker run -d \
  --name neo4j-gfs \
  -p 7474:7474 -p 7687:7687 \
  -v /mcp_rag_eib/data/neo4j:/data \
  -v /mcp_rag_eib/data/neo4j/logs:/logs \
  -e NEO4J_AUTH=neo4j/gfsworkflow2025 \
  -e NEO4J_PLUGINS='["apoc", "graph-data-science"]' \
  neo4j:latest

# Minimal ingestion script
node scripts/poc/ingest-submodules.js
# - Parse .gitmodules recursively
# - Create Component nodes (50+ repos)
# - Parse top-level CMakeLists.txt
# - Create DEPENDS_ON relationships

# Create demo queries
scripts/poc/demo-queries.cypher

Success Criteria:

  • Visualize full GFS component graph in Neo4j Browser (http://localhost:7474)
  • Answer 3 questions ChromaDB cannot:
    1. "Show dependency chain from FV3 to GSW library"
    2. "Which components would break if CRTM is removed?"
    3. "What's the build order for compiling UFS?"
  • Present results to team β†’ Decision: Continue or Abort

Effort: 16 hours (2 full days)
Risk: Low - Throwaway prototype
Exit Strategy: If unimpressive, delete Neo4j container, continue with ChromaDB only
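Demo question 3 ("What's the build order for compiling UFS?") is a topological sort over the DEPENDS_ON edges. Once the POC graph exists, the same answer can be checked offline with a few lines of Kahn's algorithm; component names below are simplified placeholders:

```javascript
// Kahn's algorithm: derive a valid build order from DEPENDS_ON edges.
// deps[x] lists what x depends on, so dependencies come out first.
function buildOrder(deps) {
  const inDegree = new Map();
  const dependents = new Map();
  for (const [comp, reqs] of Object.entries(deps)) {
    if (!inDegree.has(comp)) inDegree.set(comp, 0);
    for (const r of reqs) {
      if (!inDegree.has(r)) inDegree.set(r, 0);
      inDegree.set(comp, inDegree.get(comp) + 1);
      if (!dependents.has(r)) dependents.set(r, []);
      dependents.get(r).push(comp);
    }
  }
  const queue = [...inDegree].filter(([, d]) => d === 0).map(([c]) => c);
  const order = [];
  while (queue.length) {
    const c = queue.shift();
    order.push(c);
    for (const d of dependents.get(c) || []) {
      inDegree.set(d, inDegree.get(d) - 1);
      if (inDegree.get(d) === 0) queue.push(d);
    }
  }
  return order; // shorter than the node count if the graph has a cycle
}

const order = buildOrder({
  ufs_model: ['fms', 'esmf'],
  fms: ['esmf'],
});
// esmf first, then fms, then ufs_model.
```

The cycle check falls out for free: if the returned order is shorter than the node count, the dependency graph has a cycle worth flagging.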


Phase 1: Core Infrastructure (Week 1-2)

Parallel with existing ingestion work

# 1. Production Neo4j Setup
- Persistent storage configuration
- Backup strategy
- Security hardening (authentication, network isolation)
- Monitoring integration

# 2. Schema Implementation
/mcp_rag_eib/mcp_server_node/src/neo4j/
β”œβ”€β”€ schema.cypher              # Full schema definition
β”œβ”€β”€ constraints.cypher          # Indexes and uniqueness constraints
β”œβ”€β”€ Neo4jClient.js             # Connection and query wrapper
└── SchemaValidator.js         # Verify graph integrity

# 3. Basic Ingestion Pipeline
/mcp_rag_eib/mcp_server_node/src/ingestion/neo4j/
β”œβ”€β”€ SubmoduleIngester.js       # Git submodule structure
β”œβ”€β”€ FileTreeIngester.js        # Directory structure
β”œβ”€β”€ CMakeIngester.js           # Build dependencies
└── RelationshipBuilder.js     # Link nodes together

# 4. MCP Tool Integration
/mcp_rag_eib/mcp_server_node/src/tools/
β”œβ”€β”€ graph_query.js             # Query Neo4j from MCP
β”œβ”€β”€ dependency_analysis.js     # Dependency chain analysis
└── impact_analysis.js         # "What breaks if I change X?"

Deliverables:

  • Neo4j running in production configuration
  • All 50+ components in graph with DEPENDS_ON relationships
  • 3 new MCP tools using graph queries
  • Integration tests passing

Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Low - Well-defined scope


Phase 2: Code Structure (Week 3-4)

Deep code understanding

# Source Code Parsing
/mcp_rag_eib/mcp_server_node/src/parsers/
β”œβ”€β”€ FortranParser.js           # tree-sitter-fortran
β”œβ”€β”€ PythonParser.js            # tree-sitter-python
β”œβ”€β”€ CppParser.js               # tree-sitter-cpp
└── ASTtoGraph.js              # Convert AST β†’ Neo4j nodes

# Ingestion Jobs
- Parse all Python files β†’ Function/Class nodes
- Parse Fortran files β†’ Subroutine/Module nodes
- Parse C/C++ files β†’ Function/Class nodes
- Extract CALL relationships (Fortran)
- Extract import/include relationships
- Link Functions to Files to Components

# Query Capabilities Unlocked
- "Find all functions that call MPI_Send"
- "Show me the call graph for model_advance"
- "Which files import this module?"

Deliverables:

  • 50,000+ Function/Subroutine nodes
  • 200,000+ CALLS relationships
  • Call graph visualization
  • Cross-language dependency tracking

Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Medium - Parser complexity for Fortran


Phase 3: Error Intelligence (Week 5-6)

Connect errors to code

# Error Analysis Components
/mcp_rag_eib/mcp_server_node/src/error-analysis/
β”œβ”€β”€ ErrorSignatureExtractor.js   # Parse error messages
β”œβ”€β”€ StackTraceParser.js          # Link to code locations
β”œβ”€β”€ ErrorGraphBuilder.js         # Create Error nodes + relationships
└── SolutionMatcher.js           # Find similar errors + fixes

# Ingestion Pipeline
- Ingest historical error logs (1+ year)
- Extract error signatures and stack traces
- Link errors to Functions (from stack traces)
- Link errors to Commits (when fixes applied)
- Build similarity graph between errors

# Advanced Queries
- "Find all errors similar to this one"
- "What commits fixed errors like this?"
- "Show error frequency by component over time"
- "Which developers have expertise fixing this error type?"

Deliverables:

  • 10,000+ historical errors in graph
  • Error β†’ Function β†’ Component links
  • Error similarity network
  • Solution success rate tracking

Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Medium - Depends on error log quality
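The SIMILAR_TO edges can be seeded cheaply before any embedding model is involved, for example with Jaccard similarity over signature tokens. This is a first-pass heuristic to bootstrap the graph, not a replacement for the ChromaDB embeddings:

```javascript
// Jaccard similarity between two error signatures, used as a cheap
// first pass when seeding SIMILAR_TO edges in the error graph.
function signatureSimilarity(a, b) {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Made-up signatures sharing the failure mode but not the component:
const s = signatureSimilarity(
  'FATAL MPI_Abort in soca_ocean_solver',
  'FATAL MPI_Abort in ufo_radiance_tlad'
);
// 3 shared tokens of 5 total -> 0.6
```

Edges above a threshold (say 0.5) get written with the score as the similarity property on SIMILAR_TO, then refined later by embedding distance.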


Phase 4: Full Integration & Optimization (Week 7-8)

Production-ready hybrid system

# Hybrid Query Engine
/mcp_rag_eib/mcp_server_node/src/hybrid/
β”œβ”€β”€ HybridQueryOrchestrator.js   # Combines ChromaDB + Neo4j
β”œβ”€β”€ ContextAssembler.js          # Merges results for LLM
β”œβ”€β”€ QueryRouter.js               # Decides which DB to use
└── CacheManager.js              # Query result caching

# Performance Optimization
- Neo4j query optimization
- Index tuning (PROFILE queries)
- Parallel query execution
- Result pagination
- Cache frequent queries

# MCP Tool Enhancement
- Update all existing tools to use hybrid queries
- Add graph visualization endpoints
- Implement explain query results
- Add query performance metrics

# Testing & Validation
- Load testing (1000+ concurrent queries)
- Accuracy validation (does it help debug faster?)
- User acceptance testing

Deliverables:

  • Hybrid query system operational
  • All MCP tools using optimal DB for each query
  • Performance benchmarks met (<500ms P95)
  • Documentation and training materials

Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Low - Integration testing
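The QueryRouter's job can be approximated with a keyword heuristic before anything smarter exists: structural phrasing goes to Neo4j, temporal phrasing to PostgreSQL, everything else to ChromaDB. The keyword lists below are guesses meant to be tuned, not a final routing policy:

```javascript
// Route a natural-language query to the store best suited to answer it.
// Keyword lists are illustrative starting points.
const STRUCTURAL = ['depends', 'dependency', 'calls', 'call graph',
  'imports', 'build order', 'breaks if', 'affected by'];
const TEMPORAL = ['recent', 'last week', 'history', 'over time'];

function routeQuery(text) {
  const q = text.toLowerCase();
  if (STRUCTURAL.some((k) => q.includes(k))) return 'neo4j';
  if (TEMPORAL.some((k) => q.includes(k))) return 'postgresql';
  return 'chromadb'; // semantic search is the default
}

// routeQuery('What breaks if CRTM is removed?') -> 'neo4j'
// routeQuery('How do I configure SOCA?')        -> 'chromadb'
```

A router this simple is also a useful baseline: the hybrid orchestrator only earns its complexity if it measurably beats keyword routing.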


Resource Requirements

Storage

Neo4j Database:
- Nodes: ~100,000 (components, functions, errors)
- Relationships: ~500,000 (calls, depends, fixes)
- Estimated size: 5-7 GB
- Growth rate: +500 MB/month

Total with ChromaDB: 13-17 GB of 25 GB available βœ…

Compute

Neo4j Memory: 2-4 GB RAM recommended
Current system: 64 GB RAM βœ… Plenty headroom
CPU: Negligible (<5% typical usage)

Developer Time

Phase 0 (POC):        2 days
Phase 1 (Core):       2 weeks
Phase 2 (Code):       2 weeks  
Phase 3 (Errors):     1.5 weeks
Phase 4 (Optimize):   2 weeks
─────────────────────────────
Total:                8 weeks (1 developer, full-time)
       or             16 weeks (1 developer, 50% time)

Success Metrics

Technical Performance

  • Query latency: <500ms (P95)
  • Graph size: 100K nodes, 500K relationships
  • Update frequency: Daily incremental
  • Uptime: 99.9%

User Impact

  • Error diagnosis time: <15 minutes (from hours)
  • Answer accuracy: >90% correct root cause
  • Developer satisfaction: 8+/10 rating
  • Query coverage: Handle 95% of structural questions

Research Value

  • Novel methodology: Publishable in software engineering venues
  • Institutional knowledge: Captured and queryable
  • Onboarding: New developers productive in days, not months

Risk Mitigation

Risk                          Probability  Impact  Mitigation
Neo4j performance issues      Low          Medium  POC validates before commitment
Parser failures (Fortran)     Medium       Medium  Use tree-sitter, fall back to regex
Integration complexity        Low          High    Phased approach, test each phase
Developer time unavailable    Medium       High    Extend timeline, reduce scope
Graph becomes unmaintainable  Low          High    Automated updates, schema versioning

Exit Criteria

Abort if:

  • POC fails to demonstrate value (Phase 0)
  • Phase 1 takes >3 weeks (scope too large)
  • Query performance unacceptable after optimization
  • Maintenance burden exceeds 4 hours/week

Continue if:

  • POC impresses team (visualizations, query results)
  • Phases stay on schedule (Β±20%)
  • Developers report significant time savings
  • Questions previously impossible now answered

Decision Point: End of Phase 0

Review criteria:

  1. Can Neo4j answer 3+ structural questions ChromaDB cannot? YES/NO
  2. Is the dependency graph visually impressive and useful? YES/NO
  3. Do queries return results in <1 second? YES/NO
  4. Is the team excited to continue? YES/NO

Proceed to Phase 1 if: 3+ YES
Abort if: 2+ NO


Integration with Existing Work

How This Enhances Current Architecture

  Enhanced Ingestion Pipeline (Current)
  β”œβ”€β”€ Documentation β†’ ChromaDB βœ…
  β”œβ”€β”€ Source Code β†’ ChromaDB βœ…
  β”œβ”€β”€ Error Logs β†’ ChromaDB βœ…
+ β”œβ”€β”€ Source Code β†’ Neo4j (structure) πŸ†•
+ β”œβ”€β”€ Dependencies β†’ Neo4j (graph) πŸ†•
+ β”œβ”€β”€ Error Links β†’ Neo4j (relationships) πŸ†•
  └── GitHub Data β†’ ChromaDB βœ…
+ └── GitHub Data β†’ Neo4j (developer graph) πŸ†•

  MCP Tools (Enhanced)
  β”œβ”€β”€ search_documentation (ChromaDB) βœ…
  β”œβ”€β”€ search_code (ChromaDB) βœ…
  β”œβ”€β”€ analyze_error (ChromaDB + Neo4j) πŸ”„
+ β”œβ”€β”€ analyze_dependencies (Neo4j) πŸ†•
+ β”œβ”€β”€ impact_analysis (Neo4j) πŸ†•
+ β”œβ”€β”€ find_similar_code (ChromaDB + Neo4j) πŸ”„
+ └── trace_call_chain (Neo4j) πŸ†•

No Disruption to Existing Work

  • ChromaDB ingestion continues unchanged
  • LangFlow workflows remain functional
  • New Neo4j tools added alongside existing ones
  • Gradual migration to hybrid queries
  • Fallback to ChromaDB-only if Neo4j unavailable

Next Actions (Updated)

Immediate (This Week)

  1. βœ… Decision: Approve Neo4j integration concept
  2. πŸ”² Bootstrap: Add Neo4j to docker-compose.yml
  3. πŸ”² POC Weekend: Block 2 days for Phase 0 implementation
  4. πŸ”² Team Review: Present POC results, decide on Phase 1

Short-term (Weeks 1-2)

  1. πŸ”² Phase 1 Start: If POC approved
  2. πŸ”² Parallel Work: Continue ChromaDB ingestion (doesn't block)
  3. πŸ”² Documentation: Update architecture diagrams

Medium-term (Weeks 3-8)

  1. πŸ”² Phases 2-4: Execute according to plan
  2. πŸ”² Weekly Reviews: Track progress, adjust timeline
  3. πŸ”² Integration Testing: Validate hybrid queries

Status: Architecture Extended with Neo4j Graph Database Strategy
Decision: Proceed with Phase 0 Proof of Concept (2 days)
Priority: HIGH - Enables structural queries impossible with vectors alone
Updated: 2025-10-15 15:30 UTC
Next Milestone: Phase 0 POC completion and team review