ENHANCED_INGESTION_ARCHITECTURE - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Enhanced Ingestion Architecture - Context7-Inspired Design
Date: 2025-10-10
Status: Design Phase
Purpose: Comprehensive RAG ingestion for v17 coupled modeling system with error analysis capability
Vision: Error Analysis System
Primary Use Case
Intelligent Error Diagnosis and Resolution
Error Log → MCP Error Tool → Multi-Source RAG Query → ChromaDB
                                   ↓
                     [Parallel Search Across:]
                     ├─ Historical error logs (1+ year training data)
                     ├─ Source code context (full submodule tree)
                     ├─ Official documentation
                     ├─ GitHub issues/PRs (past incidents & solutions)
                     ├─ Build system knowledge
                     ├─ Test results & regression patterns
                     └─ Workflow dependencies
                                   ↓
                     [LLM Analysis & Synthesis:]
                     ├─ Root cause identification
                     ├─ Similar past incidents (with solutions)
                     ├─ Code location pinpointing
                     ├─ Step-by-step fix instructions
                     ├─ Preventive recommendations
                     └─ Related component analysis
Why This Matters
- Time Savings: Reduce debugging from hours to minutes
- Knowledge Retention: Institutionalize solutions from experienced developers
- Pattern Recognition: Detect recurring issues across the coupled system
- Proactive Prevention: Identify potential issues before they manifest
- Cross-Component Understanding: Trace errors through the UFS → GSI → GDAS → GFS pipeline
Submodule Ecosystem - v17 Coupled System
✅ Cloned Submodules (Complete)
global-workflow (root)
├── dev/ci/scripts/utils/Rocoto   # Workflow engine
├── sorc/wxflow                   # Python workflow library
├── sorc/gdas.cd                  # Global Data Assimilation
│   ├── parm/jcb-algorithms       # JCB algorithms
│   ├── parm/jcb-gdas             # GDAS JCB configs
│   ├── sorc/bufr-query           # BUFR data tools
│   ├── sorc/crtm                 # Radiative transfer
│   ├── sorc/da-utils             # DA utilities
│   ├── sorc/fv3-jedi             # FV3 JEDI
│   ├── sorc/fv3-jedi-lm          # FV3 linear model
│   ├── sorc/gsibec               # GSI background error
│   ├── sorc/gsw                  # Seawater library
│   ├── sorc/ioda                 # Observation data
│   ├── sorc/iodaconv             # IODA converters
│   ├── sorc/jcb                  # JCB framework
│   ├── sorc/jedicmake            # JEDI CMake modules
│   ├── sorc/land-imsproc         # Land IMS processing
│   ├── sorc/land-jediincr        # Land JEDI increments
│   ├── sorc/oops                 # OOPS framework
│   ├── sorc/saber                # SABER library
│   ├── sorc/soca                 # Ocean analysis
│   ├── sorc/spoc                 # SPOC utilities
│   ├── sorc/ufo                  # Unified forward operator
│   └── sorc/vader                # Variable transforms
├── sorc/gfs_utils.fd             # GFS utilities
├── sorc/gsi_enkf.fd              # GSI/EnKF system
│   ├── fix/                      # GSI fixed files
│   └── fix/build_gsinfo/         # GSI info builder
├── sorc/gsi_monitor.fd           # GSI monitoring
├── sorc/gsi_utils.fd             # GSI utilities
├── sorc/ufs_model.fd             # UFS Weather Model
│   ├── AQM/                      # Air Quality Model
│   │   └── src/model/CMAQ        # CMAQ integration
│   ├── CDEPS-interface/CDEPS     # Data/Exchange Protocol
│   ├── CICE-interface/CICE       # Sea ice model
│   │   └── icepack/              # Ice column physics
│   ├── CMEPS-interface/CMEPS     # Mediator
│   ├── CMakeModules/             # Build system
│   ├── GOCART/                   # Aerosol model
│   ├── HYCOM-interface/HYCOM     # Ocean model
│   ├── LM4-driver/               # Land model driver
│   │   └── LM4/                  # Land Model v4
│   ├── MOM6-interface/MOM6       # Ocean model v6
│   │   ├── pkg/CVMix-src/        # Vertical mixing
│   │   └── pkg/GSW-Fortran/      # Seawater equations
│   ├── NOAHMP-interface/noahmp   # Noah-MP land model
│   ├── UFSATM/                   # UFS Atmosphere
│   │   ├── ccpp/framework/       # CCPP framework
│   │   ├── ccpp/physics/         # Physics schemes
│   │   │   ├── physics/MP/TEMPO/TEMPO     # Microphysics
│   │   │   └── physics/Radiation/RRTMGP   # Radiation
│   │   ├── fv3/atmos_cubed_sphere/        # FV3 dycore
│   │   ├── mpas/MPAS-Model/               # MPAS integration
│   │   └── upp/                           # Unified Post Processor
│   ├── WW3/                      # Wave model
│   ├── fire_behavior/            # Fire weather
│   └── stochastic_physics/       # Stochastic schemes
├── sorc/ufs_utils.fd             # UFS preprocessing
│   └── ccpp-physics/             # CCPP physics
│       ├── physics/MP/TEMPO/TEMPO       # Microphysics
│       └── physics/Radiation/RRTMGP     # Radiation
└── sorc/verif-global.fd          # Verification tools
Total Repositories: 50+ (root + submodules + nested submodules)
Lines of Code: ~3-5 million estimated
Languages: Fortran, Python, C/C++, Shell, CMake, YAML
Documentation: README.md files, Sphinx docs, Doxygen, inline comments
Context7-Inspired Capabilities
What Context7 Does Well
- Intelligent Code Chunking: Semantic-aware splitting (not just line counts)
- Relationship Mapping: Track dependencies and interactions
- Version Awareness: Monitor code changes over time
- Context Window Management: Smart retrieval for LLM context limits
- Multi-Repo Knowledge Graphs: Cross-repository understanding
- Semantic Code Search: Beyond text matching to intent understanding
Our Enhanced Implementation
1. Intelligent Chunking Strategy
// Context7-inspired chunking
class EnhancedChunker {
chunkBySemanticBoundaries(code, language) {
// Respect natural boundaries:
// - Functions/subroutines
// - Classes/modules
// - Documentation blocks
// - Logical code sections
// Preserve context:
// - Include function signatures
// - Keep docstrings with code
// - Maintain import/use statements
}
chunkDocumentation(doc) {
// Semantic sections:
// - Concept explanations
// - API documentation
// - Examples with context
// - Cross-references preserved
}
chunkErrorLogs(log) {
// Structured extraction:
// - Error message + context lines
// - Stack traces
// - Timestamps and metadata
// - Related log entries
}
}
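As a concrete illustration of `chunkBySemanticBoundaries`, here is a minimal runnable sketch for Fortran. It is an assumption-laden simplification: a production chunker would use tree-sitter (as planned in Phase 3 below), whereas this regex version only shows the core idea of splitting on subroutine/function boundaries rather than fixed line counts. The function name is hypothetical.

```javascript
// Sketch only: regex-based semantic chunking for Fortran source.
// Splits on subroutine/function/module boundaries instead of line counts,
// keeping the signature line with the body so context is preserved.
function chunkFortranBySubroutine(source) {
  const lines = source.split('\n');
  const chunks = [];
  let current = null;
  const startRe = /^\s*(subroutine|function|module)\s+(\w+)/i;
  const endRe = /^\s*end\s+(subroutine|function|module)\b/i;
  for (const line of lines) {
    const start = line.match(startRe);
    if (start && !current) {
      // Begin a new chunk at a semantic boundary
      current = { name: start[2], kind: start[1].toLowerCase(), body: [line] };
      continue;
    }
    if (current) {
      current.body.push(line);
      if (endRe.test(line)) {
        chunks.push({ ...current, body: current.body.join('\n') });
        current = null;
      }
    }
  }
  return chunks;
}
```

Each chunk carries its `name` and `kind`, which map directly onto the `function_name` metadata field of the `code_knowledge` collection.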
2. Multi-Dimensional Indexing
ChromaDB Collections Structure:
1. code_knowledge
   ├─ Metadata: {repo, path, language, function_name, line_range, commit_hash}
   ├─ Embeddings: Semantic code understanding
   └─ Content: Code with context (imports, docstrings)
2. documentation
   ├─ Metadata: {source, type, section, related_code}
   ├─ Embeddings: Concept understanding
   └─ Content: Documentation with examples
3. error_patterns
   ├─ Metadata: {timestamp, component, severity, error_type, resolution_status}
   ├─ Embeddings: Error signature + context
   └─ Content: Full error with stack trace
4. solutions_knowledge
   ├─ Metadata: {error_hash, fix_commit, author, success_rate}
   ├─ Embeddings: Solution approach
   └─ Content: Fix description + code changes
5. github_intelligence
   ├─ Metadata: {issue_number, pr_number, labels, resolution_time}
   ├─ Embeddings: Problem + solution
   └─ Content: Issue/PR discussion + resolution
6. workflow_dependencies
   ├─ Metadata: {component, dependency_type, version}
   ├─ Embeddings: Relationship understanding
   └─ Content: Dependency graph + interactions
7. build_system_knowledge
   ├─ Metadata: {build_target, compiler, platform}
   ├─ Embeddings: Build pattern recognition
   └─ Content: CMake configs + build logs
8. test_results
   ├─ Metadata: {test_name, status, platform, date}
   ├─ Embeddings: Test pattern + failures
   └─ Content: Test output + expectations
3. GitHub Integration with Authentication
// Enhanced GitHub tools with GH_TOKEN
const { Octokit } = require('@octokit/rest'); // GitHub REST client

class GitHubIngester {
  constructor(token) {
    this.token = process.env.GH_TOKEN || token;
    this.octokit = new Octokit({ auth: this.token });
  }
async ingestRepository(owner, repo) {
// 1. Full code structure
const tree = await this.getRepositoryTree(owner, repo);
// 2. Issues (error reports)
const issues = await this.getIssues(owner, repo, {
state: 'all',
labels: ['bug', 'error', 'compilation']
});
// 3. Pull Requests (solutions)
const prs = await this.getPullRequests(owner, repo, {
state: 'closed',
merged: true
});
// 4. Commit history (changes over time)
const commits = await this.getCommits(owner, repo);
// 5. Documentation
const docs = await this.getDocumentation(owner, repo);
return { tree, issues, prs, commits, docs };
}
async findRelatedIssues(errorSignature) {
// Semantic search across issues using embeddings
const similar = await this.searchIssuesByEmbedding(errorSignature);
return similar;
}
}
4. Relationship Graph Builder
// Track relationships across codebase
class RelationshipMapper {
buildDependencyGraph() {
return {
// Code-to-code
calls: this.extractFunctionCalls(),
imports: this.extractImports(),
includes: this.extractIncludes(),
// Code-to-docs
documented_by: this.linkCodeToDocs(),
examples_for: this.linkExamplesToCode(),
// Code-to-errors
fails_in: this.linkErrorsToCode(),
fixed_by: this.linkFixesToErrors(),
// Component relationships
depends_on: this.extractComponentDeps(),
used_by: this.extractUsagePatterns(),
// Build relationships
compiled_with: this.extractBuildDeps(),
required_by: this.extractBuildRequirements()
};
}
traverseRelationships(startNode, maxDepth = 3) {
// Context7-style contextual retrieval
// Given an error, find related code, docs, and past fixes
}
}
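The `traverseRelationships` stub above can be filled in with a depth-limited breadth-first walk. A minimal sketch follows, using a plain in-memory adjacency map as a stand-in for the output of `buildDependencyGraph()` (the toy graph and function signature are illustrative assumptions, not the final API):

```javascript
// Sketch only: depth-limited BFS over an in-memory adjacency map.
// graph is { node: [neighbor, ...] }; returns nodes reachable from
// startNode within maxDepth hops, in discovery order.
function traverseRelationships(graph, startNode, maxDepth = 3) {
  const visited = new Map([[startNode, 0]]); // node -> depth discovered
  const queue = [startNode];
  while (queue.length > 0) {
    const node = queue.shift();
    const depth = visited.get(node);
    if (depth >= maxDepth) continue; // stop expanding past the horizon
    for (const neighbor of graph[node] || []) {
      if (!visited.has(neighbor)) {
        visited.set(neighbor, depth + 1);
        queue.push(neighbor);
      }
    }
  }
  visited.delete(startNode); // report only related nodes, not the start
  return [...visited.keys()];
}
```

Given an error node, this is the Context7-style move: hop from the error to its function, file, and component, then stop before the context balloons past the LLM window.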
Ingestion Pipeline Architecture
Phase 1: Repository Structure Analysis (Fast)
# Discover all components
1. Scan .gitmodules recursively
2. Build repository map
3. Identify language distributions
4. Count lines of code per component
5. Generate metadata manifest
Output: repositories_manifest.json
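Step 1 of this phase can be sketched as a small parser for the INI-style `.gitmodules` format; recursing into each checked-out path and re-running it yields the full nested map. The function name is hypothetical:

```javascript
// Sketch only: parse .gitmodules content into {name, path, url} records.
// Recursive discovery re-runs this on the .gitmodules of each submodule.
function parseGitmodules(content) {
  const modules = [];
  let current = null;
  for (const raw of content.split('\n')) {
    const line = raw.trim();
    const header = line.match(/^\[submodule "(.+)"\]$/);
    if (header) {
      current = { name: header[1] };
      modules.push(current);
    } else if (current) {
      const kv = line.match(/^(\w+)\s*=\s*(.+)$/); // path = ..., url = ...
      if (kv) current[kv[1]] = kv[2];
    }
  }
  return modules;
}
```

The collected records, plus per-repo language and LOC counts, become the entries of `repositories_manifest.json`.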
Phase 2: Documentation Ingestion (Medium Priority)
# Already partially complete via Claude CLI
1. Extract all README.md files
2. Find Sphinx/Doxygen documentation
3. Parse inline code documentation
4. Extract docstrings/comments
5. Identify examples and tutorials
Collections: documentation
Phase 3: Source Code Ingestion (High Priority)
# Semantic code understanding
1. Parse code by language (tree-sitter)
2. Extract functions/subroutines/classes
3. Preserve context (imports, dependencies)
4. Chunk semantically (not by line count)
5. Generate embeddings with code context
Collections: code_knowledge
Embedding Model: StarCoder or CodeBERT
Phase 4: GitHub Intelligence (Critical for Errors)
# Historical context from GitHub
1. Fetch all issues (especially bugs)
2. Fetch closed PRs (solutions)
3. Extract commit messages
4. Link issues to code changes
5. Build solution database
Collections: github_intelligence, solutions_knowledge
Rate Limit: Authenticated = 5000 req/hr
Phase 5: Error Log Training Data (Error Analysis Core)
# 1+ year of error logs
1. Collect all error logs by component
2. Parse error messages + stack traces
3. Extract error signatures
4. Link to code locations
5. Link to successful fixes
6. Build error taxonomy
Collections: error_patterns
Priority: HIGH (primary use case)
Phase 6: Build System Knowledge
# Compilation and linking context
1. Parse CMakeLists.txt files
2. Extract build dependencies
3. Collect successful build logs
4. Collect failed build logs
5. Map build targets to source
Collections: build_system_knowledge
Phase 7: Test Results & Regression Data
# Test patterns and failures
1. Collect CTest results
2. Parse regression test outputs
3. Link failures to code changes
4. Track test history
5. Identify flaky tests
Collections: test_results
Phase 8: Relationship Mapping (Integration)
# Connect everything
1. Build code dependency graph
2. Map errors to code locations
3. Link docs to implementations
4. Connect issues to fixes
5. Build component interaction map
Collections: workflow_dependencies
Output: knowledge_graph.json
Enhanced Ingestion Scripts
Directory Structure
/mcp_rag_eib/mcp_server_node/src/ingestion/
├── EnhancedIngester.js        # Main orchestrator
├── ContentExtractor.js        # Existing (keep)
├── DocumentationIngester.js   # Existing (keep)
├── URLFetcher.js              # Existing (keep)
├── CodeChunker.js             # NEW - Context7-inspired
├── GitHubIngester.js          # NEW - With auth
├── ErrorLogIngester.js        # NEW - Error analysis
├── RelationshipMapper.js      # NEW - Dependency graphs
├── SemanticChunker.js         # NEW - Smart chunking
└── IngestionOrchestrator.js   # NEW - Pipeline manager
Environment Requirements
# Add to mcp_env.sh
export GH_TOKEN="${GH_TOKEN:-}" # GitHub authentication
export INGESTION_BATCH_SIZE="100" # Batch processing
export EMBEDDING_MODEL="StarCoder" # Code embeddings
export MAX_CHUNK_SIZE="2000" # Token limit per chunk
export MIN_CHUNK_OVERLAP="200" # Context overlap
export ERROR_LOG_PATH="/path/to/error/logs" # Error log location
export GITHUB_ORG="NOAA-EMC" # Primary org
Ingestion Workflow
Step-by-Step Execution
1. Initialize Environment
source /mcp_rag_eib/SETUP/mcp_env.sh
cd /mcp_rag_eib/mcp_server_node
2. Run Repository Analysis
node src/ingestion/IngestionOrchestrator.js analyze \
--repo-root $GIT_REPO \
--output repositories_manifest.json
3. Ingest Documentation (Fast Start)
node src/ingestion/IngestionOrchestrator.js ingest-docs \
--manifest repositories_manifest.json \
--collection documentation \
--batch-size 100
4. Ingest Source Code (Parallel by Language)
# Fortran
node src/ingestion/IngestionOrchestrator.js ingest-code \
--language fortran --parallel 4
# Python
node src/ingestion/IngestionOrchestrator.js ingest-code \
--language python --parallel 4
# C/C++
node src/ingestion/IngestionOrchestrator.js ingest-code \
--language c,cpp --parallel 4
5. Ingest GitHub Intelligence
# Requires GH_TOKEN
node src/ingestion/IngestionOrchestrator.js ingest-github \
--org NOAA-EMC \
--repos global-workflow,ufs-weather-model,GSI \
--include issues,prs,commits
6. Ingest Error Logs (Critical!)
node src/ingestion/IngestionOrchestrator.js ingest-errors \
--log-path /path/to/error/logs \
--time-range "1y" \
--extract-solutions
7. Build Relationships
node src/ingestion/IngestionOrchestrator.js build-graph \
--output knowledge_graph.json
Performance Targets
Ingestion Speed
- Documentation: ~1000 docs/minute (text-based, fast)
- Source Code: ~100 files/minute (parsing required)
- GitHub Data: ~50 issues/minute (API limited)
- Error Logs: ~500 logs/minute (structured parsing)
Storage Estimates
- Code Knowledge: ~10GB (3M LOC + embeddings)
- Documentation: ~1GB
- Error Patterns: ~5GB (1 year of logs)
- GitHub Intelligence: ~2GB
- Relationship Graph: ~500MB
- Total: ~20GB of 25GB available
Query Performance
- Single error lookup: <100ms
- Related code search: <200ms
- Cross-component analysis: <500ms
- Full context assembly: <1s
Success Metrics
Error Analysis Use Case
- Diagnosis Accuracy: >90% correct root cause identification
- Solution Relevance: >80% actionable solutions found
- Time Savings: Reduce debugging from hours to <15 minutes
- Pattern Detection: Identify 100% of recurring issues
- Coverage: Handle errors across all 50+ components
System Health
- Ingestion Completeness: 100% of submodules indexed
- Update Frequency: Daily incremental updates
- Query Response: <1s for 95th percentile
- Uptime: 99.9% availability
Implementation Timeline
Week 1: Core Infrastructure (Current)
- ChromaDB setup on persistent storage
- LangFlow deployment
- Submodules cloned
- Enhanced ingestion scripts created
- GitHub authentication configured
Week 2: Initial Ingestion
- Documentation ingestion complete
- Source code ingestion (Python, Shell)
- GitHub issues/PRs indexed
- Basic error log ingestion
Week 3: Advanced Features
- Fortran/C++ code ingestion
- Relationship graph building
- Error analysis tool implementation
- Cross-component search
Week 4: Refinement & Testing
- Performance optimization
- Error analysis validation
- LLM integration testing
- Production deployment
Next Immediate Steps
- Update bootstrap.sh - Add GH_TOKEN export and verification
- Create enhanced ingestion scripts - All new modules above
- Test documentation ingestion - Validate ChromaDB integration
- Configure error log collection - Set up ERROR_LOG_PATH
- Begin code ingestion - Start with Python (fastest)
Neo4j Graph Database Integration
Strategic Value Assessment
Decision: ✅ APPROVED - Proceed with Phased Implementation
Rationale:
- GFS system complexity (50+ repositories, 3-5M LOC) requires graph-based relationship understanding
- Vector embeddings (ChromaDB) excel at semantic search but cannot answer structural queries:
- "What components are affected by this change?"
- "What's the dependency chain causing this error?"
- "Which CMakeLists.txt needs to link this library?"
- Emerging best practice in AI-powered developer tools (GitHub Copilot, Sourcegraph, GraphCodeBERT)
- Manageable complexity with clear ROI (10x faster debugging = 120 hours/year saved)
Hybrid Triple-Store Architecture
┌───────────────────────────────────────────────────────────────┐
│                   MCP RAG Intelligence Layer                  │
│      Orchestrates context assembly from multiple sources      │
└────────────┬───────────────┬─────────────────┬────────────────┘
             │               │                 │
             ▼               ▼                 ▼
     ┌─────────────┐   ┌─────────────┐   ┌──────────────┐
     │  ChromaDB   │   │    Neo4j    │   │  PostgreSQL  │
     │  (Vectors)  │   │   (Graph)   │   │ (Time-series)│
     ├─────────────┤   ├─────────────┤   ├──────────────┤
     │ Semantic    │   │ Structural  │   │ Temporal     │
     │ - Doc search│   │ - Depends   │   │ - Build logs │
     │ - Code sim. │   │ - Calls     │   │ - Test runs  │
     │ - Error sig.│   │ - Imports   │   │ - Metrics    │
     │ - Solutions │   │ - Defines   │   │ - History    │
     └─────────────┘   └─────────────┘   └──────────────┘
        8-10 GB            5-7 GB            3-5 GB
       (Existing)           (NEW)           (Optional)
Query Strategy Workflow
// Example: Error Analysis Query
async function analyzeError(errorMessage) {
// Phase 1: Semantic Search (ChromaDB)
const semanticResults = await chromadb.query({
collection: 'error_patterns',
text: errorMessage,
n_results: 10
});
// Returns: Similar errors, solutions, related docs
// Phase 2: Structural Analysis (Neo4j)
const graphResults = await neo4j.run(`
MATCH (error:Error {signature: $sig})
-[:OCCURS_IN]->(func:Function)
-[:DEFINED_IN]->(file:File)
-[:BELONGS_TO]->(component:Component)
    MATCH path = shortestPath((component)-[:DEPENDS_ON*1..3]->(deps:Component))
    RETURN component, deps, path
`, { sig: extractSignature(errorMessage) });
// Returns: Code location, dependency chain, affected components
// Phase 3: Temporal Context (PostgreSQL - Optional)
  const temporalResults = await postgres.query(`
    SELECT commit_hash, author, timestamp, message
    FROM commits
    WHERE component_id = ANY($1)
      AND timestamp > NOW() - INTERVAL '7 days'
    ORDER BY timestamp DESC
  `, [graphResults.componentIds]); // parameterized, not interpolated
// Returns: Recent changes, potential culprits
// Phase 4: LLM Synthesis
return await llm.analyze({
semanticContext: semanticResults,
structuralContext: graphResults,
temporalContext: temporalResults,
query: `Diagnose this error and provide fix instructions`
});
}
Neo4j Schema Design
Node Types
// Core Code Structure
(:Component {name, path, language, loc, description})
(:Module {name, file, language, exports})
(:Function {name, signature, file, line_start, line_end})
(:File {path, language, loc, last_modified})
(:Subroutine {name, module, parameters, file, line_start})
// Build System
(:CMakeTarget {name, type, output})
(:Library {name, path, version, linked_by})
(:Dependency {name, type, version, required_by})
// Development
(:Developer {name, email, expertise_areas})
(:Commit {hash, message, timestamp, author})
(:Issue {number, title, labels, status, resolution})
(:PullRequest {number, title, merged, files_changed})
// Error Intelligence
(:Error {signature, message, severity, frequency})
(:ErrorPattern {category, symptom, cause, solution})
(:Fix {commit_hash, success_rate, application_count})
// Runtime
(:BuildTarget {name, platform, compiler, flags})
(:TestCase {name, status, platform, runtime})
Relationship Types
// Code Relationships
(Function)-[:CALLS]->(Function)
(Module)-[:IMPORTS]->(Module)
(File)-[:INCLUDES]->(File)
(Function)-[:DEFINED_IN]->(File)
(File)-[:BELONGS_TO]->(Component)
(Component)-[:CONTAINS]->(Module)
// Dependency Relationships
(Component)-[:DEPENDS_ON {version, type}]->(Component)
(Library)-[:REQUIRED_BY]->(Component)
(CMakeTarget)-[:LINKS_TO]->(Library)
(BuildTarget)-[:BUILDS]->(Component)
// Error Relationships
(Error)-[:OCCURS_IN]->(Function)
(Error)-[:CAUSED_BY]->(Change)
(Fix)-[:RESOLVES]->(Error)
(Error)-[:SIMILAR_TO {similarity}]->(Error)
// Development Relationships
(Developer)-[:CONTRIBUTED_TO {commits, lines}]->(Component)
(Commit)-[:MODIFIES]->(File)
(Issue)-[:REPORTS]->(Error)
(PullRequest)-[:FIXES]->(Issue)
(Commit)-[:INTRODUCES]->(Dependency)
// Build Relationships
(CMakeTarget)-[:DEPENDS_ON]->(CMakeTarget)
(TestCase)-[:TESTS]->(Function)
(TestCase)-[:FAILS_ON]->(Platform)
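With this schema in place, the "what breaks if I change X?" tool reduces to following `DEPENDS_ON` edges in reverse. A sketch of the query builder follows; the helper name is hypothetical, and in the real tool the string would be sent through the neo4j-driver's `session.run(query, params)` rather than asserted on:

```javascript
// Sketch only: build the impact-analysis Cypher against the schema above.
// Finds every Component that transitively depends on $name, up to
// maxDepth hops, i.e. everything that could break if $name changes.
function impactAnalysisQuery(maxDepth = 3) {
  return `
    MATCH (target:Component {name: $name})
    MATCH (dependent:Component)-[:DEPENDS_ON*1..${maxDepth}]->(target)
    RETURN DISTINCT dependent.name AS affected
    ORDER BY affected
  `.trim();
}
```

Using a `$name` parameter (instead of string interpolation) lets Neo4j cache the query plan across components.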
Implementation Phases
Phase 0: Proof of Concept (2 Days - Weekend Project)
Goal: Demonstrate value before full commitment
# Deploy Neo4j
docker run -d \
--name neo4j-gfs \
-p 7474:7474 -p 7687:7687 \
-v /mcp_rag_eib/data/neo4j:/data \
-v /mcp_rag_eib/data/neo4j/logs:/logs \
-e NEO4J_AUTH=neo4j/gfsworkflow2025 \
-e NEO4J_PLUGINS='["apoc", "graph-data-science"]' \
neo4j:latest
# Minimal ingestion script
node scripts/poc/ingest-submodules.js
# - Parse .gitmodules recursively
# - Create Component nodes (50+ repos)
# - Parse top-level CMakeLists.txt
# - Create DEPENDS_ON relationships
# Create demo queries
scripts/poc/demo-queries.cypher
Success Criteria:
- Visualize full GFS component graph in Neo4j Browser (http://localhost:7474)
- Answer 3 questions ChromaDB cannot:
- "Show dependency chain from FV3 to GSW library"
- "Which components would break if CRTM is removed?"
- "What's the build order for compiling UFS?"
- Present results to team → Decision: Continue or Abort
Effort: 16 hours (2 full days)
Risk: Low - Throwaway prototype
Exit Strategy: If unimpressive, delete Neo4j container, continue with ChromaDB only
Phase 1: Core Infrastructure (Week 1-2)
Parallel with existing ingestion work
# 1. Production Neo4j Setup
- Persistent storage configuration
- Backup strategy
- Security hardening (authentication, network isolation)
- Monitoring integration
# 2. Schema Implementation
/mcp_rag_eib/mcp_server_node/src/neo4j/
├── schema.cypher        # Full schema definition
├── constraints.cypher   # Indexes and uniqueness constraints
├── Neo4jClient.js       # Connection and query wrapper
└── SchemaValidator.js   # Verify graph integrity
# 3. Basic Ingestion Pipeline
/mcp_rag_eib/mcp_server_node/src/ingestion/neo4j/
├── SubmoduleIngester.js    # Git submodule structure
├── FileTreeIngester.js     # Directory structure
├── CMakeIngester.js        # Build dependencies
└── RelationshipBuilder.js  # Link nodes together
# 4. MCP Tool Integration
/mcp_rag_eib/mcp_server_node/src/tools/
├── graph_query.js          # Query Neo4j from MCP
├── dependency_analysis.js  # Dependency chain analysis
└── impact_analysis.js      # "What breaks if I change X?"
Deliverables:
- Neo4j running in production configuration
- All 50+ components in graph with DEPENDS_ON relationships
- 3 new MCP tools using graph queries
- Integration tests passing
Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Low - Well-defined scope
Phase 2: Code Structure (Week 3-4)
Deep code understanding
# Source Code Parsing
/mcp_rag_eib/mcp_server_node/src/parsers/
├── FortranParser.js   # tree-sitter-fortran
├── PythonParser.js    # tree-sitter-python
├── CppParser.js       # tree-sitter-cpp
└── ASTtoGraph.js      # Convert AST → Neo4j nodes
# Ingestion Jobs
- Parse all Python files → Function/Class nodes
- Parse Fortran files → Subroutine/Module nodes
- Parse C/C++ files → Function/Class nodes
- Extract CALL relationships (Fortran)
- Extract import/include relationships
- Link Functions to Files to Components
# Query Capabilities Unlocked
- "Find all functions that call MPI_Send"
- "Show me the call graph for model_advance"
- "Which files import this module?"
Deliverables:
- 50,000+ Function/Subroutine nodes
- 200,000+ CALLS relationships
- Call graph visualization
- Cross-language dependency tracking
Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Medium - Parser complexity for Fortran
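The Fortran-parser risk has a cheap fallback, consistent with the mitigation table below: extract `CALLS` edges with a regex pass until the tree-sitter grammar is wired up. A sketch of that fallback (function name hypothetical; it only catches the plain `call name(...)` form, not procedure pointers or type-bound calls):

```javascript
// Sketch only: regex fallback for CALL extraction from Fortran source.
// Returns the distinct set of called procedure names, lowercased
// (Fortran is case-insensitive). A tree-sitter AST pass would replace this.
function extractFortranCalls(source) {
  const calls = new Set();
  const callRe = /^\s*call\s+(\w+)/gim; // start-of-line `call name`
  let m;
  while ((m = callRe.exec(source)) !== null) {
    calls.add(m[1].toLowerCase());
  }
  return [...calls];
}
```

Each `(caller, callee)` pair found this way becomes a `(Function)-[:CALLS]->(Function)` edge, enabling queries like "find all functions that call MPI_Send".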
Phase 3: Error Intelligence (Week 5-6)
Connect errors to code
# Error Analysis Components
/mcp_rag_eib/mcp_server_node/src/error-analysis/
├── ErrorSignatureExtractor.js  # Parse error messages
├── StackTraceParser.js         # Link to code locations
├── ErrorGraphBuilder.js        # Create Error nodes + relationships
└── SolutionMatcher.js          # Find similar errors + fixes
# Ingestion Pipeline
- Ingest historical error logs (1+ year)
- Extract error signatures and stack traces
- Link errors to Functions (from stack traces)
- Link errors to Commits (when fixes applied)
- Build similarity graph between errors
# Advanced Queries
- "Find all errors similar to this one"
- "What commits fixed errors like this?"
- "Show error frequency by component over time"
- "Which developers have expertise fixing this error type?"
Deliverables:
- 10,000+ historical errors in graph
- Error → Function → Component links
- Error similarity network
- Solution success rate tracking
Effort: 60-80 hours (1.5-2 weeks, 1 developer)
Risk: Medium - Depends on error log quality
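The `StackTraceParser` step above can be sketched for the Python-traceback case: pull `(file, line, function)` frames out of a trace so each frame links to a `Function` node. This is an illustrative simplification; Fortran/MPI tracebacks in real workflow logs would need their own patterns.

```javascript
// Sketch only: extract frames from a Python-style traceback.
// Each frame gives the file, line, and function needed to attach
// an (Error)-[:OCCURS_IN]->(Function) edge in the graph.
function parsePythonTraceback(trace) {
  const frameRe = /File "(.+?)", line (\d+), in (\S+)/g;
  const frames = [];
  let m;
  while ((m = frameRe.exec(trace)) !== null) {
    frames.push({ file: m[1], line: Number(m[2]), func: m[3] });
  }
  return frames;
}
```

The innermost frame (last in the list) is usually the best anchor for the `OCCURS_IN` edge; the outer frames give the call context.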
Phase 4: Full Integration & Optimization (Week 7-8)
Production-ready hybrid system
# Hybrid Query Engine
/mcp_rag_eib/mcp_server_node/src/hybrid/
├── HybridQueryOrchestrator.js  # Combines ChromaDB + Neo4j
├── ContextAssembler.js         # Merges results for LLM
├── QueryRouter.js              # Decides which DB to use
└── CacheManager.js             # Query result caching
# Performance Optimization
- Neo4j query optimization
- Index tuning (PROFILE queries)
- Parallel query execution
- Result pagination
- Cache frequent queries
# MCP Tool Enhancement
- Update all existing tools to use hybrid queries
- Add graph visualization endpoints
- Implement explain query results
- Add query performance metrics
# Testing & Validation
- Load testing (1000+ concurrent queries)
- Accuracy validation (does it help debug faster?)
- User acceptance testing
Deliverables:
- Hybrid query system operational
- All MCP tools using optimal DB for each query
- Performance benchmarks met (<500ms P95)
- Documentation and training materials
Effort: 80-100 hours (2 weeks, 1 developer)
Risk: Low - Integration testing
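The `CacheManager` idea from the hybrid engine above can be sketched as a small TTL cache keyed by query string; a production version would also bound the entry count (LRU) and invalidate on re-ingestion. Class and method names are illustrative:

```javascript
// Sketch only: TTL cache for hybrid query results. Repeated queries
// within the TTL skip both ChromaDB and Neo4j entirely.
class QueryCache {
  constructor(ttlMs = 60000) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  get(key, now = Date.now()) {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= now) {
      this.entries.delete(key); // evict stale entry lazily
      return undefined;
    }
    return entry.value;
  }
  set(key, value, now = Date.now()) {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

Injecting `now` as a parameter (defaulting to `Date.now()`) keeps expiry logic deterministic under test.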
Resource Requirements
Storage
Neo4j Database:
- Nodes: ~100,000 (components, functions, errors)
- Relationships: ~500,000 (calls, depends, fixes)
- Estimated size: 5-7 GB
- Growth rate: +500 MB/month
Total with ChromaDB: 15-17 GB of 25 GB available ✅
Compute
Neo4j Memory: 2-4 GB RAM recommended
Current system: 64 GB RAM ✅ (plenty of headroom)
CPU: Negligible (<5% typical usage)
Developer Time
Phase 0 (POC): 2 days
Phase 1 (Core): 2 weeks
Phase 2 (Code): 2 weeks
Phase 3 (Errors): 1.5 weeks
Phase 4 (Optimize): 2 weeks
─────────────────────────────
Total: 8 weeks (1 developer, full-time)
or 16 weeks (1 developer, 50% time)
Success Metrics
Technical Performance
- Query latency: <500ms (P95)
- Graph size: 100K nodes, 500K relationships
- Update frequency: Daily incremental
- Uptime: 99.9%
User Impact
- Error diagnosis time: <15 minutes (from hours)
- Answer accuracy: >90% correct root cause
- Developer satisfaction: 8+/10 rating
- Query coverage: Handle 95% of structural questions
Research Value
- Novel methodology: Publishable in software engineering venues
- Institutional knowledge: Captured and queryable
- Onboarding: New developers productive in days, not months
Risk Mitigation
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Neo4j performance issues | Low | Medium | POC validates before commitment |
| Parser failures (Fortran) | Medium | Medium | Use tree-sitter, fallback to regex |
| Integration complexity | Low | High | Phased approach, test each phase |
| Developer time unavailable | Medium | High | Extend timeline, reduce scope |
| Graph becomes unmaintainable | Low | High | Automated updates, schema versioning |
Exit Criteria
Abort if:
- POC fails to demonstrate value (Phase 0)
- Phase 1 takes >3 weeks (scope too large)
- Query performance unacceptable after optimization
- Maintenance burden exceeds 4 hours/week
Continue if:
- POC impresses team (visualizations, query results)
- Phases stay on schedule (±20%)
- Developers report significant time savings
- Questions previously impossible now answered
Decision Point: End of Phase 0
Review criteria:
- Can Neo4j answer 3+ structural questions ChromaDB cannot? YES/NO
- Is the dependency graph visually impressive and useful? YES/NO
- Do queries return results in <1 second? YES/NO
- Is the team excited to continue? YES/NO
Proceed to Phase 1 if: 3+ YES
Abort if: 2+ NO
Integration with Existing Work
How This Enhances Current Architecture
Enhanced Ingestion Pipeline (Current)
  ├── Documentation → ChromaDB ✅
  ├── Source Code → ChromaDB ✅
  ├── Error Logs → ChromaDB ✅
+ ├── Source Code → Neo4j (structure) 🆕
+ ├── Dependencies → Neo4j (graph) 🆕
+ ├── Error Links → Neo4j (relationships) 🆕
  ├── GitHub Data → ChromaDB ✅
+ └── GitHub Data → Neo4j (developer graph) 🆕
MCP Tools (Enhanced)
  ├── search_documentation (ChromaDB) ✅
  ├── search_code (ChromaDB) ✅
  ├── analyze_error (ChromaDB + Neo4j) 🆕
+ ├── analyze_dependencies (Neo4j) 🆕
+ ├── impact_analysis (Neo4j) 🆕
+ ├── find_similar_code (ChromaDB + Neo4j) 🆕
+ └── trace_call_chain (Neo4j) 🆕
No Disruption to Existing Work
- ChromaDB ingestion continues unchanged
- LangFlow workflows remain functional
- New Neo4j tools added alongside existing ones
- Gradual migration to hybrid queries
- Fallback to ChromaDB-only if Neo4j unavailable
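The graceful-fallback rule in the last bullet can be made explicit in the `QueryRouter`: structural questions go to Neo4j when it is reachable, and degrade to ChromaDB-only semantic search otherwise, so existing tools keep working. A minimal sketch (function name and query-type labels are illustrative):

```javascript
// Sketch only: routing rule for the hybrid query engine.
// Structural query types prefer Neo4j; everything else, and any
// structural query while Neo4j is down, goes to ChromaDB.
function routeQuery(queryType, neo4jAvailable) {
  const structural = new Set(['dependency', 'impact', 'call_chain']);
  if (structural.has(queryType)) {
    return neo4jAvailable ? 'neo4j' : 'chromadb'; // degraded but functional
  }
  return 'chromadb'; // semantic queries always go to vectors
}
```

Because the fallback returns a valid (if weaker) answer rather than an error, Neo4j can be taken down for maintenance without breaking the MCP tool surface.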
Next Actions (Updated)
Immediate (This Week)
- ✅ Decision: Approve Neo4j integration concept
- 🔲 Bootstrap: Add Neo4j to docker-compose.yml
- 🔲 POC Weekend: Block 2 days for Phase 0 implementation
- 🔲 Team Review: Present POC results, decide on Phase 1
Short-term (Weeks 1-2)
- 🔲 Phase 1 Start: If POC approved
- 🔲 Parallel Work: Continue ChromaDB ingestion (doesn't block)
- 🔲 Documentation: Update architecture diagrams
Medium-term (Weeks 3-8)
- 🔲 Phases 2-4: Execute according to plan
- 🔲 Weekly Reviews: Track progress, adjust timeline
- 🔲 Integration Testing: Validate hybrid queries
Status: Architecture Extended with Neo4j Graph Database Strategy
Decision: Proceed with Phase 0 Proof of Concept (2 days)
Priority: HIGH - Enables structural queries impossible with vectors alone
Updated: 2025-10-15 15:30 UTC
Next Milestone: Phase 0 POC completion and team review