SYSTEM_ARCHITECTURE - pascaldisse/open-sourcefy GitHub Wiki
Open-Sourcefy System Architecture
Overview
Open-Sourcefy is a production-grade AI-powered binary decompilation system that reconstructs compilable C source code from Windows PE executables using a 17-agent Matrix pipeline with Ghidra integration.
Core Principles
- STRICT MODE ONLY: No fallbacks, no alternatives, no graceful degradation
- WINDOWS EXCLUSIVE: Windows PE executables with Visual Studio/MSBuild compilation
- ZERO TOLERANCE: Fail fast when tools are missing - never degrade gracefully
- PRODUCTION READY: NSA-level security, >90% test coverage, SOLID principles
System Components
1. Matrix Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│                   MATRIX PIPELINE (17 AGENTS)                   │
├─────────────────────────────────────────────────────────────────┤
│ Agent 0: Deus Ex Machina (Master Orchestrator)                  │
├─────────────────────────────────────────────────────────────────┤
│ FOUNDATION AGENTS (1-4):                                        │
│ • Agent 1: Sentinel - Binary Analysis & Import Table Recovery   │
│ • Agent 2: Architect - PE Structure & Resource Extraction       │
│ • Agent 3: Merovingian - Advanced Analysis                      │
│ • Agent 4: Agent Smith - Code Flow Analysis                     │
├─────────────────────────────────────────────────────────────────┤
│ ADVANCED ANALYSIS AGENTS (5-8):                                 │
│ • Agent 5: Neo - Advanced Decompiler                            │
│ • Agent 6: Trainman - Assembly Analysis                         │
│ • Agent 7: Keymaker - Resource Reconstruction                   │
│ • Agent 8: Commander Locke - Build System Integration           │
├─────────────────────────────────────────────────────────────────┤
│ RECONSTRUCTION AGENTS (9-12):                                   │
│ • Agent 9: The Machine - Resource Compilation                   │
│ • Agent 10: Twins - Binary Diff & Validation                    │
│ • Agent 11: Oracle - Semantic Analysis                          │
│ • Agent 12: Link - Code Integration                             │
├─────────────────────────────────────────────────────────────────┤
│ FINAL PROCESSING AGENTS (13-16):                                │
│ • Agent 13: Agent Johnson - Quality Assurance                   │
│ • Agent 14: Cleaner - Code Cleanup                              │
│ • Agent 15: Analyst - Final Validation                          │
│ • Agent 16: Agent Brown - Output Generation                     │
└─────────────────────────────────────────────────────────────────┘
2. Core System Components
src/core/
├── agents/ # 17 Matrix Agents (0-16)
├── matrix_pipeline_orchestrator.py # Master Pipeline Controller
├── matrix_agents_v2.py # Agent Framework & Base Classes
├── config_manager.py # Configuration Management
├── build_system_manager.py # VS2022 Build Integration
├── shared_components.py # Shared Agent Components
├── ghidra_processor.py # Ghidra 11.0.3 Integration
├── ai_system.py # AI Engine Interface
└── exceptions.py # Error Handling System
3. Data Flow Architecture
INPUT  → GHIDRA → MATRIX PIPELINE → BUILD SYSTEM → VALIDATION → OUTPUT
↓        ↓        ↓                 ↓              ↓            ↓
PE.EXE → C CODE → RESOURCES       → MSBuild      → TESTS      → RECONSTRUCTED.EXE
Agent Specifications
Agent 0: Deus Ex Machina (Master Orchestrator)
- Purpose: Master control and coordination
- Input: Target PE executable
- Output: Orchestrated pipeline execution
- Critical Functions:
  - Pipeline initialization and coordination
  - Agent dependency management
  - Error propagation and recovery
  - Quality gate enforcement
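The dependency management and coordination duties listed above could be sketched roughly as follows. This is a minimal illustration, not the project's orchestrator: `AGENT_DEPENDENCIES`, `execution_order`, and `run_pipeline` are hypothetical names, and the dependency edges are assumptions loosely inferred from the agent inputs described on this page.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: agent id -> ids of agents it depends on.
# Only the first few agents are shown; the real pipeline may differ.
AGENT_DEPENDENCIES = {
    1: set(),    # Sentinel needs only the input binary
    2: {1},      # Architect consumes Sentinel output
    3: {2},      # Merovingian consumes PE structure data
    4: {2},      # Agent Smith consumes structure data
    5: {3, 4},   # Neo consumes control flow and pattern data
}


def execution_order(dependencies):
    """Return agent ids in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, agents):
    """Run agents in dependency order.

    Any exception raised by an agent propagates immediately, which
    matches the strict, fail-fast mode described in this document.
    """
    results = {}
    for agent_id in execution_order(dependencies):
        # Each agent is a callable that receives all earlier results.
        results[agent_id] = agents[agent_id](results)
    return results
```

`TopologicalSorter` also raises `graphlib.CycleError` on circular dependencies, which gives a cheap sanity check on the dependency map before any agent runs.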
Foundation Agents (1-4)
Agent 1: Sentinel
- Purpose: Binary analysis and import table recovery
- Critical Issue: Import table mismatch (538→5 DLLs)
- Input: PE executable
- Output: Import table, function signatures, DLL dependencies
- Key Functions:
  - PE header analysis
  - Import table reconstruction
  - MFC 7.1 signature detection
  - Ordinal resolution
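As an illustration of the PE header analysis step, the sketch below reads the DOS header, follows `e_lfanew` to the PE signature, and decodes the 20-byte COFF file header. It uses only the standard library and the published PE layout; the function name `parse_pe_header` and the returned dictionary keys are hypothetical, and this is not Sentinel's actual implementation.

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Decode the DOS stub pointer and the COFF file header of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")

    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # The COFF file header (20 bytes) follows the 4-byte signature:
    # Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable,
    # NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
    (machine, num_sections, timestamp, _sym_ptr, _num_syms,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)

    return {
        "machine": machine,
        "sections": num_sections,
        "timestamp": timestamp,
        "optional_header_size": opt_header_size,
        "characteristics": characteristics,
    }
```

Import table reconstruction would continue from here by walking the optional header's data directories, which this sketch deliberately leaves out.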
Agent 2: Architect
- Purpose: PE structure and resource extraction
- Input: PE executable, Sentinel output
- Output: Resources, structure analysis
- Key Functions:
  - Resource section extraction
  - Icon/bitmap extraction
  - Version info recovery
  - Manifest processing
Agent 3: Merovingian
- Purpose: Advanced analysis and pattern recognition
- Input: PE structure, binary data
- Output: Code patterns, algorithms
- Key Functions:
  - Algorithm identification
  - Code pattern analysis
  - Obfuscation detection
  - Compiler fingerprinting
Agent 4: Agent Smith
- Purpose: Code flow analysis
- Input: Disassembly, structure data
- Output: Control flow graphs, function boundaries
- Key Functions:
  - Control flow reconstruction
  - Function identification
  - Call graph generation
  - Dead code elimination
Advanced Analysis Agents (5-8)
Agent 5: Neo
- Purpose: Advanced decompilation
- Input: Binary code, control flows
- Output: C source code (readable main)
- Key Functions:
  - High-level C reconstruction
  - Variable type inference
  - Function signature recovery
  - Meaningful name generation
Agent 6: Trainman
- Purpose: Assembly analysis
- Input: Raw assembly
- Output: Assembly annotations, optimizations
- Key Functions:
  - Instruction pattern analysis
  - Optimization detection
  - Register usage analysis
  - Stack frame reconstruction
Agent 7: Keymaker
- Purpose: Resource reconstruction
- Input: Extracted resources
- Output: RC files, resource headers
- Key Functions:
  - RC file generation
  - Resource compilation
  - String table reconstruction
  - Icon/bitmap integration
Agent 8: Commander Locke
- Purpose: Build system integration
- Input: Source code, resources
- Output: VS project files, build configuration
- Key Functions:
  - VS2022 project generation
  - MSBuild configuration
  - Dependency management
  - Compilation orchestration
Reconstruction Agents (9-12)
Agent 9: The Machine
- Purpose: Resource compilation
- Input: RC files, resources
- Output: Compiled resource files (.res)
- Key Functions:
  - RC.EXE compilation
  - Resource linking
  - Binary resource generation
  - MFC 7.1 compatibility
Agent 10: Twins
- Purpose: Binary diff and validation
- Input: Original binary, reconstructed binary
- Output: Diff analysis, validation report
- Key Functions:
  - Binary comparison
  - Functionality validation
  - Import table verification
  - Size/structure analysis
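The binary comparison and import table verification steps might look something like the sketch below. It assumes the DLL lists for both binaries have already been extracted elsewhere; `DiffReport` and `compare_binaries` are hypothetical names, not the Twins agent's real interface.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DiffReport:
    identical: bool                 # True when both files hash the same
    size_delta: int                 # rebuilt size minus original size
    missing_dlls: set = field(default_factory=set)


def compare_binaries(original: bytes, rebuilt: bytes,
                     original_dlls, rebuilt_dlls) -> DiffReport:
    """Compare two binaries by hash, size, and imported DLL names."""
    identical = (hashlib.sha256(original).digest()
                 == hashlib.sha256(rebuilt).digest())

    # DLL names are case-insensitive on Windows, so normalize first.
    wanted = {name.lower() for name in original_dlls}
    found = {name.lower() for name in rebuilt_dlls}

    return DiffReport(
        identical=identical,
        size_delta=len(rebuilt) - len(original),
        missing_dlls=wanted - found,
    )
```

A non-empty `missing_dlls` set is the kind of signal that would surface an import table mismatch before functional validation runs.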
Agent 11: Oracle
- Purpose: Semantic analysis
- Input: Source code, binary behavior
- Output: Semantic annotations, optimizations
- Key Functions:
  - Semantic code analysis
  - Behavior verification
  - Logic optimization
  - Code quality assessment
Agent 12: Link
- Purpose: Code integration
- Input: Multiple code components
- Output: Integrated source code
- Key Functions:
  - Component integration
  - Dependency resolution
  - Code merging
  - Final assembly
Final Processing Agents (13-16)
Agent 13: Agent Johnson
- Purpose: Quality assurance
- Input: Integrated code
- Output: QA report, compliance verification
- Key Functions:
  - Code quality validation
  - Standards compliance
  - Security assessment
  - Performance analysis
Agent 14: Cleaner
- Purpose: Code cleanup
- Input: Raw generated code
- Output: Clean, formatted code
- Key Functions:
  - Code formatting
  - Comment generation
  - Dead code removal
  - Style normalization
Agent 15: Analyst
- Purpose: Final validation
- Input: Clean code, resources
- Output: Final validation report
- Key Functions:
  - Comprehensive testing
  - Regression validation
  - Performance benchmarking
  - Success rate analysis
Agent 16: Agent Brown
- Purpose: Output generation
- Input: Validated code and resources
- Output: Final deliverables
- Key Functions:
  - Final package generation
  - Documentation creation
  - Archive preparation
  - Deployment packaging
Build System Integration
Visual Studio 2022 Preview (EXCLUSIVE)
- Compiler: cl.exe (configured paths only)
- MSBuild: MSBuild.exe (no fallbacks)
- SDK: Windows SDK (required)
- No Alternatives: Single build path, strict validation
Resource Compilation Pipeline
RC Files → RC.EXE → .RES Files → LINK.EXE → Final Binary
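The two tool invocations in this pipeline could be assembled as argument lists along the following lines. This is only a sketch: `rc_command` and `link_command` are hypothetical helpers, the paths in the usage note are placeholders, and real builds would typically pass many more linker options. The `/fo` (resource compiler output file) and `/OUT:` (linker output file) options are the standard ones for these tools.

```python
def rc_command(rc_exe: str, rc_file: str, res_file: str) -> list:
    """Build the RC.EXE invocation that compiles an .rc file to .res."""
    # /fo names the compiled resource output file.
    return [rc_exe, "/fo", res_file, rc_file]


def link_command(link_exe: str, objects: list, res_file: str,
                 out_exe: str) -> list:
    """Build the LINK.EXE invocation that links objects with the .res file."""
    # The linker accepts compiled .res files alongside object files.
    return [link_exe, f"/OUT:{out_exe}", *objects, res_file]
```

In strict mode each list would be handed to `subprocess.run(cmd, check=True)` so that a non-zero exit code from either tool aborts the pipeline, for example `rc_command(r"C:\tools\rc.exe", "app.rc", "app.res")` with a configured, absolute tool path.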
Configuration Management
Centralized Configuration
- config.yaml: Main configuration
- build_config.yaml: Build system paths
- Environment validation on startup
- No hardcoded values allowed
Path Management
- Absolute paths only
- No relative path alternatives
- Strict path validation
- Configured tools only
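A strict path check in the spirit of these rules might be sketched as below. The key names in `REQUIRED_KEYS` and the function `validate_build_paths` are assumptions made for illustration; the actual keys in `build_config.yaml` are not documented here. `PureWindowsPath` is used so the absolute-path test follows Windows rules regardless of the host OS.

```python
from pathlib import PureWindowsPath

# Hypothetical configuration keys; the real file may use other names.
REQUIRED_KEYS = ("cl_exe", "msbuild_exe", "rc_exe")


def validate_build_paths(config: dict) -> dict:
    """Return validated tool paths or raise ValueError on the first problem."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing configuration keys: {missing}")

    validated = {}
    for key in REQUIRED_KEYS:
        path = PureWindowsPath(config[key])
        # Relative paths are rejected outright; there is no fallback lookup.
        if not path.is_absolute():
            raise ValueError(f"{key} must be an absolute path: {config[key]!r}")
        validated[key] = path
    return validated
```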
Error Handling
Fail-Fast Philosophy
- Immediate failure on missing tools
- No graceful degradation
- No alternative code paths
- Strict prerequisite validation
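The fail-fast behaviour described above can be illustrated with a small prerequisite check that runs before any agent starts. `MissingToolError` and `require_tools` are hypothetical names used for this sketch, not identifiers taken from `exceptions.py`.

```python
import os


class MissingToolError(RuntimeError):
    """Raised when a required external tool is not present on disk."""


def require_tools(tool_paths: dict) -> None:
    """Raise immediately if any configured tool path is not an existing file.

    There is deliberately no search of PATH and no alternative tool:
    either every configured path exists, or the pipeline does not start.
    """
    missing = {name: path for name, path in tool_paths.items()
               if not os.path.isfile(path)}
    if missing:
        details = ", ".join(f"{name} ({path})"
                            for name, path in sorted(missing.items()))
        raise MissingToolError(f"required tools not found: {details}")
```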
Error Categories
- FATAL: Missing required tools/dependencies
- CRITICAL: Agent execution failures
- WARNING: Quality threshold violations
- INFO: Progress and status updates
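One way to encode these four categories is an ordered enum plus exception classes that carry a severity. Everything below is a hypothetical sketch: the class names are invented for illustration, and the rule that CRITICAL and FATAL abort the run while WARNING does not is an assumption, since this page does not state how warnings are handled.

```python
import enum


class Severity(enum.IntEnum):
    INFO = 10       # progress and status updates
    WARNING = 20    # quality threshold violations
    CRITICAL = 30   # agent execution failures
    FATAL = 40      # missing required tools/dependencies


class PipelineError(Exception):
    severity = Severity.CRITICAL


class MissingDependencyError(PipelineError):
    severity = Severity.FATAL


class AgentExecutionError(PipelineError):
    severity = Severity.CRITICAL


class QualityThresholdViolation(PipelineError):
    severity = Severity.WARNING


def should_abort(error: PipelineError) -> bool:
    """Assumed policy: anything at CRITICAL or above stops the pipeline."""
    return error.severity >= Severity.CRITICAL
```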
Quality Assurance
Testing Strategy
- Unit Tests: >90% coverage requirement
- Integration Tests: Pipeline validation
- Regression Tests: Binary comparison
- Performance Tests: Execution time benchmarks
Validation Criteria
- Binary functionality match
- Import table completeness
- Resource integrity
- Compilation success
Security Architecture
NSA-Level Security
- Zero hardcoded credentials
- Secure temporary file handling
- Memory cleanup procedures
- Access control validation
Threat Model
- Malicious binary protection
- Code injection prevention
- Resource manipulation detection
- Build system isolation
Performance Optimization
Parallel Execution
- Agent-level parallelization
- Resource compilation optimization
- I/O operation batching
- Memory usage optimization
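Agent-level parallelization could be approached by running a batch of mutually independent agents on a thread pool, as in the sketch below. `run_batch` is a hypothetical helper; which agents are actually independent, and whether threads or processes are used, are assumptions not confirmed by this document.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(agents: dict, context: dict, max_workers: int = 4) -> dict:
    """Run independent agents concurrently and collect their results.

    `agents` maps an agent id to a callable taking the shared context.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {agent_id: pool.submit(agent, context)
                   for agent_id, agent in agents.items()}
        # result() re-raises any exception from the agent, so a single
        # failure still fails the whole batch.
        return {agent_id: future.result()
                for agent_id, future in futures.items()}
```

Threads suit agents that mostly wait on external tools or disk I/O; CPU-bound agents would need a process pool instead.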
Scalability
- Agent isolation
- Resource pooling
- Caching strategies
- Load balancing
Monitoring & Observability
Logging Framework
- Structured logging
- Agent-specific logs
- Performance metrics
- Error tracking
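A minimal sketch of structured, agent-specific logging with the standard `logging` module is shown below. The logger naming scheme, the `JsonFormatter` class, the `agent_logger` helper, and the `duration_s` field are all hypothetical choices for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional performance metric attached via the `extra` argument.
        if hasattr(record, "duration_s"):
            payload["duration_s"] = record.duration_s
        return json.dumps(payload)


def agent_logger(agent_id: int, name: str, stream=None) -> logging.Logger:
    """Create a dedicated logger for one agent (stderr when stream is None)."""
    logger = logging.getLogger(f"matrix.agent{agent_id:02d}.{name}")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]     # replace handlers to avoid duplicates
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `agent_logger(1, "sentinel").info("import table recovered", extra={"duration_s": 1.5})` would then emit a single JSON line carrying both the message and the timing.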
Metrics Collection
- Pipeline success rates
- Agent execution times
- Resource usage
- Quality scores
Deployment Architecture
Production Environment
- Windows Server 2022
- Visual Studio 2022 Preview
- Ghidra 11.0.3
- Python 3.11+
Container Support
- Windows containers only
- VS Build Tools integration
- Ghidra headless mode
- Resource compilation support
Known Issues & Solutions
Import Table Mismatch (PRIMARY BOTTLENECK)
- Issue: 538→5 DLL reduction, 64.3% discrepancy
- Impact: 25% validation failure
- Solution: Agent 9 data flow repair, MFC 7.1 integration
- Expected: Success rate improvement from 60% to 85%
MFC 7.1 Compatibility
- Issue: VS2022 incompatible with MFC 7.1
- Solution: Alternative build approach research
- Status: Implementation ready
Maintenance & Updates
Version Control
- Git-based workflow
- Branch protection rules
- Mandatory code review
- Automated testing
Documentation Standards
- Architecture documentation
- Agent specifications
- API documentation
- Deployment guides
Future Enhancements
Planned Features
- Multi-compiler support research
- Advanced obfuscation handling
- Machine learning integration
- Cloud deployment options
Research Areas
- Binary similarity analysis
- Advanced packing detection
- Automated testing generation
- Performance optimization