Architecture Overview - pascaldisse/open-sourcefy GitHub Wiki

Architecture Overview

Open-Sourcefy implements a 17-agent Matrix pipeline (Agent 0 as master orchestrator plus Agents 1 through 16) for binary decompilation and source code reconstruction.

Core Philosophy

The Matrix Framework

The system is organized around a Matrix metaphor: each agent is named after a character from the Matrix universe and has its own capabilities and responsibilities in the decompilation process.

Design Principles

  • Master-First Execution: Agent 0 (Deus Ex Machina) orchestrates all operations
  • Dependency-Based Batching: Agents execute in carefully ordered batches based on data dependencies
  • Fail-Fast Validation: Immediate termination on missing requirements or validation failures
  • NSA-Level Security: Zero tolerance for vulnerabilities throughout the pipeline

Agent Pipeline Flow

Phase 1: Master Orchestration

Agent 0: Deus Ex Machina (Master Orchestrator)
├── Pipeline coordination and resource allocation
├── Agent dependency resolution and execution ordering
├── Quality gate enforcement and validation checkpoints
└── Error handling and recovery coordination

Phase 2: Foundation Analysis

Agent 1: Sentinel (Binary Discovery & Security Scanning)
├── Binary format detection and validation
├── Import/export table analysis (538+ functions)
├── Security scanning and threat assessment
└── Metadata extraction and cataloging
         ↓
Parallel Batch 1: Agents 2, 3, 4
├── Agent 2: The Architect (Architecture Analysis)
│   ├── Compiler detection and optimization analysis
│   ├── ABI and calling convention identification
│   └── Build system recognition
├── Agent 3: The Merovingian (Basic Decompilation)
│   ├── Function identification and signature analysis
│   ├── Assembly instruction analysis
│   └── Basic code pattern recognition
└── Agent 4: Agent Smith (Binary Structure Analysis)
    ├── Data structure identification
    ├── Resource extraction and cataloging
    └── Dynamic analysis instrumentation

Phase 3: Advanced Analysis

Parallel Batch 2: Agents 5, 6, 7, 8
├── Agent 5: Neo (Advanced Decompilation with Ghidra)
│   ├── Headless Ghidra integration
│   ├── Advanced function recovery
│   └── Type inference and data structure recovery
├── Agent 6: The Twins (Binary Differential Analysis)
│   ├── Binary comparison and validation
│   ├── Version analysis and change detection
│   └── Integrity verification
├── Agent 7: The Trainman (Advanced Assembly Analysis)
│   ├── Optimization pattern detection
│   ├── Compiler-specific analysis
│   └── Performance characteristic analysis
└── Agent 8: The Keymaker (Resource Reconstruction)
    ├── Icon, dialog, and string resource extraction
    ├── Resource compilation and linking
    └── Asset reconstruction and validation

Phase 4: Reconstruction & Compilation

Parallel Batch 3: Agents 9, 12, 13
├── Agent 9: Commander Locke (Global Reconstruction)
│   ├── Complete source code generation
│   ├── Build system integration (MSBuild/CMake)
│   └── Compilation orchestration (4.3MB outputs)
├── Agent 12: The Machine (Compilation Orchestration)
│   ├── VS2022 Preview integration
│   ├── Dependency resolution and linking
│   └── Build validation and testing
└── Agent 13: The Oracle (Final Validation)
    ├── Semantic analysis and validation
    ├── Quality assessment and scoring
    └── Compliance verification

Phase 5: Quality Assurance

Sequential Processing: Agents 10, 11
Agent 10 → Agent 11 (Cross-reference and linking)
├── Function cross-referencing
├── Symbol resolution and validation
└── Inter-module dependency analysis
         ↓
Final Batch: Agents 14, 15, 16
├── Agent 14: Agent Johnson (Security Analysis)
│   ├── Security vulnerability assessment
│   ├── Code quality analysis
│   └── Compliance validation
├── Agent 15: The Cleaner (Code Cleanup)
│   ├── Code formatting and standardization
│   ├── Comment generation and documentation
│   └── Final code polishing
└── Agent 16: The Analyst (Final Intelligence)
    ├── Comprehensive metadata synthesis
    ├── Quality reporting and documentation
    └── Pipeline success validation
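
The batch ordering above can be derived from a dependency graph. The following is a minimal sketch, not the orchestrator's actual code: the `DEPENDENCIES` map and the `plan_batches` helper are hypothetical names introduced for illustration, and the real agent dependencies may differ.

```python
from typing import Dict, List, Set

# Hypothetical dependency map (agent id -> ids it depends on).
# Illustration only; the project's real dependency table may differ.
DEPENDENCIES: Dict[int, Set[int]] = {
    1: set(),
    2: {1}, 3: {1}, 4: {1},
    5: {2, 3}, 6: {3, 4},
}


def plan_batches(deps: Dict[int, Set[int]]) -> List[List[int]]:
    """Group agents into batches whose dependencies are met by earlier batches."""
    remaining = {agent: set(reqs) for agent, reqs in deps.items()}
    done: Set[int] = set()
    batches: List[List[int]] = []
    while remaining:
        # An agent is ready once every agent it depends on has completed.
        ready = sorted(a for a, reqs in remaining.items() if reqs <= done)
        if not ready:
            # Nothing can run: a cycle or a dependency on an unknown agent.
            raise ValueError(f"Unresolvable dependencies: {sorted(remaining)}")
        batches.append(ready)
        done.update(ready)
        for agent in ready:
            del remaining[agent]
    return batches


batches = plan_batches(DEPENDENCIES)
```

With this sample map, Agent 1 runs alone, Agents 2, 3, and 4 form the next batch, and Agents 5 and 6 form the last one. Agents inside one batch have no dependencies on each other, so they are candidates for parallel execution.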

Technical Architecture

Core Framework Components

Matrix Pipeline Orchestrator

File: src/core/matrix_pipeline_orchestrator.py

  • Responsibility: Master coordination of all agents
  • Features: Dependency resolution, parallel execution, error handling
  • Status: ✅ Production-ready (1,004 lines)

Agent Base Framework

File: src/core/shared_components.py

  • Responsibility: Common agent functionality and interfaces
  • Features: AgentResult handling, validation, logging
  • Status: ✅ Production-ready with comprehensive utilities

Configuration Management

File: src/core/config_manager.py

  • Responsibility: System configuration and environment management
  • Features: YAML configuration, environment validation
  • Status: ✅ Operational with build_config.yaml integration

Agent Implementation Pattern

Each agent follows a consistent implementation pattern:

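from typing import Any, Dict

# Template only: X and CHARACTER_NAME are placeholders for the agent number
# and its character. ReconstructionAgent and MatrixCharacter are provided by
# the project's agent framework.
class AgentX_MatrixCharacter(ReconstructionAgent):
    def __init__(self):
        super().__init__(
            agent_id=X,
            matrix_character=MatrixCharacter.CHARACTER_NAME
        )

    def execute_matrix_task(self, context: Dict[str, Any]) -> Dict[str, Any]:
        # Agent-specific implementation; returns this agent's result data
        pass

    def _validate_prerequisites(self, context: Dict[str, Any]) -> None:
        # Dependency validation; raise if a required input is missing
        pass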
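
To make the pattern concrete, here is a minimal sketch of one filled-in agent. The `ReconstructionAgent` and `MatrixCharacter` definitions below are simplified stand-ins written for this example, not the project's real classes, and the `binary_path` context key and the `has_mz_header` result field are assumptions.

```python
from enum import Enum
from typing import Any, Dict


class MatrixCharacter(Enum):
    """Stand-in enum for this example; the real one lives in the project."""
    SENTINEL = "sentinel"


class ReconstructionAgent:
    """Stand-in base class for this example; the real one does much more."""

    def __init__(self, agent_id: int, matrix_character: MatrixCharacter):
        self.agent_id = agent_id
        self.matrix_character = matrix_character


class Agent1_Sentinel(ReconstructionAgent):
    def __init__(self):
        super().__init__(agent_id=1, matrix_character=MatrixCharacter.SENTINEL)

    def execute_matrix_task(self, context: Dict[str, Any]) -> Dict[str, Any]:
        self._validate_prerequisites(context)
        # Read the first two bytes and check for the "MZ" signature that
        # starts DOS and PE executables.
        with open(context["binary_path"], "rb") as handle:
            magic = handle.read(2)
        return {"has_mz_header": magic == b"MZ"}

    def _validate_prerequisites(self, context: Dict[str, Any]) -> None:
        # Fail fast when a required input is missing (assumed key name).
        if "binary_path" not in context:
            raise ValueError("binary_path missing from context")
```

Calling `_validate_prerequisites` at the top of `execute_matrix_task` is one possible way to apply the fail-fast principle; the real framework may invoke it from the base class instead.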

Data Flow Architecture

Context Dictionary

Agents communicate through a shared context dictionary containing:

  • Binary path: Target binary for analysis
  • Agent results: Output from completed agents
  • Shared memory: Cross-agent data storage
  • Configuration: Runtime settings and parameters
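
As a rough illustration of the four entries listed above, the context could be shaped like the sketch below. The key names (`binary_path`, `agent_results`, `shared_memory`, `config`), the `build_context` helper, and the sample values are all assumptions made for this example, not the project's confirmed schema.

```python
from typing import Any, Dict


def build_context(binary_path: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """Create an empty pipeline context (hypothetical key names)."""
    return {
        "binary_path": binary_path,   # target binary for analysis
        "agent_results": {},          # agent id -> output of completed agents
        "shared_memory": {},          # cross-agent data storage
        "config": dict(config),       # copy of runtime settings
    }


# Sample values for illustration only.
context = build_context("input/example.exe", {"timeout_seconds": 1800})

# A completed agent records its output under its own id.
context["agent_results"][1] = {"has_mz_header": True}
```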

AgentResult Objects

AgentResult(
    agent_id=int,
    status=AgentStatus,
    data=Dict[str, Any],
    agent_name=str,
    matrix_character=str
)
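
The field list above maps naturally onto a dataclass. This is a sketch under assumptions: the field names and types come from the listing, but the `AgentStatus` members, the default values, and the sample data are invented for illustration and may not match `src/core/shared_components.py`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class AgentStatus(Enum):
    """Hypothetical status values; the real enum may define others."""
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class AgentResult:
    agent_id: int
    status: AgentStatus
    # default_factory avoids sharing one dict between instances.
    data: Dict[str, Any] = field(default_factory=dict)
    agent_name: str = ""
    matrix_character: str = ""


result = AgentResult(
    agent_id=1,
    status=AgentStatus.SUCCESS,
    data={"import_count": 538},
    agent_name="Sentinel",
    matrix_character="sentinel",
)
```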

Output Structure

output/{binary_name}/{timestamp}/
├── agents/          # Individual agent outputs
├── ghidra/          # Ghidra decompilation results
├── compilation/     # Generated source and build files
├── reports/         # Pipeline execution reports
└── logs/            # Detailed execution logs
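
The layout above could be created with `pathlib` as in the following sketch. The `create_output_tree` helper and the `SUBDIRS` constant are hypothetical names, and the timestamp format is left to the caller because the source does not specify it.

```python
from pathlib import Path
from typing import Dict

# Subdirectory names taken from the layout above.
SUBDIRS = ("agents", "ghidra", "compilation", "reports", "logs")


def create_output_tree(root: Path, binary_name: str, timestamp: str) -> Dict[str, Path]:
    """Create <root>/<binary_name>/<timestamp>/ with its subdirectories."""
    base = root / binary_name / timestamp
    paths = {name: base / name for name in SUBDIRS}
    for path in paths.values():
        # parents=True creates the binary and timestamp levels as needed.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

For example, `create_output_tree(Path("output"), "example", "20250101-120000")` would create `output/example/20250101-120000/agents` and the four sibling directories, where the binary name and timestamp are placeholder values.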

Quality Assurance Framework

Validation Checkpoints

  • Agent Prerequisites: Dependency validation before execution
  • Output Validation: Schema and content validation after execution
  • Quality Thresholds: Minimum quality scores for pipeline progression
  • Compilation Testing: Generated code compilation verification
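
A quality-threshold checkpoint can be sketched as a small gate function. The names `QualityGateError` and `enforce_quality_gate`, the 0.0 to 1.0 score scale, and the default threshold of 0.75 are assumptions for illustration; the source does not state the real thresholds.

```python
class QualityGateError(RuntimeError):
    """Raised when an agent's quality score is below the required minimum."""


def enforce_quality_gate(agent_id: int, score: float, threshold: float = 0.75) -> None:
    """Stop the pipeline if score is below threshold (assumed 0.0 to 1.0 scale)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    if score < threshold:
        raise QualityGateError(
            f"Agent {agent_id} scored {score:.2f}, below threshold {threshold:.2f}"
        )
```

Raising an exception instead of returning a flag keeps the gate consistent with the fail-fast principle: the caller cannot silently ignore a failed checkpoint.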

Error Handling Strategy

  • Fail-Fast: Immediate termination on critical errors
  • Graceful Degradation: Conditional features based on available tools
  • Comprehensive Logging: Full execution tracing for debugging
  • Recovery Mechanisms: Automatic retry for transient failures
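
Fail-fast termination and automatic retry can coexist if errors are split into two classes. The sketch below assumes hypothetical `TransientError` and `CriticalError` exception types and a hypothetical `run_with_retry` helper; the project's actual error taxonomy and retry policy are not given in the source.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (for example, a busy tool)."""


class CriticalError(Exception):
    """Hypothetical: a failure that must stop the pipeline immediately."""


def run_with_retry(task: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Run task, retrying only on TransientError; other errors propagate at once."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: surface the failure
            time.sleep(delay * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")
```

Because only `TransientError` is caught, a `CriticalError` raised by the task leaves the function on the first attempt, which preserves fail-fast behavior.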

Performance Metrics

  • Pipeline Success Rate: 100% (16/16 agents operational)
  • Execution Time: <30 minutes for typical binaries
  • Memory Usage: Optimized for 16GB+ systems
  • Output Quality: 83.36% size accuracy for binary reconstruction

Integration Points

External Tool Integration

  • Ghidra: Headless decompilation engine integration
  • Visual Studio 2022 Preview: Compilation and build system
  • Windows SDK: Resource compilation and linking tools
  • AI Services: Claude integration for enhanced analysis

Build System Integration

  • MSBuild: Primary build system for Windows compilation
  • CMake: Cross-platform build file generation
  • Resource Compiler: RC.EXE integration for resource processing
  • Linker Integration: LIB.EXE and LINK.EXE for final assembly

Security Architecture

NSA-Level Security Standards

  • No Hardcoded Values: All configuration externalized
  • Input Sanitization: Comprehensive validation of all inputs
  • Secure File Handling: Temporary file management and cleanup
  • Access Control: Strict permission validation throughout
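
One common form of input sanitization is confining user-supplied paths to an allowed directory. The following sketch shows that general technique; the `resolve_inside` helper is a hypothetical name and is not taken from the project's code.

```python
from pathlib import Path


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root and reject anything that escapes root."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real destination. An absolute user_path replaces root
    # in the join and is rejected unless it already lies inside root.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"Path escapes allowed directory: {user_path}")
    return candidate
```

A call such as `resolve_inside(Path("input"), "../secrets.txt")` raises `ValueError`, while `resolve_inside(Path("input"), "samples/app.exe")` returns the resolved path. Both paths are placeholder values.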

Threat Mitigation

  • Code Injection Prevention: Sanitized execution environments
  • Resource Exhaustion Protection: Memory and CPU usage limits
  • Privilege Escalation Prevention: Minimal required permissions
  • Data Exfiltration Prevention: Controlled output and logging

Scalability and Performance

Parallel Execution

  • Batch Processing: Agents execute in parallel where dependencies allow
  • Resource Management: Intelligent CPU and memory allocation
  • Load Balancing: Work distribution across available cores
  • Caching: Intermediate result caching for performance
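
Running one batch of independent agents in parallel could look like the sketch below, which uses the standard library's `ThreadPoolExecutor`. The `run_batch` function, the `AgentFn` alias, and the `agent_results` context key are assumptions for illustration; the orchestrator may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

# Hypothetical signature: an agent takes the context and returns its data.
AgentFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_batch(
    batch: List[int],
    agents: Dict[int, AgentFn],
    context: Dict[str, Any],
    max_workers: int = 4,
) -> None:
    """Run every agent in batch concurrently and store results in context."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {agent_id: pool.submit(agents[agent_id], context) for agent_id in batch}
        for agent_id, future in futures.items():
            # result() blocks until the agent finishes and re-raises any
            # exception it raised. Results are written from this thread
            # only, so the results dict is never mutated concurrently.
            context["agent_results"][agent_id] = future.result()
```

Threads suit agents that mostly wait on external tools or disk I/O. CPU-bound agents would likely benefit more from `ProcessPoolExecutor`, at the cost of requiring picklable agent callables and context data.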

Optimization Strategies

  • Lazy Loading: Components loaded only when needed
  • Memory Management: Efficient memory usage and cleanup
  • Disk I/O Optimization: Minimized file system operations
  • Network Optimization: Efficient external tool communication

Next: Agent Documentation - Detailed agent specifications
Related: Getting Started - Installation and setup guide