Volnux Technical Design: Recursive Orchestration & Resource Isolation
- Status: Draft
- Area: Core Engine / Execution Layer
- Author: nshaibu
Problem: Recursive Deadlocks & Resource Starvation (The "Fork Bomb" Risk)
1. Introduction
The Volnux Framework utilises a recursive orchestration model where workflows can contain Groupings `{}` and BatchPipelines. In this model, an engine instance may delegate the execution of a sub-graph to a sub-engine. While this provides excellent encapsulation and state isolation, it introduces the risk of "Recursive Resource Starvation" (a form of fork bomb).
2. The Problem: The Recursive Deadlock
In a naive implementation, if a parent workflow is executed within a worker process, it occupies a slot in the system’s resource pool. If that parent workflow then attempts to schedule its children into the same pool, a deadlock occurs:
- Parent occupies Worker Slot A and waits for Child.
- Child is queued but cannot be assigned a worker because all slots (including Slot A) are occupied by parents.
- The System Deadlocks, resulting in 0% throughput despite 100% resource utilization.
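This failure mode can be reproduced in miniature. The sketch below (purely illustrative, not Volnux code) uses a single-slot pool to stand in for the whole worker pool, and a timeout to make visible what would otherwise be an unbounded hang:

```python
import concurrent.futures as cf

pool = cf.ThreadPoolExecutor(max_workers=1)  # one slot stands in for the whole pool

def child():
    return "child done"

def parent():
    # Parent occupies the only slot, then queues its child into the same pool.
    inner = pool.submit(child)
    # In a real system this wait is unbounded; the timeout makes the hang visible.
    return inner.result(timeout=2)

outer = pool.submit(parent)
try:
    outer.result()
    deadlocked = False
except cf.TimeoutError:
    deadlocked = True  # the child never got a slot while the parent held it

pool.shutdown(wait=True)
```

The parent holds the only slot while waiting for a child that can never be scheduled: exactly the deadlock described above, scaled down to one worker.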
3. The Resolution Strategy: Tiered Execution Logic
To resolve this, Volnux implements a two-pronged strategy that separates Orchestration (The Brain) from Execution (The Muscle).
A. Non-Blocking Orchestration (Option 1)
Volnux enforces a strict rule regarding the execution environment of Engines and Sub-Engines:
- Engine Layer (Orchestrators): All engine instances (including BatchPipeline and TaskGrouping controllers) must execute in a non-blocking, lightweight layer (AsyncIO or Threading).
- Task Layer (Executors): Only the "leaf" tasks (the actual business logic) are permitted to occupy slots in the ProcessPoolExecutor.
Result: Since the "Orchestrator" never occupies a process worker slot, it cannot starve the "Tasks" it is trying to schedule.
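A minimal sketch of this split, with asyncio as the orchestration layer. (To keep the example self-contained, a thread pool stands in for the global process pool; the function names are illustrative, not the framework's API.)

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor  # stand-in for the global process pool

def leaf_task(x):
    # "Muscle": the only code allowed to occupy a worker slot.
    return x * x

async def orchestrate(pool, items):
    # "Brain": awaits inside the event loop and never holds a worker slot,
    # so it cannot starve the tasks it schedules.
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(pool, leaf_task, i) for i in items]
    return await asyncio.gather(*futures)

with ThreadPoolExecutor(max_workers=2) as pool:
    results = asyncio.run(orchestrate(pool, [1, 2, 3]))
```

However deep the recursion goes, every `orchestrate` coroutine costs only an event-loop task, while the bounded pool is reserved for leaf work.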
B. Singleton Worker Pool (GlobalPoolManager)
To prevent the system from spawning an uncontrolled number of processes (Fork Bomb), Volnux overrides the standard ProcessPoolExecutor with a Singleton Worker Pool.
- Pre-Initialization: A system-wide worker pool is initialized at startup with a hard cap on worker count (typically N workers for N CPU cores).
- Centralized Queue: All engines, regardless of their depth in the recursive tree, submit their CPU-bound tasks to the same global queue.
- Warm Workers: Processes are kept "warm", eliminating the cost of spawning a new Python interpreter for every task.
```python
import logging
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from threading import Lock

logger = logging.getLogger("volnux.resource_manager")


class GlobalPoolManager:
    """
    Singleton manager for Volnux worker processes.

    Ensures a hard cap on system resources and prevents recursive fork bombs.
    """

    _instance = None
    _lock = Lock()

    def __new__(cls, *args, **kwargs):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._initialized = False
                cls._instance._executor = None
        return cls._instance

    def initialize(self, max_workers: int | None = None):
        """
        Initializes the warm pool once.

        Subsequent calls are ignored and the existing pool is reused.
        """
        with self._lock:
            if self._initialized:
                return
            # Default to the CPU count if no cap is provided.
            num_workers = max_workers or multiprocessing.cpu_count()
            # Using the 'spawn' context is critical to avoid deadlocks
            # when the parent engine is multi-threaded.
            ctx = multiprocessing.get_context('spawn')
            try:
                self._executor = ProcessPoolExecutor(
                    max_workers=num_workers,
                    mp_context=ctx,
                    initializer=self._worker_init,
                )
                self._initialized = True
                logger.info(f"Volnux Global Pool initialized with {num_workers} workers.")
            except Exception as e:
                logger.error(f"Failed to initialize GlobalPoolManager: {e}")
                raise

    @staticmethod
    def _worker_init():
        """
        Runs inside each worker process at startup.

        Used for setting up signal handlers or clearing local state.
        """
        pass

    def submit_task(self, task_func, *args, **kwargs):
        """
        Submits a leaf task for execution.

        Engines call this to delegate 'Muscle' work.
        """
        if not self._initialized:
            self.initialize()
        return self._executor.submit(task_func, *args, **kwargs)

    def shutdown(self):
        """Gracefully closes the pool."""
        with self._lock:
            if self._executor:
                self._executor.shutdown(wait=True)
                self._initialized = False
                logger.info("Volnux Global Pool shut down.")


# Usage example for the Task Layer:
# future = GlobalPoolManager().submit_task(cpu_bound_logic, data)
```
4. Architectural Implementation
Task Grouping {} and Batching
- TaskGrouping: Handled as a composite task. The main engine delegates execution to a sub-engine, which runs in a thread.
- BatchPipeline: Viewed as a high-scale extension of the grouping pattern. It chunks data and spawns parallel sub-engines in an async/threaded layer.
- Restriction: Grouping and batching orchestrators are forbidden from using ProcessPoolExecutor. They are limited to ThreadPoolExecutor, AsyncExecutor, or RemoteExecutor (e.g. GRPCExecutor, TCPExecutor).
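A rough sketch of the batching pattern under this restriction. (The function names and the inlined leaf logic are illustrative assumptions, not the framework's API; in the real system each sub-engine would delegate its leaf tasks to the shared global process pool.)

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(data, size):
    """Split the input into fixed-size batches."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def run_sub_engine(batch):
    # Each sub-engine orchestrates one chunk. Here the leaf logic is inlined
    # as a placeholder; real leaf tasks would go through the global pool.
    return [x * 2 for x in batch]

def batch_pipeline(data, batch_size=3, orchestrator_threads=4):
    # Orchestrators live in a thread layer and never occupy a process slot.
    with ThreadPoolExecutor(max_workers=orchestrator_threads) as threads:
        batches = threads.map(run_sub_engine, chunk(data, batch_size))
        return [item for batch in batches for item in batch]

results = batch_pipeline(list(range(7)))
```

Because the orchestrator threads are cheap and the CPU-bound work stays behind the hard-capped global pool, the batch fan-out cannot multiply the number of heavyweight processes.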
5. Monadic Grouping & Context Aggregation
Volnux treats Task Groupings as Logical Decision Nodes.
- Context Isolation: Every sub-engine operates on a new branch of the `ExecutionContext` tree.
- State Aggregation: Upon completion of a group, the sub-engine applies the `ResultEvaluationStrategy` (e.g., `ALL_MUST_SUCCEED`) to the leaf nodes of the sub-context.
- Path Selection: This results in a single descriptor (`0` or `1`) being returned to the Parent Engine.
- Data Integrity: The Parent Engine retains a reference to the root of the sub-context tree. This allows users to programmatically traverse the doubly linked list of nodes to inspect specific errors, even after the sub-engine has been de-allocated.
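The aggregation step can be sketched as a small evaluator. (`ALL_MUST_SUCCEED` is named in this document; the second strategy and the function shape are assumptions for illustration.)

```python
from enum import Enum

class ResultEvaluationStrategy(Enum):
    ALL_MUST_SUCCEED = "all"   # named in this design
    ANY_MAY_SUCCEED = "any"    # hypothetical sibling strategy

def evaluate_group(leaf_successes, strategy):
    """Collapse a sub-context's leaf results into the single 0/1 descriptor
    handed back to the Parent Engine for path selection."""
    reducer = all if strategy is ResultEvaluationStrategy.ALL_MUST_SUCCEED else any
    return 1 if reducer(leaf_successes) else 0
```

For example, `evaluate_group([True, True, False], ResultEvaluationStrategy.ALL_MUST_SUCCEED)` yields `0`, steering the parent down the failure branch while the full sub-context remains available for inspection.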
6. Decoupled Context Persistence
A key architectural pillar of Volnux is the separation between Execution (Engines) and State (Context).
- State as a Tree: Because every grouping `{}` creates a sub-context, the final execution state is a fractal tree.
- Bidirectional Traversal: The use of a doubly linked list at each node allows the user to perform "Time-Travel Debugging": moving backward from a failure node to find the specific input that caused it.
- Engine as a Pure Function: The Engine remains stateless; it consumes the `ExecutionContext`, performs a transition, and updates the tree. This makes the Engine highly swappable (e.g., moving from a local Engine to a remote Celery Engine) without losing history.
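A minimal sketch of the doubly linked context nodes and the backward walk (the node fields are assumptions for illustration, not the framework's actual schema):

```python
class ContextNode:
    """One node in the execution-context chain (illustrative shape)."""
    def __init__(self, task_name, inputs=None):
        self.task_name = task_name
        self.inputs = inputs
        self.error = None
        self.prev = None  # doubly linked: enables backward traversal
        self.next = None

def link(prev_node, node):
    """Append a node to the chain and wire both directions."""
    prev_node.next, node.prev = node, prev_node
    return node

def trace_back(failure_node):
    """'Time-travel': walk backward from a failure node toward the root."""
    node = failure_node
    while node is not None:
        yield node
        node = node.prev

# Build a small chain: root -> transform -> failing sink.
root = ContextNode("root", inputs={"path": "data.csv"})
mid = link(root, ContextNode("transform", inputs={"rows": 100}))
sink = link(mid, ContextNode("sink", inputs={"rows": 0}))
sink.error = ValueError("empty batch")

history = [n.task_name for n in trace_back(sink)]
```

Walking `history` from the failed `sink` back through `transform` to `root` lets a user inspect each node's `inputs` to find where the bad data entered, without re-running the workflow.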
7. Conclusion
By ensuring that the "waiting" components (Engines) never consume "parallel" resources (Processes), Volnux achieves a fractal, recursive architecture that is immune to local fork bombs and resource starvation. This architecture allows Volnux to scale from simple local scripts to complex, multi-layered batch processing pipelines with predictable memory and CPU footprints.