Why another workflow orchestrator? - nshaibu/volnux GitHub Wiki
# Volnux Framework: The Rationale for a New Orchestrator
This document outlines the core features of the Volnux framework, demonstrating how its design uniquely addresses the performance, usability, and collaboration demands of modern, event-driven data and ML pipelines. The central rationale for Volnux is to combine Python's productivity with cloud-native scale and architectural resilience.
## I. Volnux Core Ambition
To build a High-Performance, Highly Resilient, and Collaboration-Fostering event-driven workflow orchestrator, specifically optimised for MLOps, Computer Vision, and modern data engineering workloads.
Volnux's architecture is built on three pillars that overcome the limitations of traditional orchestrators and schedulers:
- High Performance: Achieved through intelligent hybrid concurrency that bypasses the GIL.
- Resilience and Scalability: Delivered by adaptive resource allocation and robust distributed execution models.
- Collaboration and Simplicity: Enabled by the intuitive Pointy-lang DSL and decoupled task management.
## II. Feature Breakdown and Ambition Fulfilment
### A. Feature 1: Declarative Workflow DSL (Pointy-lang)
| Description | Ambition Fulfilled | Practical Use Cases |
|---|---|---|
| Graph-Based Definition: Uses a simple, intuitive syntax to define sequencing, parallel execution, and conditional branching. | Collaboration & Ease of Use: Hides the complexity of distributed computing behind an intuitive, readable language. Non-technical users (e.g., Business Analysts) can define complex business rules without writing Python code. | Business Rule Routing: An analyst encodes a conditional approval path (e.g., branching on a risk score) directly in Pointy-lang, with no Python required. |
| Input Schema Injection: The Pipeline class defines required workflow inputs, which are automatically injected into tasks (Dependency Injection). | Productivity & Resilience: Provides strong data typing and validation upfront, preventing runtime errors. Cleans up task signatures, focusing engineers purely on business logic. | API/Webhook Integration: Defines a strict, self-documenting contract for external services calling the workflow via a webhook. |
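The injection pattern described above can be sketched in plain Python. The `PipelineInputs` class and `inject_and_call` helper below are illustrative assumptions, not Volnux's actual API; they only show how declared inputs can be matched against a task's signature so the task receives exactly the parameters it names:

```python
import inspect
from dataclasses import dataclass


@dataclass
class PipelineInputs:
    """Hypothetical declaration of a workflow's required inputs."""
    customer_id: str
    threshold: float


def inject_and_call(task, inputs):
    """Pass only the inputs a task's signature declares (dependency injection).

    This keeps task signatures focused on business logic: the task names what
    it needs, and the orchestrator supplies it from the validated input set.
    """
    wanted = inspect.signature(task).parameters
    kwargs = {name: getattr(inputs, name) for name in wanted}
    return task(**kwargs)


def score(customer_id, threshold):
    # A task that only declares the inputs it actually uses.
    return f"{customer_id}>{threshold}"
```

Because the input set is a typed dataclass, a missing or misnamed field fails loudly before any task runs, which is the upfront-validation benefit the table describes.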
### B. Feature 2: Execution Engine and Performance Backends
| Description | Ambition Fulfilled | Practical Use Cases |
|---|---|---|
| Hybrid Execution Layer: Tasks are submitted to various backends: Python's ThreadPoolExecutor (I/O-bound) or ProcessPoolExecutor (CPU-bound), C/Rust bindings, or remote execution engines (Kafka, Kubernetes). | High Performance & Resilience: Bypasses the Python Global Interpreter Lock (GIL) for true parallelism in CPU-intensive tasks (e.g., model training). Utilises optimised bindings for extreme performance where needed. | MLOps Training: Use ProcessPoolExecutor for parallel model training across cores, while using ThreadPoolExecutor for concurrent metadata logging (I/O) to an external database. |
| GPU/CPU Fallback: Tasks can be flagged for GPU execution, but automatically and gracefully degrade to CPU execution if the GPU resource is unavailable or unhealthy. | Resilience & Portability: Ensures that workflows never fail due to hardware absence. The same workflow definition runs on a GPU-enabled production cluster for high performance and a CPU-only staging environment for testing. | Cost Optimisation: Non-critical inference jobs can run on cheaper CPU workers during off-peak hours, preserving expensive GPU resources for priority tasks. |
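The hybrid thread/process dispatch described in the table can be sketched with Python's standard `concurrent.futures` pools. The `(kind, fn, arg)` tuple shape and the `run` dispatcher are assumptions for illustration, not Volnux's submission API:

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def cpu_bound(n):
    # Module-level so it can be pickled and shipped to a worker process.
    return sum(i * i for i in range(n))


def io_bound(name):
    # Stand-in for a blocking network or disk call.
    return f"fetched:{name}"


def run(tasks):
    """Dispatch each (kind, fn, arg) tuple to the pool suited to its workload:
    threads for I/O-bound tasks, separate processes (no GIL contention) for
    CPU-bound tasks."""
    ctx = multiprocessing.get_context("fork")  # POSIX-only; keeps the sketch simple
    with ThreadPoolExecutor() as tp, ProcessPoolExecutor(mp_context=ctx) as pp:
        futures = [(tp if kind == "io" else pp).submit(fn, arg)
                   for kind, fn, arg in tasks]
        return [f.result() for f in futures]
```

This mirrors the MLOps use case from the table: CPU-heavy work (e.g., training) goes to the process pool for true parallelism, while concurrent metadata logging stays on cheap threads.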
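The GPU-to-CPU fallback can be illustrated with a minimal sketch. The `nvidia-smi` probe is a crude stand-in for real device health-checking, and `run_task` is a hypothetical task wrapper, not Volnux's actual flagging mechanism:

```python
import shutil


def gpu_available():
    # Crude availability probe: look for the NVIDIA driver tooling on PATH.
    # A real orchestrator would also health-check the device itself.
    return shutil.which("nvidia-smi") is not None


def run_task(data, prefer_gpu=True):
    """Run on the GPU when flagged and available; otherwise degrade to CPU.

    The same workflow definition therefore runs unchanged on a GPU-enabled
    production cluster and on a CPU-only staging environment.
    """
    if prefer_gpu and gpu_available():
        device = "gpu"  # a real task would move its tensors to the device here
    else:
        device = "cpu"  # graceful fallback keeps the workflow from failing
    return device, [x * 2 for x in data]
```

Setting `prefer_gpu=False` also covers the cost-optimisation case from the table: non-critical jobs can be pinned to CPU workers regardless of GPU availability.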
### C. Feature 3: Scaling and Operational Intelligence
| Description | Ambition Fulfilled | Practical Use Cases |
|---|---|---|
| Adaptive Scaling: A runtime monitor tracks CPU and memory utilisation against a predefined quota, automatically adjusting worker and queue sizes. | High Performance & Resilience: Provides self-tuning capability. Eliminates resource over-provisioning (cost savings) and under-provisioning (performance loss). Guarantees stability during unpredictable traffic spikes. | Event Burst Handling: Handles a massive, sudden influx of events (e.g., a marketing campaign launch) by rapidly scaling workers up, then scaling down gradually after the peak load subsides, optimising cloud costs. |
| Batch Pipeline Execution: A declarative field on the Pipeline class defines a batch/chunk size. Volnux automatically splits large input data and spawns multiple, parallel workflow instances (submitted via pools or Kafka). | Scalability & Specialised Fit (Data/ML): Turns a large single-workflow job into many smaller, fault-tolerant, concurrent jobs. This is essential for processing massive datasets found in MLOps and Big Data ETL. | Parallel Inference: Processing 1 million customer records by splitting them into 1,000 batches, allowing the entire model inference workflow to run 1,000 times in parallel. |
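The scale-up-fast, scale-down-gradually behaviour described for event bursts can be sketched as a tiny control function. The thresholds and doubling policy here are illustrative assumptions, not Volnux's tuned defaults:

```python
def next_worker_count(current, utilisation, quota=0.75, lo=1, hi=32):
    """Decide the next worker-pool size from observed utilisation vs. quota.

    Scale up aggressively on overload (to absorb event bursts), but scale
    down one worker at a time so a brief lull doesn't thrash the pool.
    """
    if utilisation > quota:
        return min(hi, current * 2)   # rapid scale-up for traffic spikes
    if utilisation < quota / 2:
        return max(lo, current - 1)   # gradual scale-down after the peak
    return current                    # within quota: hold steady
```

A runtime monitor would call this periodically with fresh CPU/memory readings and resize the worker pool and queues accordingly.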
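The batch-splitting step itself is simple chunking; each chunk would then be submitted as an independent workflow instance (via pools or Kafka). `make_batches` is a sketch of that splitting step only, not Volnux's declarative batch field:

```python
def make_batches(records, batch_size):
    """Split a large input into fixed-size chunks, each of which can run as
    an independent, fault-tolerant workflow instance in parallel."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]
```

For the table's parallel-inference example, splitting 1 million records with a batch size of 1,000 yields 1,000 chunks, each driving one concurrent run of the inference workflow.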
### D. Feature 4: Collaboration and Reusability
| Description | Ambition Fulfilled | Practical Use Cases |
|---|---|---|
| External Task Hosting: Tasks do not have to be implemented locally; they can be pulled and instantiated from external, versioned repositories (PyPI, GitHub). | Collaboration & Ease of Use: Enforces a clean separation between Task Implementation (Engineers) and Workflow Design (Analysts). Promotes code reusability across the entire organisation. | Reusable Task Libraries: A Platform Team publishes a standardised pypi::db_connector task. Data Scientists consume this task in their Pointy-lang workflows, guaranteeing consistency in database access across all projects. |
| Triggers and Nested Workflows: Tasks can trigger the execution of other workflows (Child Workflows), with the option to wait synchronously or execute asynchronously/in parallel. | Resilience & Clarity: Enables the breakdown of massive processes into small, reusable modules. The use of triggers supports both synchronous orchestration (Saga pattern) and asynchronous event-driven choreography. | Payment Retry System: A parent order task triggers a child workflow to handle payment retries and waits for the child workflow's final success/failure, ensuring robust transaction management. |
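Resolving an externally hosted task from a versioned reference can be sketched as a registry lookup. The `"pypi::db_connector"` scheme mirrors the table's example, but the registry, decorator, and resolver below are assumptions, not Volnux's real loading mechanism (which would fetch and instantiate the task from PyPI or GitHub):

```python
# Hypothetical in-process registry mapping task references to callables.
TASK_REGISTRY = {}


def register(ref):
    """Publish a task under a reference string, e.g. 'pypi::db_connector'."""
    def wrap(fn):
        TASK_REGISTRY[ref] = fn
        return fn
    return wrap


def resolve(ref):
    """Look up a task by reference; a real resolver would pull and pin a
    specific published version instead of reading a local dict."""
    try:
        return TASK_REGISTRY[ref]
    except KeyError:
        raise LookupError(f"unknown task reference: {ref}") from None


@register("pypi::db_connector")
def db_connector(dsn):
    # Placeholder body: the platform team's standardised connection logic.
    return f"connected:{dsn}"
```

The point of the indirection is the separation of roles the table describes: engineers publish and version tasks, while workflow authors only ever handle opaque references in Pointy-lang.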
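The synchronous-vs-asynchronous child-workflow trigger can be sketched with a shared thread pool. The `trigger` helper and its `wait` flag are illustrative assumptions about the pattern, not Volnux's trigger API:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool standing in for the orchestrator's child-workflow runner.
_executor = ThreadPoolExecutor(max_workers=4)


def trigger(workflow, *args, wait=True):
    """Fire a child workflow.

    wait=True  -> block until the child finishes and return its result
                  (synchronous orchestration, as in the Saga pattern).
    wait=False -> return a future immediately so the parent continues
                  (asynchronous, event-driven choreography).
    """
    future = _executor.submit(workflow, *args)
    return future.result() if wait else future


def retry_payment(order_id):
    # Placeholder child workflow: would attempt payment with retries.
    return f"paid:{order_id}"
```

In the table's payment example, the parent order task would call `trigger(retry_payment, order_id, wait=True)` so it only proceeds once the child reports final success or failure.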
## III. Conclusion: The Volnux Rationale
Volnux is not simply another Python wrapper for a scheduler; it is a purpose-built, hybrid orchestrator designed to meet the extreme demands of modern AI-driven enterprises. It addresses the fundamental flaws of existing tools:
- Airflow's Rigidity: Volnux moves beyond scheduled-only batch processing to be truly event-driven.
- Python's GIL: Volnux's Hybrid Executor and Adaptive Scaling ensure high performance and resilience under heavy load.
- Complexity: The Pointy-lang DSL and External Task Model make workflow creation accessible to the entire organization, fostering cross-functional collaboration.
Volnux delivers a powerful combination of simplicity, speed, and reliability required to build the next generation of scalable, automated MLOps and Data Systems.