
CEI-InOE Ingestor Architecture

Overview

The CEI-InOE Ingestor is a modular, connector-based data ingestion system designed to collect, validate, transform, and load data from multiple sources (files, APIs) into a PostgreSQL data warehouse. It uses a pool of worker threads fed by a thread-safe in-process queue for concurrent processing, with APScheduler driving periodic data discovery.

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                                 IngestorApp                                  │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────┐   │
│  │   APScheduler   │    │  Thread-Safe    │    │       WorkerPool        │   │
│  │                 │───▶│      Queue      │───▶│   (n worker threads)    │   │
│  │  discover jobs  │    │                 │    │                         │   │
│  └────────┬────────┘    └─────────────────┘    └───────────┬─────────────┘   │
│           │                                                │                 │
│           ▼                                                ▼                 │
│  ┌─────────────────────────────────────────┐   ┌────────────────────────────┐│
│  │               Connectors                │   │       PipelineRunner       ││
│  │  ┌──────┐ ┌──────┐ ┌────────┐ ┌──────┐  │   │                            ││
│  │  │ File │ │ Tago │ │Airbeld │ │Fusion│  │   │  ┌──────────────────────┐  ││
│  │  │ Conn │ │ Conn │ │ Conn   │ │Solar │  │   │  │     DataPipeline     │  ││
│  │  └──────┘ └──────┘ └────────┘ └──────┘  │   │  │  validate→transform  │  ││
│  │     │        │         │         │      │   │  │     →stage→load      │  ││
│  │     ▼        ▼         ▼         ▼      │   │  └──────────────────────┘  ││
│  │      InputEnvelope (standardized)       │   │             │              ││
│  └─────────────────────────────────────────┘   │             ▼              ││
│                                                │  ┌──────────────────────┐  ││
│                                                │  │      DAOFactory      │  ││
│                                                │  │  (Database Access)   │  ││
│                                                │  └──────────────────────┘  ││
│                                                └────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
                           ┌──────────────────┐
                           │    PostgreSQL    │
                           │  Data Warehouse  │
                           └──────────────────┘

Core Components

1. IngestorApp (main.py)

The main application orchestrator that:

  • Initializes all connectors from configuration
  • Manages the APScheduler for periodic discovery jobs
  • Creates and manages the worker pool
  • Handles graceful shutdown on SIGTERM/SIGINT

class IngestorApp:
    connectors: Dict[str, BaseConnector]  # Registry of active connectors
    queue: Queue                          # Thread-safe work queue
    scheduler: BackgroundScheduler        # APScheduler instance
    runner: PipelineRunner               # Pipeline execution handler
    worker_pool: WorkerPool              # Worker thread manager
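
The sketch below shows roughly how these pieces could be wired together at startup. It is illustrative only: the build_app() helper and the poll_interval attribute on connectors are assumptions, not the actual names in main.py.

# Illustrative wiring of scheduler, queue, and connectors (a sketch,
# not the exact main.py implementation).
import signal
from queue import Queue
from typing import Dict

from apscheduler.schedulers.background import BackgroundScheduler

def build_app(connectors: Dict[str, "BaseConnector"]) -> BackgroundScheduler:
    work_queue: Queue = Queue(maxsize=100)       # QUEUE_MAX_SIZE
    scheduler = BackgroundScheduler()

    def make_discovery_job(connector):
        def job():
            # Enqueue one InputEnvelope per discovered work item.
            for item_id in connector.discover():
                envelope = connector.fetch(item_id)
                if envelope is not None:
                    work_queue.put(envelope)     # blocks when the queue is full
        return job

    for cid, connector in connectors.items():
        connector.start()
        scheduler.add_job(make_discovery_job(connector), "interval",
                          seconds=connector.poll_interval,  # assumed attribute
                          id=cid)

    # Graceful shutdown on SIGTERM (SIGINT is handled analogously).
    signal.signal(signal.SIGTERM, lambda *_: scheduler.shutdown(wait=False))
    return scheduler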

2. WorkerPool (main.py)

A pool of daemon worker threads that process InputEnvelope messages from the queue:

  • Concurrency: Configurable via NUM_WORKERS environment variable (default: 2)
  • Processing: Each worker runs a loop that:
    1. Gets an envelope from the queue (blocking with timeout)
    2. Resolves the connector from the envelope's connector_id
    3. Executes the pipeline via PipelineRunner.run()
    4. Calls connector.ack() on success or connector.fail() on error
  • Graceful Shutdown: Uses threading.Event for coordinated shutdown

Worker Thread Lifecycle:
┌────────────────────────────────────────────────────────┐
│  while not shutdown_event.is_set():                    │
│      ├─▶ queue.get(timeout=1) → InputEnvelope          │
│      ├─▶ connector = connectors[envelope.connector_id] │
│      ├─▶ metrics = runner.run(envelope)                │
│      ├─▶ connector.ack(envelope)  # or .fail()         │
│      └─▶ queue.task_done()                             │
└────────────────────────────────────────────────────────┘
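
In Python, that loop could look roughly like the following (a sketch; the actual worker code may structure error handling differently):

import queue
import threading

def worker_loop(work_queue: queue.Queue, connectors: dict,
                runner: "PipelineRunner",
                shutdown_event: threading.Event) -> None:
    while not shutdown_event.is_set():
        try:
            # Short timeout so the shutdown_event is re-checked regularly.
            envelope = work_queue.get(timeout=1)
        except queue.Empty:
            continue
        connector = connectors[envelope.connector_id]
        try:
            runner.run(envelope)
            connector.ack(envelope)
        except Exception as exc:
            connector.fail(envelope, str(exc))
        finally:
            work_queue.task_done()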

3. Connectors (connectors/)

Connectors are responsible for discovering, fetching, and wrapping data into standardized InputEnvelope objects. All connectors implement the BaseConnector interface.

BaseConnector Interface

class BaseConnector(ABC):
    def start(self) -> None: ...                # Initialize resources
    def stop(self) -> None: ...                 # Cleanup resources
    def discover(self) -> List[str]: ...        # Find available work items
    def fetch(self, item_id: str) -> Optional[InputEnvelope]: ...  # Fetch and wrap data
    def ack(self, envelope: InputEnvelope) -> None: ...            # Mark success
    def fail(self, envelope: InputEnvelope, error: str) -> None: ...  # Mark failure
    def health(self) -> dict: ...               # Health status

InputEnvelope

A lightweight, standardized data carrier:

class InputEnvelope(BaseModel):
    connector_id: str        # Source connector ID
    input_id: str           # Idempotency key (e.g., "file.csv:{sha256}")
    source_uri: str         # Origin (file path, API URL)
    received_at: datetime   # Timestamp
    content: Any            # Raw data: List[Dict], bytes
    content_type: str       # "csv", "excel", "json"
    
    # Pipeline hints
    hint_mapping: Optional[str]      # YAML mapping file path
    hint_device_id: Optional[str]    # Device identifier
    hint_granularity: Optional[str]  # "hourly", "daily"
    
    # Provenance tracking
    metadata: Dict[str, Any]  # sha256, file_size, date ranges, etc.
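
For example, a file-based envelope might be built like this (illustrative values; the exact hint and metadata conventions are set by each connector):

import hashlib
from datetime import datetime, timezone

with open("/data/incoming/energy.csv", "rb") as f:
    raw = f.read()
sha256 = hashlib.sha256(raw).hexdigest()

envelope = InputEnvelope(
    connector_id="file",
    input_id=f"energy.csv:{sha256}",            # idempotency key
    source_uri="/data/incoming/energy.csv",
    received_at=datetime.now(timezone.utc),
    content=raw,
    content_type="csv",
    hint_mapping="energy_hourly.yaml",
    hint_device_id=None,
    hint_granularity="hourly",
    metadata={"sha256": sha256, "file_size": len(raw)},
)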

Available Connectors

| Connector            | Type        | Description                                       | Authentication          | Schedule                                |
|----------------------|-------------|---------------------------------------------------|-------------------------|-----------------------------------------|
| FileConnector        | file        | Watches /data/incoming for CSV/Excel files        | N/A                     | Every 5 s (FILE_POLL_INTERVAL)          |
| TagoConnector        | tago        | Tago.io energy API (hourly/daily)                 | Per-device token header | Every 1 hr (TAGO_POLL_INTERVAL)         |
| AirbeldConnector     | airbeld     | Airbeld environmental/weather sensors             | OAuth email/password    | Every 12 hr (AIRBELD_POLL_INTERVAL)     |
| FusionSolarConnector | fusionsolar | Huawei FusionSolar PV station API                 | Username/system code    | Every 1 hr (FUSIONSOLAR_POLL_INTERVAL)  |
| HttpConnector        | http        | Generic REST API (base class for API connectors)  | Bearer, API key, OAuth  | N/A (base class)                        |

4. PipelineRunner (pipeline_runner.py)

Orchestrates the complete ingestion workflow for a single InputEnvelope:

┌──────────────────────────────────────────────────────────────────┐
│                       PipelineRunner.run()                       │
├──────────────────────────────────────────────────────────────────┤
│  1. Check Duplicates                                             │
│     └─▶ BatchDAO.exists_by_sha256(envelope.metadata.sha256)      │
│                                                                  │
│  2. Load YAML Mapping                                            │
│     └─▶ Parse hint_mapping file                                  │
│                                                                  │
│  3. Resolve Datasource                                           │
│     └─▶ DatasourceDAO.get_by_external_id(hint_device_id)         │
│                                                                  │
│  4. Register Batch                                               │
│     └─▶ BatchDAO.register() → batch_id UUID                      │
│                                                                  │
│  5. Execute Pipeline                                             │
│     └─▶ DataPipeline(conn, mapping, context).execute(records)    │
│                                                                  │
│  6. Update Metrics                                               │
│     └─▶ BatchDAO.update_metrics(quality_score, etc.)             │
└──────────────────────────────────────────────────────────────────┘
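
Condensed into code, the method reads approximately as follows. This is a sketch: load_mapping() is a placeholder helper, and the exact DAO signatures may differ from the real implementation.

def run(self, envelope):
    dao = DAOFactory(self.connection)

    # 1. Duplicate check (idempotency by content hash)
    if dao.batch.exists_by_sha256(envelope.metadata.get("sha256")):
        raise DuplicateInputError(envelope.input_id)

    # 2. Load the YAML mapping named by the envelope
    mapping = load_mapping(envelope.hint_mapping)       # placeholder helper

    # 3. Resolve the datasource row for this device
    datasource = dao.datasource.get_by_external_id(envelope.hint_device_id)

    # 4. Register the batch and obtain its UUID
    batch_id = dao.batch.register(envelope)

    # 5. Run the ETL pipeline over the parsed records
    context = {"batch_id": batch_id, "datasource_id": datasource.id}
    metrics = DataPipeline(self.connection, mapping, context).execute(envelope.content)

    # 6. Persist quality metrics for the batch
    dao.batch.update_metrics(batch_id, metrics)
    return metrics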

5. DataPipeline (pipeline.py)

The core ETL engine that processes records through staging:

Pipeline Stages:
┌───────────┬──────────────────────────────────────────────────────┐
│  EXTRACT  │  Records already parsed by connector (CSV rows/API)  │
├───────────┼──────────────────────────────────────────────────────┤
│  VALIDATE │  PydanticTransformer validates + transforms          │
│     +     │  - Column mapping (CSV → DB columns)                 │
│ TRANSFORM │  - Type coercion (European decimals, AM/PM dates)    │
│     +     │  - Constraint validation (min/max ranges)            │
│   STAGE   │  - Insert to staging table (raw + transformed)       │
├───────────┼──────────────────────────────────────────────────────┤
│   LOAD    │  Move valid records from staging → final table       │
│           │  - Conflict resolution (update/ignore/fail)          │
│           │  - datasource_id injection for fact tables           │
└───────────┴──────────────────────────────────────────────────────┘

6. PydanticTransformer (pydantic_transformer.py)

Unified validation and transformation using Pydantic models:

  • Column Mapping: Maps source columns (CSV headers) to database columns via YAML config
  • Type Coercion: Handles European decimals (1.234,56), AM/PM dates, integer-from-float
  • Constraint Validation: Uses Pydantic Field(ge=, le=, ...) for range validation
  • Error Collection: Converts Pydantic errors to ValidationResult for staging
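
A simplified version of what such a model can look like, in Pydantic v2 syntax (the real EnergyHourlyRecord in models.py extends BaseRecord and may differ in detail):

from datetime import datetime

from pydantic import BaseModel, Field, field_validator

class EnergyHourlyRecord(BaseModel):
    ts: datetime
    energy_kwh: float = Field(ge=0, le=10000)   # constraint validation

    @field_validator("energy_kwh", mode="before")
    @classmethod
    def coerce_european_decimal(cls, v):
        # European decimal coercion: "1.234,56" -> 1234.56
        if isinstance(v, str):
            v = v.replace(".", "").replace(",", ".")
        return v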

7. DAO Layer (dao/)

Centralized database access through the DAOFactory:

dao = DAOFactory(connection)
dao.datasource      # DatasourceDAO - datasource lookup
dao.batch           # BatchDAO - ingest batch registration/dedup
dao.staging('energy_hourly')  # StagingDAO - staging operations
dao.data(conflict_config)     # DataDAO - final table inserts
dao.pipeline        # PipelineDAO - execution logging
dao.cursor          # CursorDAO - API cursor tracking

Data Flow

File Ingestion Flow

┌────────────┐    ┌────────────────┐    ┌──────────────┐
│ /data/     │    │ FileConnector  │    │   Queue      │
│ incoming/  │───▶│   .discover()  │───▶│              │
│ file.csv   │    │   .fetch()     │    │ InputEnvelope│
└────────────┘    └────────────────┘    └──────┬───────┘
                                               │
                    ┌──────────────────────────┘
                    ▼
             ┌──────────────┐
             │Worker Thread │
             │              │
             │ runner.run() │
             └──────┬───────┘
                    │
      ┌─────────────┼────────────┐
      ▼             ▼            ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ staging_ │  │ staging_ │  │ staging_ │
│ energy   │  │ environ  │  │ dairy    │
│ hourly   │  │ metrics  │  │ prod     │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ fact_    │  │ environ  │  │ dairy_   │
│ energy   │  │ mental   │  │ produc   │
│ hourly   │  │ _metrics │  │ tion     │
└──────────┘  └──────────┘  └──────────┘
                    │
                    ▼
            ┌──────────────┐
            │ /data/       │
            │ processed/   │
            │ file.csv     │
            └──────────────┘

API Ingestion Flow (Tago.io Example)

┌────────────────┐    ┌────────────────┐    ┌──────────────┐
│   APScheduler  │    │ TagoConnector  │    │   Queue      │
│   every 1 hr   │───▶│  .discover()   │───▶│              │
│                │    │  per device    │    │ InputEnvelope│
└────────────────┘    └───────┬────────┘    └──────────────┘
                              │
                              │ For each device:
                              │ ┌───────────────────────────┐
                              │ │ 1. Get cursor timestamp   │
                              │ │ 2. Query API with range   │
                              │ │ 3. Transform response     │
                              │ │ 4. Create InputEnvelope   │
                              │ │ 5. Update cursor          │
                              │ └───────────────────────────┘
                              │
                              ▼
                     ┌────────────────┐
                     │  Tago.io API   │
                     │ /data?start=.. │
                     │ device-token   │
                     └────────────────┘

Design Principles

1. Connector Abstraction

Connectors encapsulate all source-specific logic (authentication, data format, ack/fail semantics). The pipeline only sees standardized InputEnvelope objects.

2. Staging-First Pattern

All records pass through staging tables before final tables:

  • Raw data preserved for debugging
  • Validation errors stored for inspection
  • Atomic loads with rollback capability
  • Audit trail via loaded_to_final flag
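
As an illustration, the LOAD step for the energy_hourly dataset could be expressed as two psycopg-style statements (a sketch: the is_valid column and exact table layout are assumptions):

def load_batch_to_final(cur, batch_id: str) -> None:
    # Move validated staging rows into the final fact table...
    cur.execute(
        """
        INSERT INTO fact_energy_hourly (datasource_id, ts, energy_kwh, source_batch_id)
        SELECT datasource_id, ts, energy_kwh, source_batch_id
        FROM staging_energy_hourly
        WHERE source_batch_id = %(b)s AND is_valid AND NOT loaded_to_final
        """,
        {"b": batch_id},
    )
    # ...then flag them so the same rows are never loaded twice.
    cur.execute(
        """
        UPDATE staging_energy_hourly
        SET loaded_to_final = TRUE
        WHERE source_batch_id = %(b)s AND is_valid
        """,
        {"b": batch_id},
    )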

3. Idempotency

  • Files: SHA256 hash stored in ingest_batch.file_sha256
  • APIs: Cursor tracking in api_fetch_cursor table
  • Pipeline: DuplicateInputError raised for known inputs
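
A minimal sketch of the duplicate guard, using the BatchDAO lookup shown earlier (the ensure_new_input() helper is illustrative):

class DuplicateInputError(Exception):
    """Raised when an input's idempotency key has already been ingested."""

def ensure_new_input(dao, envelope) -> None:
    # Files carry their SHA256 in envelope.metadata; a hit means skip.
    sha256 = envelope.metadata.get("sha256")
    if sha256 and dao.batch.exists_by_sha256(sha256):
        raise DuplicateInputError(f"already ingested: {envelope.input_id}")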

4. Configuration-Driven Mappings

YAML mappings define:

  • Column/field mappings
  • Type coercions
  • Validation constraints
  • Conflict resolution strategy
  • Target/staging table names

5. Observable Pipeline

  • pipeline_execution table logs stage start/end times
  • data_quality_checks stores validation error samples
  • PipelineMetrics dataclass tracks all execution stats

6. Graceful Degradation

  • Per-record error handling (invalid records don't block valid ones)
  • Connector-level isolation (one failing connector doesn't affect others)
  • Automatic retries with exponential backoff for HTTP connectors

Directory Structure

ingestor/
├── Dockerfile                 # Container image definition
├── requirements.txt           # Python dependencies
└── app/
    ├── main.py               # Application entry point, WorkerPool
    ├── config.py             # Environment-based configuration
    ├── pipeline.py           # DataPipeline orchestrator
    ├── pipeline_runner.py    # InputEnvelope → Pipeline execution
    ├── models.py             # Pydantic data models (BaseRecord, EnergyHourlyRecord, etc.)
    ├── pydantic_transformer.py # Unified validation + transformation
    ├── validation.py         # ValidationResult, SchemaValidator, TypeValidator
    │
    ├── connectors/           # Data source connectors
    │   ├── base.py           # BaseConnector, InputEnvelope, ConnectorStatus
    │   ├── registry.py       # create_connector() factory
    │   ├── file_connector.py # CSV/Excel file watcher
    │   ├── http_connector.py # Generic REST API client (base for API connectors)
    │   ├── tago_connector.py # Tago.io energy API
    │   ├── airbeld_connector.py  # Airbeld environmental/weather API
    │   └── fusionsolar_connector.py  # Huawei FusionSolar PV API
    │
    ├── dao/                  # Data Access Objects
    │   ├── factory.py        # DAOFactory for centralized access
    │   ├── datasource_dao.py # datasource lookup
    │   ├── batch_dao.py      # ingest_batch registration + deduplication
    │   ├── staging_dao.py    # Staging table operations
    │   ├── data_dao.py       # Final table inserts with conflict resolution
    │   ├── pipeline_dao.py   # Execution + quality logging
    │   └── cursor_dao.py     # API cursor tracking
    │
    ├── conf/
    │   ├── datasources.yaml  # 27 datasource definitions (Tago, Airbeld, FusionSolar, File)
    │   └── site_config.yaml  # Site metadata
    │
    ├── mappings/             # YAML dataset configurations
    │   ├── energy_hourly.yaml
    │   ├── energy_daily.yaml
    │   ├── environmental_metrics.yaml
    │   ├── dairy_production.yaml
    │   ├── api_energy_hourly.yaml
    │   ├── api_energy_daily.yaml
    │   ├── api_environmental_metrics.yaml
    │   ├── api_solar_hourly.yaml
    │   ├── api_solar_daily.yaml
    │   └── api_solar_monthly.yaml
    │
    └── tests/                # Unit tests
        ├── test_models.py
        ├── test_pipeline.py
        └── test_transformer.py

Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| DB_DSN | — | PostgreSQL connection string (required) |
| NUM_WORKERS | 2 | Number of worker threads |
| QUEUE_MAX_SIZE | 100 | Maximum queue capacity |
| LOG_LEVEL | INFO | Logging verbosity |
| WATCH_DIR | /data/incoming | File connector watch directory |
| PROCESSED_DIR | /data/processed | Processed files directory |
| REJECTED_DIR | /data/rejected | Failed files directory |
| MAPPINGS_DIR | /app/mappings | YAML mappings directory |
| FILE_POLL_INTERVAL | 5 | File polling interval (seconds) |
| FILE_STABLE_SECONDS | 3 | File stability wait before reading (seconds) |
| TAGO_ENABLED | true | Enable Tago.io connector |
| TAGO_API_URL | https://api.tago.io | Tago.io base URL |
| TAGO_POLL_INTERVAL | 3600 | Tago polling interval (seconds) |
| TAGO_LOOKBACK_DAYS | 7 | Tago lookback if no cursor |
| AIRBELD_API_URL | — | Airbeld API base URL |
| AIRBELD_EMAIL | — | Airbeld account email |
| AIRBELD_PASSWORD | — | Airbeld account password |
| AIRBELD_POLL_INTERVAL | 43200 | Airbeld polling interval (seconds) |
| AIRBELD_LOOKBACK_DAYS | 7 | Airbeld lookback if no cursor |
| FUSIONSOLAR_ENABLED | true | Enable FusionSolar connector |
| FUSIONSOLAR_API_URL | https://intl.fusionsolar.huawei.com/thirdData | FusionSolar API URL |
| FUSIONSOLAR_USER | — | FusionSolar username |
| FUSIONSOLAR_SYSTEM_CODE | — | FusionSolar system code |
| FUSIONSOLAR_POLL_INTERVAL | 3600 | FusionSolar polling interval (seconds) |
| FUSIONSOLAR_LOOKBACK_DAYS | 30 | FusionSolar lookback if no cursor |
| CONF_DIR | /app/conf | Configuration files directory |
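
config.py resolves these at startup. The pattern is roughly the following (a sketch with a few representative variables):

import os

DB_DSN = os.environ["DB_DSN"]                   # required: no default, fail fast
NUM_WORKERS = int(os.getenv("NUM_WORKERS", "2"))
QUEUE_MAX_SIZE = int(os.getenv("QUEUE_MAX_SIZE", "100"))
WATCH_DIR = os.getenv("WATCH_DIR", "/data/incoming")
FILE_POLL_INTERVAL = int(os.getenv("FILE_POLL_INTERVAL", "5"))
TAGO_ENABLED = os.getenv("TAGO_ENABLED", "true").lower() == "true"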

YAML Mapping Example

dataset: energy_hourly
granularity: hourly

# Column mapping: source β†’ database
columns:
  "Date and Time": ts
  "Hourly": energy_kwh

# Type coercions
coercions:
  ts: datetime
  energy_kwh: float

# Validation rules
validation_rules:
  schema:
    required:
      - ts
      - energy_kwh
  constraints:
    energy_kwh:
      min: 0
      max: 10000

# Conflict resolution
conflict_resolution:
  strategy: update           # update | ignore | fail | append
  on_columns:
    - source_batch_id
    - ts
  update_columns:
    - energy_kwh

# Target tables
target_table: fact_energy_hourly
staging_table: staging_energy_hourly
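
One plausible way for DataDAO to translate the conflict_resolution block into PostgreSQL is an INSERT ... ON CONFLICT statement. The render_insert() helper below is a sketch, not the actual SQL generation code:

def render_insert(mapping: dict) -> str:
    cr = mapping["conflict_resolution"]
    cols = list(mapping["columns"].values()) + ["source_batch_id"]
    sql = (f"INSERT INTO {mapping['target_table']} ({', '.join(cols)}) "
           f"VALUES ({', '.join(['%s'] * len(cols))})")
    if cr["strategy"] == "update":
        # update: overwrite update_columns when on_columns collide
        sets = ", ".join(f"{c} = EXCLUDED.{c}" for c in cr["update_columns"])
        sql += f" ON CONFLICT ({', '.join(cr['on_columns'])}) DO UPDATE SET {sets}"
    elif cr["strategy"] == "ignore":
        # ignore: silently drop conflicting rows
        sql += f" ON CONFLICT ({', '.join(cr['on_columns'])}) DO NOTHING"
    # fail / append: plain INSERT; conflicts raise or duplicate as configured
    return sql

With the mapping above this yields an upsert on (source_batch_id, ts) that overwrites energy_kwh, which assumes a matching unique constraint on the target table.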

Adding a New Connector

  1. Create connector class extending BaseConnector or HttpConnector
  2. Implement required methods: start(), stop(), discover(), fetch(), ack(), fail(), health()
  3. Register in connectors/registry.py:
    CONNECTOR_TYPES['myapi'] = MyApiConnector
  4. Add entry in ingestor/conf/datasources.yaml for each datasource it manages
  5. Add environment variables in config.py for credentials and scheduling
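
Putting steps 1-3 together, a minimal connector might look like this. The myapi name, the endpoint URL, and the _get_readings() helper are placeholders for illustration:

from datetime import datetime, timezone
from typing import List, Optional

class MyApiConnector(BaseConnector):
    def start(self) -> None:
        pass                                    # e.g., open an HTTP session

    def stop(self) -> None:
        pass                                    # e.g., close the session

    def discover(self) -> List[str]:
        # One work item per device; API connectors usually consult a cursor here.
        return ["device-1", "device-2"]         # placeholder

    def fetch(self, item_id: str) -> Optional[InputEnvelope]:
        records = self._get_readings(item_id)   # placeholder HTTP call
        if not records:
            return None
        return InputEnvelope(
            connector_id="myapi",
            input_id=f"myapi:{item_id}:{datetime.now(timezone.utc):%Y%m%dT%H%M%S}",
            source_uri=f"https://api.example.com/devices/{item_id}",  # placeholder
            received_at=datetime.now(timezone.utc),
            content=records,
            content_type="json",
            hint_mapping="api_energy_hourly.yaml",
            hint_device_id=item_id,
            hint_granularity="hourly",
            metadata={},
        )

    def ack(self, envelope: InputEnvelope) -> None:
        pass                                    # e.g., advance the API cursor

    def fail(self, envelope: InputEnvelope, error: str) -> None:
        pass                                    # e.g., log; leave cursor untouched

    def health(self) -> dict:
        return {"status": "ok"}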

Adding a New Dataset

  1. Create Pydantic model in models.py extending BaseRecord
  2. Register in MODEL_REGISTRY
  3. Create YAML mapping in mappings/
  4. Add staging table mapping in StagingDAO.STAGING_TABLES
  5. Add SQLAlchemy model to shared/src/shared/models.py
  6. Run alembic revision --autogenerate -m "add_<dataset>_tables"
  7. Add Pydantic schema to shared/src/shared/schemas.py
  8. Add API router + query module if queryable via the API
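
For steps 1, 2, and 4, a hypothetical water_usage dataset would add something like the following (names are illustrative):

# models.py: define and register the record model
from datetime import datetime

from pydantic import Field

class WaterUsageRecord(BaseRecord):
    ts: datetime
    volume_m3: float = Field(ge=0)              # constraint validation

MODEL_REGISTRY["water_usage"] = WaterUsageRecord

# dao/staging_dao.py (step 4): map the dataset to its staging table, e.g.
# STAGING_TABLES["water_usage"] = "staging_water_usage"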