Data Warehouse Pipeline

To support reliable analytics and modeling on quarterly financial data, we’ve designed a robust, layered data pipeline based on medallion architecture principles. This pipeline ingests raw API data, validates and enriches it, and ultimately transforms it into star-schema fact and dimension tables within a central data warehouse. The diagram below outlines each major stage of the pipeline, followed by a detailed breakdown of its components and responsibilities.

```mermaid
flowchart TD
    subgraph Acquisition
        A[Raw API Data] --> B[Acquisition Service]
        B --> C[Bronze Layer - Raw Storage]
    end

    subgraph Validation
        C --> D[Validation Service]
        D --> E[Silver Layer - Validated / Typed]
        D -->|Errors| DX[Data Quality Logs]
    end

    subgraph Transformation
        E --> F[Transformation & Enrichment Service]
        F --> G[Gold Layer - Modeled Tables]
        F -->|Audit Metadata| MX[Metadata Registry]
    end

    subgraph Modeling
        G --> H[Dimension Services - Stocks, Sectors, Markets]
        G --> I[Fact Services - Financials]
        H --> J[Dimension Tables]
        I --> K[Fact Tables]
    end

    J --> L[Data Warehouse - Star Schema]
    K --> L

    subgraph Downstream
        L --> M[Analytics & Dashboards]
        L --> N[ML Feature Store]
        L --> O[Export to External Consumers]
    end
```

Acquisition

  • Raw API Data: Quarterly financial data is fetched directly from third-party sources (e.g., EDGAR, Alpha Vantage APIs).

  • Acquisition Service: Pulls the data and logs source metadata (timestamp, ticker, etc.); a sketch of this task follows the partition table below.

  • Bronze Layer – Raw Storage: Stores unmodified, original data for traceability and reprocessing.

  • Sources: Alpha Vantage APIs

  • Tools: Python extract scripts in the acquisition folder

  • Storage: Raw data stored in parquet format

  • Orchestration: Prefect DAGs

  • Folder: acquisition

| Table                        | Partition |
|------------------------------|-----------|
| BALANCE_SHEET                | symbol    |
| CASH_FLOW                    | symbol    |
| DIVIDENDS                    | symbol    |
| EARNINGS                     | symbol    |
| INCOME_STATEMENT             | symbol    |
| INSIDER_TRANSACTIONS         | symbol    |
| OVERVIEW                     | -         |
| TIME_SERIES_MONTHLY_ADJUSTED | symbol    |
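
As a concrete illustration of this stage, the sketch below pulls one Alpha Vantage function per ticker with Prefect and writes symbol-partitioned parquet. The `ALPHAVANTAGE_API_KEY` variable, the `data/acquisition` path, and the function names are assumptions for the example, not the repo's actual API.

```python
import os
from datetime import datetime, timezone

import pandas as pd
import requests
from prefect import flow, task

BASE_URL = "https://www.alphavantage.co/query"  # Alpha Vantage endpoint
BRONZE_ROOT = "data/acquisition"                # hypothetical bronze-layer folder


@task(retries=3, retry_delay_seconds=30)
def fetch_function(function: str, symbol: str) -> dict:
    """Pull one Alpha Vantage function (e.g. BALANCE_SHEET) for one ticker."""
    params = {
        "function": function,
        "symbol": symbol,
        "apikey": os.environ["ALPHAVANTAGE_API_KEY"],
    }
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


@task
def store_bronze(function: str, symbol: str, payload: dict) -> str:
    """Persist the payload plus source metadata, partitioned by symbol.

    Flattened here for parquet storage; a stricter bronze layer might
    persist the raw JSON byte-for-byte instead.
    """
    df = pd.json_normalize(payload.get("quarterlyReports", []))
    df["symbol"] = symbol
    df["ingested_at"] = datetime.now(timezone.utc)  # source metadata for traceability
    path = f"{BRONZE_ROOT}/{function}"
    df.to_parquet(path, partition_cols=["symbol"])  # one partition per ticker
    return path


@flow
def acquire(symbols: list[str]) -> None:
    for symbol in symbols:
        for function in ("BALANCE_SHEET", "CASH_FLOW", "INCOME_STATEMENT"):
            store_bronze(function, symbol, fetch_function(function, symbol))


if __name__ == "__main__":
    acquire(["MSFT", "AAPL"])
```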

Validation

  • Validation Service: Applies schema checks, data typing, null handling, and basic cleanup.

  • Silver Layer – Validated/Typed: Cleaned and standardized data; safe for transformation and modeling.

  • Data Quality Logs: Captures issues like missing fields, type mismatches, or late/malformed data for auditing and alerting.

  • Sources: Raw Alpha Vantage data from the Bronze layer (acquisition folder)

  • Tools: Python validation scripts that read from the acquisition folder and save to the validated folder; a Pandera sketch follows the partition table below

  • Cleaning: fill or impute missing values, dedupe records, standardize formats

  • Storage: Formatted data (date/time, float, int, str) stored in parquet format

  • Orchestration: Prefect DAGs

  • Folder: validated

| Table                        | Partition |
|------------------------------|-----------|
| BALANCE_SHEET                | symbol    |
| CASH_FLOW                    | symbol    |
| DIVIDENDS                    | symbol    |
| EARNINGS                     | symbol    |
| INCOME_STATEMENT             | symbol    |
| INSIDER_TRANSACTIONS         | symbol    |
| OVERVIEW                     | -         |
| TIME_SERIES_MONTHLY_ADJUSTED | symbol    |
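
The build checklist later on this page mentions Pandera or Pydantic for schema checks. Below is a minimal Pandera sketch of how the validation service could split a bronze frame into typed silver rows and a data-quality log; the column names and checks are illustrative, not the project's actual contract.

```python
import pandas as pd
import pandera as pa
from pandera import Check, Column

# Illustrative schema for a quarterly income-statement slice; the real
# column names and checks follow the Alpha Vantage payload.
income_schema = pa.DataFrameSchema(
    {
        "symbol": Column(str),
        "fiscalDateEnding": Column(pa.DateTime, coerce=True),
        "totalRevenue": Column(float, Check.ge(0), nullable=True, coerce=True),
        "netIncome": Column(float, nullable=True, coerce=True),
    },
    strict=False,  # tolerate extra raw columns on the way to silver
)


def validate_to_silver(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (validated rows, data-quality log) for one bronze frame."""
    try:
        return income_schema.validate(df, lazy=True), pd.DataFrame()
    except pa.errors.SchemaErrors as err:
        # failure_cases lists each offending row/check for the quality logs
        bad_rows = err.failure_cases["index"].dropna().unique()
        return income_schema.validate(df.drop(index=bad_rows)), err.failure_cases
```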

Transformation

  • Transformation & Enrichment Service: Derives new metrics (e.g., ROE, EBITDA margin), performs currency normalization, and standardizes structures; a sketch follows this list.
  • Gold Layer – Modeled Tables: Fully enriched, analysis-ready data structured in a consistent format across companies.
  • Metadata Registry: Logs lineage, transformations applied, column-level metadata, and audit trail for governance.
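
A minimal sketch of the metric-derivation step, assuming silver-layer income-statement and balance-sheet frames with Alpha Vantage-style column names (netIncome, totalShareholderEquity, ebitda, totalRevenue); the real service derives many more metrics.

```python
import pandas as pd


def derive_metrics(income: pd.DataFrame, balance: pd.DataFrame) -> pd.DataFrame:
    """Join silver statements on symbol + quarter and derive gold-layer ratios."""
    df = income.merge(balance, on=["symbol", "fiscalDateEnding"], how="inner")
    df["roe"] = df["netIncome"] / df["totalShareholderEquity"]  # return on equity
    df["ebitda_margin"] = df["ebitda"] / df["totalRevenue"]     # EBITDA margin
    return df
```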

Modeling

  • Dimension Services: Creates entity reference tables (stocks, sectors, markets), applying SCD logic where necessary.

  • Fact Services: Builds tables containing numeric, time-series data like revenue, profit, cash flow, etc.

  • Dimension Tables: Entities and descriptors (e.g., company names, sectors, geography).

  • Fact Tables: Measures tied to dimensions and time (e.g., Q2 2025 revenue for MSFT).

  • Enrichment: Joins, aggregations, and type conversions using Python in the transformation folder

  • Feature engineering: Date parts, rolling statistics, and column-level calculations

  • Orchestration: Prefect DAGs

  • Ratios (a scoring sketch follows the partition table below):

      • AlphaPulse: profitability (ROA), growth, leverage, valuation (earnings yield), momentum, stability, and a weighted composite

      • Dividend Safety: payout ratios (earnings and FCF), leverage, coverage metrics, volatility, streaks, drawdowns, and a composite safety score

  • Folder: transformed

| Table               | Partition |
|---------------------|-----------|
| DIM_STOCK           | -         |
| FACT_QTR_FINANCIALS | symbol    |
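
The weighted composite behind AlphaPulse, together with a rolling statistic from the feature-engineering bullet, might look like the sketch below. The factor names, weights, and four-quarter TTM window are illustrative assumptions, not the actual scoring model.

```python
import pandas as pd

# Illustrative factor weights; the actual AlphaPulse weighting lives in the
# transformation code and may differ.
WEIGHTS = {
    "profitability": 0.25, "growth": 0.25, "leverage": 0.15,
    "valuation": 0.15, "momentum": 0.10, "stability": 0.10,
}


def composite_score(factors: pd.DataFrame) -> pd.Series:
    """Rank each factor cross-sectionally (0..1), then take the weighted sum."""
    ranked = factors[list(WEIGHTS)].rank(pct=True)
    return sum(weight * ranked[name] for name, weight in WEIGHTS.items())


def add_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Example rolling statistic: trailing-twelve-month revenue per symbol."""
    df = df.sort_values(["symbol", "fiscalDateEnding"])
    df["revenue_ttm"] = (
        df.groupby("symbol")["totalRevenue"].transform(lambda s: s.rolling(4).sum())
    )
    return df
```
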
  • Dimension = the “who/what/where/when” context. Descriptive attributes (names, categories, hierarchies).
  • Fact = the “how many/how much” measurements. Numeric metrics tied to a specific grain (event) and foreign keys to dimensions.
| Aspect      | Dimension                                | Fact                                                   |
|-------------|------------------------------------------|--------------------------------------------------------|
| Role        | Describes context (nouns)                | Stores measures/events (verbs, numbers)                |
| Examples    | dim_customer, dim_date, dim_security     | fact_sales, fact_trades, fact_price_daily              |
| Columns     | Text/labels, hierarchies, surrogate keys | Foreign keys to dims + numeric measures                |
| Size/Change | Smaller, slowly changing (SCD1/2/etc.)   | Very large, mostly insert-only/appended                |
| Grain       | One row per entity at its natural grain  | One row per event/observation at the declared grain    |
| Additivity  | Not applicable                           | Measures are additive, semi-additive, or non-additive  |

How to tell which is which

  • If you’re asking “by what/along what axis do I slice?” → It’s a dimension.
  • If you’re asking “how many/how much did we…” → It’s a fact.
  • If it mostly holds numbers you aggregate (sum, avg, min/max) → fact.
  • If it mostly holds descriptions (name, type, sector, category) → dimension.

Example (stocks domain)

  • Dimension: dim_stock (ticker, company name, sector, currency, IPO date, …)
  • Fact: fact_price_daily (ticker_key, date_key, open, high, low, close, volume)

Nuances

  • Factless fact tables: no measures, just the occurrence of an event (e.g., a student attended a class).
  • Bridge tables: handle many‑to‑many between facts and dimensions (e.g., trade ↔ multiple brokers).
  • Degenerate dimensions: IDs sitting in the fact table itself (e.g., transaction_id as a textual attribute).

Data Warehouse – Star Schema

  • Combines fact and dimension tables into a star schema optimized for analytics, slicing/dicing, and time-series comparison.
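
A typical star-schema slice, sketched with pandas: join FACT_QTR_FINANCIALS to DIM_STOCK and aggregate a measure along a dimension attribute. The paths and the sector column are assumptions, and the join uses symbol for simplicity where the warehouse proper would use surrogate keys.

```python
import pandas as pd

# Paths are illustrative; both tables live in the transformed layer.
dim_stock = pd.read_parquet("data/transformed/DIM_STOCK")
facts = pd.read_parquet("data/transformed/FACT_QTR_FINANCIALS")

# Classic slice: join the fact to its dimension, then aggregate a measure
# along a dimension attribute.
revenue_by_sector = (
    facts.merge(dim_stock, on="symbol")
    .groupby(["sector", "fiscalDateEnding"])["totalRevenue"]
    .sum()
    .reset_index()
)
```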

Downstream Consumption

  • Analytics & Dashboards: Power BI, Tableau, or custom Streamlit apps use this data for executive dashboards and KPI tracking.
  • ML Feature Store: Provides cleaned, ready-to-use features to machine learning models (e.g., for scoring or prediction).
  • Export to External Consumers: Enables pushing data to clients, partners, or reporting systems (e.g., CSVs, APIs, S3, etc.).

Data Warehouse Build Checklist

Here’s a checklist you can use to audit and improve your process:

Ingestion & Raw Storage

  • Ingests all raw API data in original schema
  • Stores raw data in append-only, immutable format
  • Captures source metadata (e.g. filing date, ticker, quarter)

Validation & Data Quality

  • Performs schema validation (using Pandera, Pydantic, or similar)
  • Logs validation errors separately
  • Tracks % completeness, timeliness, and type coercions
  • Version-controls schema (especially for API changes)

Silver Layer (Cleaned Data)

  • Applies typing, null handling, standard formatting
  • Retains data in source-native structure but cleaned
  • Adds metadata (timestamps, validation results, source ID)

Transformation & Enrichment

  • Derives key ratios and standardized metrics (e.g., ROE, EPS)
  • Handles currency normalization (if needed)
  • Adds industry/sector tagging from a reference source
  • Tracks data lineage (what inputs produced which outputs)
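
Lineage tracking can be as simple as an append-only registry. A sketch, assuming a JSON-lines file at a hypothetical data/metadata/lineage.jsonl; the field names are illustrative:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

REGISTRY = Path("data/metadata/lineage.jsonl")  # hypothetical registry location


def record_lineage(output_table: str, inputs: list[str], transform: str) -> None:
    """Append one lineage record: which inputs produced which output, and how."""
    entry = {
        "output": output_table,
        "inputs": inputs,
        "transform": transform,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    REGISTRY.parent.mkdir(parents=True, exist_ok=True)
    with REGISTRY.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")


record_lineage(
    "FACT_QTR_FINANCIALS",
    ["validated/INCOME_STATEMENT", "validated/BALANCE_SHEET"],
    "derive_metrics v1",
)
```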

Gold Layer & Dimensional Modeling

  • Fact tables contain fully normalized, comparable metrics
  • Dimension tables track stock metadata with SCD2 history (see the sketch after this list)
  • All tables have surrogate keys and audit timestamps
  • Fact/dimension relationships conform to star schema
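
A pandas sketch of the SCD2 upsert referenced above, assuming valid_from/valid_to/is_current audit columns and symbol as the natural key; it tracks a single attribute (sector) and skips brand-new symbols for brevity.

```python
import pandas as pd


def scd2_upsert(dim: pd.DataFrame, incoming: pd.DataFrame,
                now: pd.Timestamp) -> pd.DataFrame:
    """Close out changed rows and append new versions (SCD Type 2)."""
    current = dim[dim["is_current"]]
    merged = current.merge(incoming, on="symbol", suffixes=("_old", ""))
    changed = merged.loc[merged["sector_old"] != merged["sector"], "symbol"]

    # Expire the superseded versions.
    expire = dim["symbol"].isin(changed) & dim["is_current"]
    dim.loc[expire, ["valid_to", "is_current"]] = [now, False]

    # Append the new versions with fresh validity windows.
    new_rows = incoming[incoming["symbol"].isin(changed)].assign(
        valid_from=now, valid_to=pd.NaT, is_current=True
    )
    return pd.concat([dim, new_rows], ignore_index=True)
```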

Pipeline Architecture

  • All services are modular and composable
  • Pipeline orchestrated with DAG tool (Airflow, Dagster, Prefect)
  • Logs and metrics for each stage are centrally collected
  • Retry logic and alerting in place for failed steps
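
With Prefect, retries and failure alerting attach directly to the task. A minimal sketch; the alert hook here just prints, where a real pipeline would page or post to Slack:

```python
from prefect import flow, task


def alert_on_failure(tsk, task_run, state):
    """Failure hook; swap the print for a real pager/Slack call."""
    print(f"ALERT: {task_run.name} finished in state {state.type}")


@task(retries=3, retry_delay_seconds=60, on_failure=[alert_on_failure])
def load_gold_tables() -> None:
    ...  # transformation step that may fail transiently


@flow(log_prints=True)
def nightly_pipeline() -> None:
    load_gold_tables()
```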

Downstream Readiness

  • Tables are partitioned and indexed for fast queries (see the read sketch after this list)
  • Supports both batch and real-time consumption
  • Has snapshots or slowly changing dimensions to support time travel
  • Feature store integration (optional but powerful for ML use)
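
Because the fact table is partitioned by symbol, a filtered read only touches the matching partition directories. A sketch, assuming the transformed-layer path used earlier:

```python
import pandas as pd

# Partition pruning: only the symbol=MSFT directory is read from disk.
msft = pd.read_parquet(
    "data/transformed/FACT_QTR_FINANCIALS",
    filters=[("symbol", "=", "MSFT")],
)
```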