# Data Warehouse Pipeline
To support reliable analytics and modeling on quarterly financial data, we’ve designed a robust, layered data pipeline based on medallion architecture principles. This pipeline ingests raw API data, validates and enriches it, and ultimately transforms it into star-schema fact and dimension tables within a central data warehouse. The diagram below outlines each major stage of the pipeline, followed by a detailed breakdown of its components and responsibilities.
```mermaid
flowchart TD
    subgraph Acquisition
        A[Raw API Data] --> B[Acquisition Service]
        B --> C[Bronze Layer - Raw Storage]
    end
    subgraph Validation
        C --> D[Validation Service]
        D --> E[Silver Layer - Validated / Typed]
        D -->|Errors| DX[Data Quality Logs]
    end
    subgraph Transformation
        E --> F[Transformation & Enrichment Service]
        F --> G[Gold Layer - Modeled Tables]
        F -->|Audit Metadata| MX[Metadata Registry]
    end
    subgraph Modeling
        G --> H[Dimension Services - Stocks, Sectors, Markets]
        G --> I[Fact Services - Financials]
        H --> J[Dimension Tables]
        I --> K[Fact Tables]
    end
    J --> L[Data Warehouse - Star Schema]
    K --> L
    subgraph Downstream
        L --> M[Analytics & Dashboards]
        L --> N[ML Feature Store]
        L --> O[Export to External Consumers]
    end
```
## Acquisition

- Raw API Data: Quarterly financial data is fetched directly from a third-party source (e.g., EDGAR, Alpha Vantage APIs).
- Acquisition Service: Handles pulling the data and logging source metadata (timestamp, ticker, etc.).
- Bronze Layer – Raw Storage: Stores unmodified, original data for traceability and reprocessing.
- Sources: Alpha Vantage APIs
- Tools: Python extract scripts in the `acquisition` folder
- Storage: Raw data stored in Parquet format
- Orchestration: Prefect DAGs
- Folder: `acquisition`
| Table | Partition |
|---|---|
| BALANCE_SHEET | symbol |
| CASH_FLOW | symbol |
| DIVIDENDS | symbol |
| EARNINGS | symbol |
| INCOME_STATEMENT | symbol |
| INSIDER_TRANSACTIONS | symbol |
| OVERVIEW | - |
| TIME_SERIES_MONTHLY_ADJUSTED | symbol |
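As a minimal sketch of what one extract task in the `acquisition` folder might look like: it assumes the public Alpha Vantage query endpoint and an API key in the environment, and the helper names (`fetch_raw`, `write_bronze`) are illustrative, not the repo's actual code.

```python
import os
from datetime import datetime, timezone

import pandas as pd
import requests

BASE_URL = "https://www.alphavantage.co/query"

def fetch_raw(function: str, symbol: str) -> pd.DataFrame:
    """Pull one quarterly dataset for one ticker (hypothetical helper)."""
    resp = requests.get(
        BASE_URL,
        params={
            "function": function,  # e.g. "BALANCE_SHEET"
            "symbol": symbol,
            "apikey": os.environ["ALPHAVANTAGE_API_KEY"],  # assumed env var
        },
        timeout=30,
    )
    resp.raise_for_status()
    df = pd.DataFrame(resp.json().get("quarterlyReports", []))
    # Source metadata captured at acquisition time for traceability.
    df["symbol"] = symbol
    df["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return df

def write_bronze(df: pd.DataFrame, table: str) -> None:
    # Bronze layer: unmodified values, Parquet, partitioned by symbol.
    df.to_parquet(f"acquisition/{table}", partition_cols=["symbol"], index=False)

# write_bronze(fetch_raw("BALANCE_SHEET", "MSFT"), "BALANCE_SHEET")
```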
## Validation

- Validation Service: Applies schema checks, data typing, null handling, and basic cleanup.
- Silver Layer – Validated/Typed: Cleaned and standardized data; safe for transformation and modeling.
- Data Quality Logs: Captures issues like missing fields, type mismatches, or late/malformed data for auditing and alerting.
- Sources: Bronze-layer data acquired from the Alpha Vantage APIs
- Tools: Python validation scripts that read from the `acquisition` folder and save to the `validated` folder
- Cleaning: Fill or impute missing values, dedupe records, standardize formats
- Storage: Typed data (date/time, float, int, str) stored in Parquet format
- Orchestration: Prefect DAGs
- Folder: `validated`
| Table | Partition |
|---|---|
| BALANCE_SHEET | symbol |
| CASH_FLOW | symbol |
| DIVIDENDS | symbol |
| EARNINGS | symbol |
| INCOME_STATEMENT | symbol |
| INSIDER_TRANSACTIONS | symbol |
| OVERVIEW | - |
| TIME_SERIES_MONTHLY_ADJUSTED | symbol |
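A minimal sketch of what the validation service could do with Pandera (one of the libraries suggested in the checklist further down). The column names and data-quality log path are assumptions based on typical Alpha Vantage fields, not the repo's actual schema.

```python
import pandas as pd
import pandera as pa

# Assumed columns; the real silver-layer schema lives in the validation scripts.
balance_sheet_schema = pa.DataFrameSchema(
    {
        "symbol": pa.Column(str),
        "fiscalDateEnding": pa.Column(pa.DateTime, coerce=True),
        "totalAssets": pa.Column(float, pa.Check.ge(0), coerce=True, nullable=True),
        "totalLiabilities": pa.Column(float, pa.Check.ge(0), coerce=True, nullable=True),
    },
    strict=False,  # let extra raw columns pass through to the silver layer
)

def validate_balance_sheet(df: pd.DataFrame) -> pd.DataFrame:
    try:
        return balance_sheet_schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        # Route failures to the data quality logs rather than failing silently.
        err.failure_cases.to_parquet("validated/_dq_logs/BALANCE_SHEET.parquet")
        raise
```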
## Transformation
- Transformation & Enrichment Service: Derives new metrics (e.g., ROE, EBITDA margin), performs currency normalization, standardizes structures.
- Gold Layer – Modeled Tables: Fully enriched, analysis-ready data structured in a consistent format across companies.
- Metadata Registry: Logs lineage, transformations applied, column-level metadata, and audit trail for governance.
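A hedged sketch of the derivation step, assuming silver-layer column names that mirror Alpha Vantage's (`netIncome`, `totalShareholderEquity`, `ebitda`, `totalRevenue`); the actual metric definitions live in the transformation scripts.

```python
import pandas as pd

def enrich(financials: pd.DataFrame) -> pd.DataFrame:
    """Derive standardized metrics from validated quarterly financials."""
    out = financials.sort_values(["symbol", "fiscalDateEnding"]).copy()
    # Profitability: return on equity = net income / shareholder equity.
    out["roe"] = out["netIncome"] / out["totalShareholderEquity"]
    # EBITDA as a share of revenue.
    out["ebitda_margin"] = out["ebitda"] / out["totalRevenue"]
    # Trailing-twelve-month revenue, a typical rolling-window feature.
    out["revenue_ttm"] = (
        out.groupby("symbol")["totalRevenue"]
           .transform(lambda s: s.rolling(4).sum())
    )
    return out
```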
## Modeling

- Dimension Services: Creates entity reference tables (stocks, sectors, markets) using SCD logic where necessary.
- Fact Services: Builds tables containing numeric, time-series data such as revenue, profit, and cash flow.
- Dimension Tables: Entities and descriptors (e.g., company names, sectors, geography).
- Fact Tables: Measures tied to dimensions and time (e.g., Q2 2025 revenue for MSFT).
- Enrichment: Joins, aggregations, and type conversions using Python in the `transformation` folder
- Feature engineering: Date parts, rolling statistics, column-level calculators
- Orchestration: Prefect DAGs
- Ratios:
  - AlphaPulse: profitability (ROA), growth, leverage, valuation (earnings yield), momentum, stability, and a weighted composite (sketched after the table below)
  - Dividend Safety: payout ratios (earnings & FCF), leverage, coverage metrics, volatility, streaks, drawdowns, and a composite safety score
- Folder: `transformed`
| Table | Partition |
|---|---|
| DIM_STOCK | - |
| FACT_QTR_FINANCIALS | symbol |
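To make the weighted-composite idea behind AlphaPulse concrete, here is a minimal sketch; the factor names and weights are placeholders, not the scores actually shipped in the `transformed` tables.

```python
import pandas as pd

# Placeholder weights; the real AlphaPulse factor definitions and weights
# live in the transformation scripts.
WEIGHTS = {
    "profitability": 0.25, "growth": 0.20, "leverage": 0.15,
    "valuation": 0.20, "momentum": 0.10, "stability": 0.10,
}

def composite_score(factors: pd.DataFrame) -> pd.Series:
    """Percentile-rank each factor cross-sectionally, then blend into one score."""
    ranked = factors[list(WEIGHTS)].rank(pct=True)  # 0..1 percentile per factor
    return sum(weight * ranked[col] for col, weight in WEIGHTS.items())
```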
- Dimension = the “who/what/where/when” context. Descriptive attributes (names, categories, hierarchies).
- Fact = the “how many/how much” measurements. Numeric metrics tied to a specific grain (event) and foreign keys to dimensions.
| | Dimension | Fact |
|---|---|---|
| Role | Describes context (nouns) | Stores measures/events (verbs, numbers) |
| Examples | `dim_customer`, `dim_date`, `dim_security` | `fact_sales`, `fact_trades`, `fact_price_daily` |
| Columns | Text/labels, hierarchies, surrogate keys | Foreign keys to dims + numeric measures |
| Size/Change | Smaller, slowly changing (SCD1/2/etc.) | Very large, mostly insert-only/appended |
| Grain | One row per entity at its natural grain | One row per event/observation at the declared grain |
| Additivity | Not applicable | Measures are additive, semi-additive, or non-additive |
### How to tell which is which
- If you’re asking “by what/along what axis do I slice?” → It’s a dimension.
- If you’re asking “how many/how much did we…” → It’s a fact.
- If it mostly holds numbers you aggregate (sum, avg, min/max) → fact.
- If it mostly holds descriptions (name, type, sector, category) → dimension.
### Example (stocks domain)

- Dimension: `dim_stock` (ticker, company name, sector, currency, IPO date, …)
- Fact: `fact_price_daily` (ticker_key, date_key, open, high, low, close, volume)
### Nuances
- Factless fact tables: no measures, just the occurrence of an event (e.g., a student attended a class).
- Bridge tables: handle many‑to‑many between facts and dimensions (e.g., trade ↔ multiple brokers).
- Degenerate dimensions: IDs sitting in the fact table itself (e.g., transaction_id as a textual attribute).
## Data Warehouse – Star Schema
- Combines fact and dimension tables into a star schema optimized for analytics, slicing/dicing, and time-series comparison.
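As a usage sketch, a typical slice-and-dice query joins the fact table to a dimension and aggregates; the column names below (`sector`, `totalRevenue`) are assumptions about the gold-layer schema.

```python
import pandas as pd

dim_stock = pd.read_parquet("transformed/DIM_STOCK")
facts = pd.read_parquet("transformed/FACT_QTR_FINANCIALS")

# Fact rows join to the dimension on the symbol key, then roll up by sector.
sector_revenue = (
    facts.merge(dim_stock, on="symbol")
         .groupby(["sector", "fiscalDateEnding"])["totalRevenue"]
         .sum()
         .reset_index()
)
```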
## Downstream Consumption
- Analytics & Dashboards: Power BI, Tableau, or custom Streamlit apps use this data for executive dashboards and KPI tracking.
- ML Feature Store: Provides cleaned, ready-to-use features to machine learning models (e.g., for scoring or prediction).
- Export to External Consumers: Enables pushing data to clients, partners, or reporting systems (e.g., CSVs, APIs, S3, etc.).
## Data Warehouse Build Checklist
Here’s a checklist you can use to audit and improve your process:
### Ingestion & Raw Storage
- Ingests all raw API data in original schema
- Stores raw data in append-only, immutable format
- Captures source metadata (e.g. filing date, ticker, quarter)
### Validation & Data Quality
- Performs schema validation (using Pandera, Pydantic, or similar)
- Logs validation errors separately
- Tracks % completeness, timeliness, and type coercions
- Version-controls schema (especially for API changes)
### Silver Layer (Cleaned Data)
- Applies typing, null handling, standard formatting
- Retains data in source-native structure but cleaned
- Adds metadata (timestamps, validation results, source ID)
### Transformation & Enrichment
- Derives key ratios and standardized metrics (e.g., ROE, EPS)
- Handles currency normalization (if needed)
- Adds industry/sector tagging from a reference source
- Tracks data lineage (what inputs produced which outputs)
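Lineage tracking can be as light as an append-only registry. A minimal sketch, assuming a JSON-lines file as the metadata registry; the path and field names are illustrative.

```python
import json
from datetime import datetime, timezone

REGISTRY = "transformed/_metadata_registry.jsonl"  # assumed location

def record_lineage(output_table: str, inputs: list[str], transform: str) -> None:
    """Append one lineage entry: which inputs produced which output, and when."""
    entry = {
        "output": output_table,
        "inputs": inputs,
        "transform": transform,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(REGISTRY, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

# record_lineage("FACT_QTR_FINANCIALS",
#                ["validated/BALANCE_SHEET", "validated/INCOME_STATEMENT"],
#                "qtr_financials_v1")
```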
### Gold Layer & Dimensional Modeling

- Fact tables contain fully normalized, comparable metrics
- Dimension tables track stock metadata with SCD2 history (see the sketch after this list)
- All tables have surrogate keys and audit timestamps
- Fact/dimension relationships conform to star schema
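A minimal sketch of SCD2 upkeep for `DIM_STOCK`, assuming `valid_from`/`valid_to`/`is_current` audit columns; the tracked attribute names are illustrative, and brand-new symbols are ignored to keep the example short.

```python
import pandas as pd

HIGH_DATE = pd.Timestamp("9999-12-31")

def scd2_apply(dim: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Expire changed dimension rows and append new versions (SCD2)."""
    now = pd.Timestamp.now()
    tracked = ["company_name", "sector"]  # assumed SCD2-tracked attributes
    current = dim[dim["is_current"]]
    # Find symbols whose tracked attributes differ from the latest version.
    cmp = current.merge(incoming, on="symbol", suffixes=("_old", ""))
    changed = cmp.loc[
        (cmp[[c + "_old" for c in tracked]].to_numpy()
         != cmp[tracked].to_numpy()).any(axis=1),
        "symbol",
    ]
    # Close out the superseded versions...
    expire = dim["symbol"].isin(changed) & dim["is_current"]
    dim.loc[expire, ["valid_to", "is_current"]] = [now, False]
    # ...and append the new versions with open-ended validity.
    new_rows = incoming[incoming["symbol"].isin(changed)].assign(
        valid_from=now, valid_to=HIGH_DATE, is_current=True
    )
    return pd.concat([dim, new_rows], ignore_index=True)
```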
### Pipeline Architecture
- All services are modular and composable
- Pipeline orchestrated with DAG tool (Airflow, Dagster, Prefect)
- Logs and metrics for each stage are centrally collected
- Retry logic and alerting in place for failed steps
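A sketch of what the Prefect orchestration with retries could look like; the task bodies are placeholders for the real acquisition, validation, and transformation services.

```python
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def acquire(symbol: str) -> str:
    # Placeholder: pull raw API data and write it to the bronze layer.
    return f"acquisition/{symbol}"

@task(retries=2, retry_delay_seconds=30)
def validate(bronze_path: str) -> str:
    # Placeholder: schema checks and typing; write to the silver layer.
    return bronze_path.replace("acquisition", "validated")

@task
def transform(silver_path: str) -> str:
    # Placeholder: enrichment and ratios; write gold-layer fact/dim tables.
    return silver_path.replace("validated", "transformed")

@flow(log_prints=True)
def quarterly_pipeline(symbols: list[str]) -> None:
    # Failed tasks retry automatically; Prefect centralizes logs per stage.
    for symbol in symbols:
        transform(validate(acquire(symbol)))

# quarterly_pipeline(["MSFT", "AAPL"])
```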
### Downstream Readiness
- Tables are partitioned and indexed for fast queries
- Supports both batch and real-time consumption
- Has snapshots or slowly changing dimensions to support time travel
- Feature store integration (optional but powerful for ML use)