Data Warehouse Design - ayaohsu/Personal-Resources GitHub Wiki
Data Warehousing Design
BI Category | Data Model |
---|---|
Basic reporting | Dimensional |
Online analytical processing (OLAP) | Dimensional |
Predictive analytics | Data mining/specialized |
Exploratory analytics | Data mining/specialized |
Principles of Dimensionality
Measurement (avg salary) + Context (group or filter) -> data-driven decisions Or, "Fact" + "Dimension"
By: "sliced and grouped" by values of the entire dimension
For: One or more specific values from within the entire dimension
Star Schema And Snowflake Schema
Star schema: all dimensions along a given hierarchy in one dimension table (ex: faculty - department - college, in one table)
Snowflake schema: Each dimension in its own table. One table for each level of a hierarchy.
Natural Key vs Surrogate Key
Natural keys "travel" from source systems with the rest of the data
Best practice: Use surrogate keys in data warehousing, generated by the database itself (or a supplemental "key management" system). Reason: DW remains immune to operational change by not using natural key with business meanings.
Dimensional Modeling
4 Types of Fact Tables
- Transaction fact tables
- Periodic snapshot fact tables
- Accumulating snapshot fact tables
- Factless fact tables
- Recording occurrences
- Recording relationships
2 or more facts can be stored in the same fact table if:
- Facts available at the same grain (level of detail)
- Facts occur simultaneously
Ex:
Tuition bill and tuition payment cannot be put together (not simultaneously, different business processes)
Tuition billed amount and activities fees billed amount can be put together.
Primary key of a fact tables is the combination of all foreign keys relating back to dimension tables
Slowly Changing Dimensions
Techniques to manage history within data warehouse
Three main policies for historical data
- Overwrite old data; no history retention (type 1) - overwrite
- Maintain unlimited history (type 2) - new row
- Maintain limited history (type 3) - new column
It is not uncommon to mix multiple slowly changing dimension techniques within the same dimension. When type 1 and type 2 are both used in a dimension, sometimes a type 1 attribute change necessitates updating multiple dimension rows. --- Kimball "The Data Warehouse Toolkit"
ETL Design
Best Practices
- All possible operational data from all sources --> "Change Data Capture" --> New and modified data for DW
- Process dimension tables before fact tables --> so the foreign keys exist before fact table process
- Opportunities for parallel processing
Design decisions
- SCD (type 1/2/3)
- Star/Snowflake schema
- Append/In-Place Update/Complete Replacement/Rolling Append
- Fact table type (transaction/periodic snapshots/accumulating snapshots/factless)
"Change Data Capture" techniques
- Transactional data timestamps
- Database logs
- Last resort: database scan-and-compare
Dimension Table Incremental ETL
Step 1: data preparation
Step 2: data transformation
Step 3: process new dimension rows
Step 4: process SCD type 1 changes
Step 5: process SCD type 2 changes