Data Warehouse Design - ayaohsu/Personal-Resources GitHub Wiki

Data Warehousing Design

BI Category Data Model
Basic reporting Dimensional
Online analytical processing (OLAP) Dimensional
Predictive analytics Data mining/specialized
Exploratory analytics Data mining/specialized

Principles of Dimensionality

Measurement (avg salary) + Context (group or filter) -> data-driven decisions Or, "Fact" + "Dimension"

By: "sliced and grouped" by values of the entire dimension
For: One or more specific values from within the entire dimension

Star Schema And Snowflake Schema

Star schema: all dimensions along a given hierarchy in one dimension table (ex: faculty - department - college, in one table)
Snowflake schema: Each dimension in its own table. One table for each level of a hierarchy.

Natural Key vs Surrogate Key

Natural keys "travel" from source systems with the rest of the data
Best practice: Use surrogate keys in data warehousing, generated by the database itself (or a supplemental "key management" system). Reason: DW remains immune to operational change by not using natural key with business meanings.

Dimensional Modeling

4 Types of Fact Tables

  • Transaction fact tables
  • Periodic snapshot fact tables
  • Accumulating snapshot fact tables
  • Factless fact tables
    • Recording occurrences
    • Recording relationships

2 or more facts can be stored in the same fact table if:

  1. Facts available at the same grain (level of detail)
  2. Facts occur simultaneously

Ex: Tuition bill and tuition payment cannot be put together (not simultaneously, different business processes)
Tuition billed amount and activities fees billed amount can be put together.

Primary key of a fact tables is the combination of all foreign keys relating back to dimension tables

Slowly Changing Dimensions

Techniques to manage history within data warehouse

Three main policies for historical data

  • Overwrite old data; no history retention (type 1) - overwrite
  • Maintain unlimited history (type 2) - new row
  • Maintain limited history (type 3) - new column

It is not uncommon to mix multiple slowly changing dimension techniques within the same dimension. When type 1 and type 2 are both used in a dimension, sometimes a type 1 attribute change necessitates updating multiple dimension rows. --- Kimball "The Data Warehouse Toolkit"

ETL Design

Best Practices

  • All possible operational data from all sources --> "Change Data Capture" --> New and modified data for DW
  • Process dimension tables before fact tables --> so the foreign keys exist before fact table process
  • Opportunities for parallel processing

Design decisions

  • SCD (type 1/2/3)
  • Star/Snowflake schema
  • Append/In-Place Update/Complete Replacement/Rolling Append
  • Fact table type (transaction/periodic snapshots/accumulating snapshots/factless)

"Change Data Capture" techniques

  • Transactional data timestamps
  • Database logs
  • Last resort: database scan-and-compare

Dimension Table Incremental ETL
Step 1: data preparation
Step 2: data transformation
Step 3: process new dimension rows
Step 4: process SCD type 1 changes
Step 5: process SCD type 2 changes