Data Warehouse Design - ayaohsu/Personal-Resources GitHub Wiki

Data Warehousing Design

BI Category	Data Model
Basic reporting	Dimensional
Online analytical processing (OLAP)	Dimensional
Predictive analytics	Data mining/specialized
Exploratory analytics	Data mining/specialized

Principles of Dimensionality

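Measurement (e.g., average salary) + context (how it is grouped or filtered) -> data-driven decisions. In dimensional terms: "Fact" + "Dimension".

By: the measurement is sliced and grouped by every value of the dimension (e.g., average salary by department)
For: the measurement is restricted to one or more specific values within the dimension (e.g., average salary for one department)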
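
The by/for distinction can be sketched in a few lines of Python. The rows, field names, and helper functions here are hypothetical, chosen only to illustrate grouping versus filtering.

```python
from collections import defaultdict

# Hypothetical rows: a measurement (salary) plus one dimension value (department).
salaries = [
    {"department": "Physics", "salary": 90000},
    {"department": "Physics", "salary": 110000},
    {"department": "History", "salary": 70000},
    {"department": "History", "salary": 80000},
]


def average_by(rows, dimension, measure):
    """'By': group on every value of the dimension, then average the measure."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {value: sum(vals) / len(vals) for value, vals in groups.items()}


def average_for(rows, dimension, wanted, measure):
    """'For': keep only rows whose dimension value is in `wanted`, then average."""
    vals = [row[measure] for row in rows if row[dimension] in wanted]
    return sum(vals) / len(vals)


avg_by_department = average_by(salaries, "department", "salary")
avg_for_physics = average_for(salaries, "department", {"Physics"}, "salary")
```

"By" returns one number per dimension value, while "for" returns a single number for the selected values.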

Star Schema And Snowflake Schema

Star schema: all levels of a given hierarchy are stored in one dimension table (ex: faculty - department - college, in one table)
Snowflake schema: each level of a hierarchy is stored in its own dimension table, with the tables linked by keys.
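
A minimal sketch of the two layouts for the faculty - department - college hierarchy, using Python's built-in sqlite3 module. All table names, column names, and sample values are assumptions made for illustration.

```python
import sqlite3

schema_conn = sqlite3.connect(":memory:")
schema_conn.executescript("""
-- Star: the whole faculty -> department -> college hierarchy in one table.
CREATE TABLE star_faculty_dim (
    faculty_key INTEGER PRIMARY KEY,
    faculty_name TEXT,
    department_name TEXT,
    college_name TEXT
);
INSERT INTO star_faculty_dim VALUES (1, 'Dr. Lee', 'Physics', 'Science');

-- Snowflake: one table per hierarchy level, linked by keys.
CREATE TABLE snow_college_dim (
    college_key INTEGER PRIMARY KEY,
    college_name TEXT
);
CREATE TABLE snow_department_dim (
    department_key INTEGER PRIMARY KEY,
    department_name TEXT,
    college_key INTEGER REFERENCES snow_college_dim(college_key)
);
CREATE TABLE snow_faculty_dim (
    faculty_key INTEGER PRIMARY KEY,
    faculty_name TEXT,
    department_key INTEGER REFERENCES snow_department_dim(department_key)
);
INSERT INTO snow_college_dim VALUES (1, 'Science');
INSERT INTO snow_department_dim VALUES (1, 'Physics', 1);
INSERT INTO snow_faculty_dim VALUES (1, 'Dr. Lee', 1);
""")

# Star: the hierarchy is read from a single table, no joins needed.
star_row = schema_conn.execute(
    "SELECT faculty_name, department_name, college_name FROM star_faculty_dim"
).fetchone()

# Snowflake: the same answer requires one join per hierarchy level.
snowflake_row = schema_conn.execute("""
    SELECT f.faculty_name, d.department_name, c.college_name
    FROM snow_faculty_dim f
    JOIN snow_department_dim d ON d.department_key = f.department_key
    JOIN snow_college_dim c ON c.college_key = d.college_key
""").fetchone()
```

Both queries return the same row. The star layout repeats department and college names on every faculty row, while the snowflake layout stores each name once and pays for it with joins.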

Natural Key vs Surrogate Key

Natural keys "travel" from source systems with the rest of the data
Best practice: use surrogate keys in the data warehouse, generated by the database itself (or a supplemental "key management" system). Reason: because surrogate keys carry no business meaning, the DW stays immune to changes in the operational systems.
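
As a rough illustration of a "key management" helper, the sketch below maps each (source system, natural key) pair to a meaningless integer. The class name, method name, and sample identifiers are hypothetical. In practice a database sequence or identity column often plays this role.

```python
import itertools


class SurrogateKeyMap:
    """Hypothetical helper: maps natural keys to warehouse-generated integers."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        # The same natural key can appear in several source systems,
        # so the lookup uses the pair rather than the natural key alone.
        lookup = (source_system, natural_key)
        if lookup not in self._keys:
            self._keys[lookup] = next(self._next_key)
        return self._keys[lookup]


key_map = SurrogateKeyMap()
first = key_map.key_for("registrar", "STU-1001")
again = key_map.key_for("registrar", "STU-1001")  # same pair, same key
other = key_map.key_for("housing", "STU-1001")    # different source, new key
```

If a source system later reformats its identifiers, only this mapping has to change. The keys stored in fact tables stay the same.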

Dimensional Modeling

4 Types of Fact Tables

Transaction fact tables
Periodic snapshot fact tables
Accumulating snapshot fact tables
Factless fact tables
- Recording occurrences
- Recording relationships

2 or more facts can be stored in the same fact table if:

Facts available at the same grain (level of detail)
Facts occur simultaneously

Ex: Tuition billed amount and tuition payment amount cannot be stored together (they do not occur simultaneously and belong to different business processes).
Tuition billed amount and activity fees billed amount can be stored together.

The primary key of a fact table is typically the combination of all foreign keys relating back to the dimension tables
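
A small sqlite3 sketch of a fact table that holds two facts at the same grain and uses the combination of its dimension foreign keys as the primary key. The table name, column names, and values are assumptions for illustration.

```python
import sqlite3

fact_conn = sqlite3.connect(":memory:")
fact_conn.execute("""
    CREATE TABLE billing_fact (
        student_key INTEGER NOT NULL,       -- FK to a student dimension
        term_key INTEGER NOT NULL,          -- FK to a term dimension
        billing_date_key INTEGER NOT NULL,  -- FK to a date dimension
        tuition_billed_amount REAL,         -- fact 1
        activity_fees_billed_amount REAL,   -- fact 2, same grain, same event
        PRIMARY KEY (student_key, term_key, billing_date_key)
    )
""")
fact_conn.execute(
    "INSERT INTO billing_fact VALUES (1, 202401, 20240115, 5000.0, 150.0)"
)

# A second row with the same combination of foreign keys violates the key.
try:
    fact_conn.execute(
        "INSERT INTO billing_fact VALUES (1, 202401, 20240115, 5000.0, 150.0)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = fact_conn.execute("SELECT COUNT(*) FROM billing_fact").fetchone()[0]
```

Tuition payments would go in a separate fact table, since they occur at a different time and come from a different business process.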

Slowly Changing Dimensions

Techniques to manage history within data warehouse

Three main policies for historical data

Type 1 (overwrite): overwrite old data; no history is retained
Type 2 (new row): maintain unlimited history by adding a new row for each change
Type 3 (new column): maintain limited history by keeping the prior value in an extra column

It is not uncommon to mix multiple slowly changing dimension techniques within the same dimension. When type 1 and type 2 are both used in a dimension, sometimes a type 1 attribute change necessitates updating multiple dimension rows. --- Kimball "The Data Warehouse Toolkit"
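
The three policies can be sketched on plain Python dicts. The row layout (surrogate_key, natural_key, start_date, end_date, is_current) and the function names are assumptions, not a fixed standard. The example also shows the mixed case from the quote: a type 1 change applied after a type 2 change has to update more than one row.

```python
from datetime import date


def apply_type1(rows, natural_key, attribute, new_value):
    """Type 1: overwrite on every row for the natural key; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attribute] = new_value


def apply_type2(rows, natural_key, attribute, new_value, effective):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(
        r for r in rows if r["natural_key"] == natural_key and r["is_current"]
    )
    current["is_current"] = False
    current["end_date"] = effective
    new_row = dict(current, **{attribute: new_value})
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in rows) + 1,
        start_date=effective,
        end_date=None,
        is_current=True,
    )
    rows.append(new_row)


def apply_type3(row, attribute, new_value):
    """Type 3: keep only the previous value, in a 'previous_<attribute>' column."""
    row["previous_" + attribute] = row[attribute]
    row[attribute] = new_value


faculty_dim = [{
    "surrogate_key": 1, "natural_key": "F-7", "name": "Lee",
    "department": "Physics", "start_date": date(2020, 1, 1),
    "end_date": None, "is_current": True,
}]
apply_type2(faculty_dim, "F-7", "department", "Chemistry", date(2024, 9, 1))
apply_type1(faculty_dim, "F-7", "name", "Lee-Park")  # touches both versions

type3_row = {"natural_key": "F-7", "department": "Physics"}
apply_type3(type3_row, "department", "Chemistry")
```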

ETL Design

Best Practices

All possible operational data from all sources --> "Change Data Capture" --> New and modified data for DW
Process dimension tables before fact tables, so that the foreign keys exist before the fact tables are processed
Look for opportunities for parallel processing
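
A minimal sketch of why dimensions are loaded first: the fact load swaps each natural key for a surrogate key, which only works once the dimension row exists. The function names, field names, and sample data are hypothetical.

```python
def load_dimension(source_rows, key_lookup):
    """Assign a surrogate key to each natural key not seen before."""
    for row in source_rows:
        key_lookup.setdefault(row["student_id"], len(key_lookup) + 1)


def load_facts(source_rows, key_lookup):
    """Swap each natural key for its surrogate key; fails if the dimension row is missing."""
    return [
        {"student_key": key_lookup[row["student_id"]], "amount": row["amount"]}
        for row in source_rows
    ]


students = [{"student_id": "S-1"}, {"student_id": "S-2"}]
payments = [{"student_id": "S-2", "amount": 400.0}]

student_keys = {}

# Wrong order: facts first, so the surrogate key lookup has nothing to find.
try:
    load_facts(payments, student_keys)
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True

# Right order: dimension first, then facts.
load_dimension(students, student_keys)
fact_rows = load_facts(payments, student_keys)
```

Dimension tables that do not depend on each other are natural candidates for parallel loading, as long as they all finish before the fact load starts.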

Design decisions

SCD (type 1/2/3)
Star/Snowflake schema
Load pattern (append/in-place update/complete replacement/rolling append)
Fact table type (transaction/periodic snapshots/accumulating snapshots/factless)

"Change Data Capture" techniques

Transactional data timestamps
Database logs
Last resort: database scan-and-compare
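
Two of these techniques can be sketched in plain Python: timestamp filtering and scan-and-compare. Log-based capture depends on the specific database and is not shown. The function names, field names, and sample data are assumptions for illustration.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Timestamp-based CDC: keep rows modified after the previous ETL run."""
    return [r for r in rows if r["updated_at"] > last_run]


def scan_and_compare(previous, current):
    """Last-resort CDC: diff two full snapshots keyed by natural key."""
    new = [k for k in current if k not in previous]
    modified = [k for k in current if k in previous and current[k] != previous[k]]
    deleted = [k for k in previous if k not in current]
    return new, modified, deleted


orders = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1, 8, 0)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 2, 9, 30)},
]
recent = changed_since(orders, last_run=datetime(2024, 1, 2, 0, 0))

yesterday = {"S-1": "Physics", "S-2": "History", "S-3": "Math"}
today = {"S-1": "Physics", "S-2": "Chemistry", "S-4": "Art"}
new_keys, modified_keys, deleted_keys = scan_and_compare(yesterday, today)
```

Scan-and-compare needs both full snapshots, which is why it is the last resort. The timestamp approach only works if the source reliably maintains a modification timestamp.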

Dimension Table Incremental ETL

Step 1: data preparation
Step 2: data transformation
Step 3: process new dimension rows
Step 4: process SCD type 1 changes
Step 5: process SCD type 2 changes
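
The five steps can be sketched end to end on in-memory rows. This is one possible interpretation rather than a prescribed implementation. The cleaning rule, the field names, the split of attributes into type 1 and type 2, and the assumption of at most one staged row per natural key are all hypothetical.

```python
from datetime import date

TYPE1_ATTRS = ("email",)       # assumed: overwrite, no history
TYPE2_ATTRS = ("department",)  # assumed: new row per change


def prepare(source_rows):
    """Step 1: drop rows without a natural key (assumed cleaning rule)."""
    return [r for r in source_rows if r.get("faculty_id")]


def transform(rows):
    """Step 2: rename and normalize source fields into dimension attributes."""
    return [
        {"natural_key": r["faculty_id"].strip(),
         "email": r["email"].strip().lower(),
         "department": r["dept"].strip()}
        for r in rows
    ]


def load(dim, staged, load_date):
    next_key = max((d["surrogate_key"] for d in dim), default=0) + 1
    known = {d["natural_key"] for d in dim}

    def new_version(row, key):
        return dict(row, surrogate_key=key, start_date=load_date,
                    end_date=None, is_current=True)

    # Step 3: natural keys never seen before become new dimension rows.
    for s in staged:
        if s["natural_key"] not in known:
            dim.append(new_version(s, next_key))
            next_key += 1
    existing = [s for s in staged if s["natural_key"] in known]

    # Step 4: type 1 changes overwrite every version of the member.
    for s in existing:
        for d in dim:
            if d["natural_key"] == s["natural_key"]:
                for attr in TYPE1_ATTRS:
                    d[attr] = s[attr]

    # Step 5: type 2 changes expire the current row and add a new version.
    for s in existing:
        current = next(d for d in dim
                       if d["natural_key"] == s["natural_key"] and d["is_current"])
        if any(current[a] != s[a] for a in TYPE2_ATTRS):
            current["is_current"] = False
            current["end_date"] = load_date
            dim.append(new_version(s, next_key))
            next_key += 1
    return dim


etl_dim = [{"surrogate_key": 1, "natural_key": "F-1", "email": "lee@old.edu",
            "department": "Physics", "start_date": date(2020, 1, 1),
            "end_date": None, "is_current": True}]
etl_source = [
    {"faculty_id": "F-1 ", "email": "LEE@new.edu", "dept": "Chemistry"},
    {"faculty_id": "F-2", "email": "kim@u.edu", "dept": "History"},
    {"faculty_id": "", "email": "x", "dept": "y"},
]
etl_result = load(etl_dim, transform(prepare(etl_source)), date(2024, 9, 1))
```

In this run F-2 is inserted as a new member, the email change for F-1 is overwritten on its existing row, and the department change for F-1 expires that row and adds a new current version.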