Data Warehouse Fundamentals - ayaohsu/Personal-Resources GitHub Wiki
Data Warehousing Concepts
Data warehouse: a warehouse filled with data!
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. It is designed to support business intelligence activities.
So the database is the platform, and the data warehouse is one use of that platform: a data warehouse is built on top of a database.
Concepts by Bill Inmon (~1990)
- Integrated: Data is from a number of sources
- Subject oriented: Regardless of sources, it should be organized by subject
- Time variant: It contains historical data
- Non-volatile: The data warehouse remains stable between refreshes
Reasons for data warehousing
- Making data-driven decisions
- One stop shopping (since data is scattered all over the place)
Data Warehouse vs Data Lake
A data warehouse is built upon an RDBMS, whereas a data lake is built upon a big data environment. Some people view data lakes as the next-generation data warehouse.
- Volume: data lake contains much more volume
- Velocity: data lake supports more rapid data intake
- Variety: data warehouse supports structured data, but big data supports both structured and unstructured data
Data Warehousing vs Data Virtualization
Data virtualization is essentially a read-only distributed DBMS. It accesses data in place, unlike a data warehouse.
End-to-End DW Environment
data sources ---> ETL (Extract/Transform/Load) ---> data warehouse ---> ETL ---> Data Marts
An analogy: Suppliers -> Wholesaler -> Retailers
Data Warehousing Architecture
Centralized Data Warehouse
- Single database
- One stop shopping
One challenge is that such a highly centralized architecture requires a high degree of cross-organization cooperation and data governance.
Component-Based Data Warehouse
- Decomposition
- Mix-and-match technology
- Bolt together components
- Overcome org. challenges
Challenges: data is often inconsistent across components and difficult to cross-integrate
"Cube"
Cube = Multidimensional database (MDBMS)
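A cube can be pictured as facts indexed by several dimensions, which can then be aggregated ("rolled up") along any dimension. A minimal sketch in plain Python; the dimensions and figures are invented for illustration:

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (product, region, quarter) dimensions
sales = {
    ("widget", "NA", "Q1"): 100,
    ("widget", "EU", "Q1"): 80,
    ("gadget", "NA", "Q1"): 50,
    ("widget", "NA", "Q2"): 120,
}

def rollup(cube, dim_index):
    """Aggregate the cube along one dimension (a basic OLAP roll-up)."""
    totals = defaultdict(int)
    for dims, value in cube.items():
        key = dims[:dim_index] + dims[dim_index + 1:]  # drop one dimension
        totals[key] += value
    return dict(totals)

# Roll up over quarters: total sales per (product, region)
by_product_region = rollup(sales, 2)
```

A real MDBMS precomputes and stores such aggregations so they can be queried instantly.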
Operational Data Store
An ODS integrates data from multiple sources, but its emphasis is on current operational data, and it is often fed in (near) real time.
"Tell me what is happening right now"
Strategic decision making vs operational decision making
Two options: parallel ETL pipeline or treat ODS as a stage (and feed to DW)
However, the ODS is less popular now, since modern data warehouses are faster and data lakes exist.
Best Value
Business Intelligence + {Data Warehousing and/or Data Lakes and/or Data Virtualization and/or Operational Data Store} = BEST VALUE
Staging Layer
- Staging Layer: The extraction part (1-to-1 mapping from the source)
  - Non-persistent: empty -> extract -> load (to the user access layer) -> empty
  - Persistent: data remains in staging after loading to the user access layer
- User Access Layer (UAL): the layer users query for BI and analytics
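The difference between the two staging styles can be sketched in a few lines of Python (all names are illustrative):

```python
# Sketch of non-persistent vs persistent staging; names are hypothetical.
staging = []          # staging layer (landing zone, 1-to-1 with the source)
user_access = []      # user access layer (UAL)

def extract(source_rows):
    staging.extend(source_rows)      # raw, untransformed copy of the source

def load(persistent=False):
    user_access.extend(staging)      # move data into the UAL
    if not persistent:
        staging.clear()              # non-persistent: emptied after each load

extract([{"id": 1}, {"id": 2}])
load(persistent=False)               # staging is empty again afterwards
```

With `persistent=True`, the staging copy survives each refresh, which keeps a raw history at the cost of extra storage.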
Bring Data Into Your Data Warehouse
Extract:
- Quickly pull data from source applications
- Traditionally done in "batches"
- Raw data
- Land in staging layer
Transform:
- Apples to apples: make data from different sources comparable
- Prepare for uniform data in user access layer
- Can be very complex
Load:
- Store uniform data in user access layer
Challenges: significant business analysis/data modeling BEFORE storing data
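The three steps above can be sketched as one small batch pipeline; the source rows and transformation rules below are invented for illustration:

```python
# Minimal batch ETL sketch; source data and rules are hypothetical.

def extract():
    # Extract: pull raw rows from a source application (hard-coded here)
    return [
        {"name": "alice", "height": "5.9ft"},
        {"name": "BOB",   "height": "175cm"},
    ]

def transform(rows):
    # Transform: "apples to apples" -- unify case and convert heights to cm
    out = []
    for row in rows:
        h = row["height"]
        cm = float(h[:-2]) * 30.48 if h.endswith("ft") else float(h[:-2])
        out.append({"name": row["name"].lower(), "height_cm": round(cm, 1)})
    return out

def load(rows, warehouse):
    # Load: store the uniform rows in the user access layer
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

In practice each step runs against real source systems and a real database, but the shape of the pipeline is the same.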
ELT (compared with ETL)
- Blast data into big data environment
- Use big data environment computing power to transform when needed
Initial Load ETL
- Normally one time only, right before the data warehouse goes live
- Bring in all relevant data necessary for BI and analytics:
  - data definitely needed for BI and analytics
  - data probably needed for BI and analytics
  - historical data
Incremental ETL
- Incrementally "refreshes" the data warehouse
- New data: employees, etc.
- Modified data: employee promotions, etc.
- Special handling for deleted data: customer drops from a subscription plan, etc.
Four major incremental ETL patterns:
- Append
- In-place update
- Complete replacement
- Rolling append
Incremental ETL today mostly uses append and in-place update.
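Those two common patterns can be sketched as operations on a keyed table (the employee schema and data are invented):

```python
# Append vs in-place update (upsert) on a keyed employee table; data invented.
warehouse = {}   # emp_id -> row

def upsert(rows):
    """New keys are appended; existing keys are updated in place."""
    for row in rows:
        warehouse[row["emp_id"]] = row

upsert([{"emp_id": 1, "title": "engineer"}])
upsert([{"emp_id": 1, "title": "senior engineer"},   # modified data (promotion)
        {"emp_id": 2, "title": "analyst"}])          # new data (new employee)
```

Complete replacement would instead rebuild the whole table each refresh, and rolling append would append while dropping the oldest rows past a retention window.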
Role of Data Transformation
- Uniformity
- Restructuring
Common data transformation models:
- Data value unification (ex: cm, ft -> cm)
- Data type and size unification (e.g., unify char(3) and char(20) columns into one type and size)
- De-duplication (ensure no double counting)
- Vertical slicing (dropping columns)
- Horizontal slicing (value-based row filtering)
- Correcting known errors (correcting invalid values)
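Several of these models can be shown in one short sketch; the sample rows and conversion rules are invented for illustration:

```python
# Illustrates value unification, de-duplication, slicing, and error handling.
raw = [
    {"id": 1, "height": "5.9ft", "note": "x"},
    {"id": 1, "height": "5.9ft", "note": "x"},   # duplicate row
    {"id": 2, "height": "175cm", "note": "y"},
    {"id": 3, "height": "???",   "note": "z"},   # known-bad value
]

def to_cm(value):
    # Data value unification: express every height in centimeters
    if value.endswith("ft"):
        return round(float(value[:-2]) * 30.48, 1)
    if value.endswith("cm"):
        return float(value[:-2])
    return None   # correcting known errors: flag invalid values

seen, clean = set(), []
for row in raw:
    if row["id"] in seen:
        continue                  # de-duplication: no double counting
    seen.add(row["id"])
    cm = to_cm(row["height"])
    if cm is None:
        continue                  # horizontal slicing: drop bad rows
    clean.append({"id": row["id"], "height_cm": cm})  # vertical slicing: no "note"
```

Real transformations run at scale inside the ETL/ELT engine, but each rule reduces to a small mapping or filter like these.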