Data Warehouse Fundamentals - ayaohsu/Personal-Resources GitHub Wiki

Data Warehousing Concepts

Data warehouse: a warehouse filled with data!

A data warehouse is a central repository of information that can be analyzed to make more informed decisions. It is designed to support business intelligence activities.
So the database is the platform and the data warehouse is the usage: a data warehouse is built on top of a database.

Concepts by Bill Inmon (~1990)

  • Integrated: data comes from a number of sources
  • Subject-oriented: regardless of source, data is organized by subject
  • Time-variant: it contains historical data
  • Non-volatile: the data warehouse remains stable between refreshes

Reasons for data warehousing

  1. Making data-driven decisions
  2. One stop shopping (since data is scattered all over the place)

Data Warehouse vs Data Lake

A data warehouse is built on an RDBMS, whereas a data lake is built on a big data environment. Some people view data lakes as the next-generation data warehouse.

  1. Volume: a data lake holds much more data
  2. Velocity: a data lake supports more rapid data intake
  3. Variety: a data warehouse supports structured data, while a data lake supports both structured and unstructured data

Data Warehousing vs Data Virtualization

Data virtualization is essentially a read-only distributed DBMS. It provides in-place data access, unlike a data warehouse, which copies data into a central store.

End-to-End DW Environment

data sources ---> ETL (Extract/Transform/Load) ---> data warehouse ---> ETL ---> data marts
(analogous to Suppliers -> Wholesaler -> Retailers)
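The end-to-end flow above can be sketched in a few lines of Python. This is a toy, in-memory illustration; the function names (`extract_orders`, `build_data_mart`) and the sample rows are invented for the example.

```python
# Toy sketch of: sources -> ETL -> warehouse -> data marts.
# All names and data here are illustrative, not a real API.

def extract_orders():
    # Extract: pull raw rows from a source application (hard-coded here)
    return [
        {"order_id": 1, "amount": "19.99", "region": "us-east"},
        {"order_id": 2, "amount": "5.00", "region": "eu-west"},
    ]

def transform(rows):
    # Transform: cast types so data is uniform ("apples to apples")
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    # Load: store the uniform rows in the warehouse's user access layer
    warehouse.extend(rows)

def build_data_mart(warehouse, region):
    # A data mart is a subject- or region-specific slice of the warehouse
    return [r for r in warehouse if r["region"] == region]

warehouse = []
load(warehouse, transform(extract_orders()))
us_mart = build_data_mart(warehouse, "us-east")
```

In a real environment each arrow would be a scheduled batch job or streaming pipeline rather than a function call, but the division of responsibilities is the same.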

Data Warehousing Architecture

Centralized Data Warehouse

  • Single database
  • One stop shopping
    One challenge: such a highly centralized architecture demands a high degree of cross-organization cooperation and data governance.

Component-Based Data Warehouse

  • Decomposition
  • Mix-and-match technology
  • Bolt together components
  • Overcome org. challenges
    Challenges: data is often inconsistent and difficult to cross-integrate

"Cube"

Cube = Multidimensional database (MDBMS)
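The idea behind a cube is that measures are pre-aggregated along dimensions (e.g. product, region, quarter). A minimal sketch of that roll-up, with made-up facts and dimension names:

```python
from collections import defaultdict

# Toy "cube": facts are (product, region, quarter, sales); all data invented.
facts = [
    ("widget", "us", "Q1", 100),
    ("widget", "eu", "Q1", 80),
    ("gadget", "us", "Q2", 50),
]

def rollup(facts, *dims):
    """Aggregate the sales measure over the chosen dimension indices."""
    out = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)  # the cell of the cube
        out[key] += row[3]                 # sum the measure into that cell
    return dict(out)

by_product = rollup(facts, 0)          # roll up over region and quarter
by_region_quarter = rollup(facts, 1, 2)
```

A real MDBMS stores and indexes these cells natively so any such roll-up is a fast lookup rather than a scan.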

Operational Data Store

An ODS integrates data from multiple sources, but its emphasis is on current operational data, and it is often fed in (near) real time.
"Tell me what is happening right now"
Strategic decision making vs operational decision making
Two options: run a parallel ETL pipeline, or treat the ODS as a staging area that feeds the DW
However, the ODS is less popular now, since modern data warehouses are faster and data lakes fill a similar role

Best Value

Business Intelligence + {Data Warehousing and/or Data Lakes and/or Data Virtualization and/or Operational Data Store} = BEST VALUE

Staging Layer

  1. Staging Layer: the extraction target (a 1-to-1 mapping from the source)
  • Non-persistent: empty -> extract -> load (to the user access layer, or UAL) -> empty again
  • Persistent: data is not cleaned up after loading to the UAL
  2. User Access Layer: the layer that users and BI tools actually query
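The two staging policies above differ only in what happens after the load step. A minimal sketch, with all names (`run_etl`, `staging`, `ual`) invented for illustration:

```python
# Sketch of non-persistent vs persistent staging; names are illustrative.

def run_etl(source_rows, staging, ual, persistent=False):
    staging.clear()                       # staging starts empty
    staging.extend(source_rows)           # extract: 1-to-1 copy of the source
    ual.extend({**r} for r in staging)    # load to the user access layer (UAL)
    if not persistent:
        staging.clear()                   # non-persistent: empty again after load

staging, ual = [], []
run_etl([{"id": 1}], staging, ual, persistent=False)
```

With `persistent=True` the raw copy stays in staging, which aids auditing and reprocessing at the cost of extra storage.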

Bring Data Into Your Data Warehouse

Extract:

  • Quickly pull data from source applications
  • Traditionally done in "batches"
  • Raw data
  • Lands in the staging layer

Transform:

  • Apples to apples: make data from different sources comparable
  • Prepare uniform data for the user access layer
  • Can be very complex

Load:

  • Store uniform data in user access layer

Challenge: significant business analysis and data modeling must be done BEFORE storing data

ELT (compared with ETL)

  • Blast raw data into the big data environment as-is
  • Use the big data environment's computing power to transform the data when needed

Initial Load ETL

  • Normally one time only, right before the data warehouse goes live
  • Bring in all relevant data necessary for BI and analytics
    • data definitely needed for BI and analytics
    • data probably needed for BI and analytics
    • historical data

Incremental ETL

  • Incrementally "refreshes" the data warehouse
  • New data: employees, etc.
  • Modified data: employee promotions, etc.
  • Special handling for deleted data: customer drops from a subscription plan, etc.

Four major incremental ETL patterns:

  • Append
  • In-place update
  • Complete replacement
  • Rolling append

ETL today mostly uses append and in-place update
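Append and in-place update are often combined into a single "upsert" pass over a batch of changed rows. A minimal sketch, using an in-memory list of employee rows as an invented example:

```python
# Sketch of incremental ETL via upsert: update existing keys, append new ones.
# The table and its columns are illustrative.

def upsert(table, rows, key="id"):
    """Apply a batch of changed rows to the target table."""
    index = {row[key]: i for i, row in enumerate(table)}
    for row in rows:
        if row[key] in index:
            table[index[row[key]]] = row   # in-place update: modified data
        else:
            table.append(row)              # append: new data

employees = [{"id": 1, "title": "engineer"}]
upsert(employees, [
    {"id": 1, "title": "senior engineer"},  # a promotion (modified data)
    {"id": 2, "title": "analyst"},          # a new hire (new data)
])
```

Deleted data would need the special handling noted above, e.g. a soft-delete flag rather than physically removing the row, so history is preserved.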

Role of Data Transformation

  • Uniformity
  • Restructuring

Common data transformation models:

  • Data value unification (ex: cm, ft -> cm)
  • Data type and size unification (ex: unify char(3) and char(20) into a single type)
  • De-duplication (ensure no double counting)
  • Vertical slicing (dropping columns)
  • Horizontal slicing (value-based row filtering)
  • Correcting known errors (correcting invalid values)
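A few of these models can be shown in one short pass over sample rows. The data, units, and column names below are invented for illustration:

```python
# Illustrates value unification, de-duplication, and vertical/horizontal
# slicing on made-up data.

def to_cm(value, unit):
    # Data value unification: express all lengths in cm (1 ft = 30.48 cm)
    return value * 30.48 if unit == "ft" else value

rows = [
    {"name": "Alice", "height": 5.5, "unit": "ft", "ssn": "123"},
    {"name": "Alice", "height": 5.5, "unit": "ft", "ssn": "123"},  # duplicate
    {"name": "Bob", "height": 180.0, "unit": "cm", "ssn": "456"},
]

# De-duplication: drop exact duplicates to avoid double counting
unique = [dict(t) for t in {tuple(sorted(r.items())) for r in rows}]

# Value unification + vertical slicing (drop the ssn column)
clean = [
    {"name": r["name"], "height_cm": to_cm(r["height"], r["unit"])}
    for r in unique
]

# Horizontal slicing: keep only rows whose height exceeds 170 cm
tall = [r for r in clean if r["height_cm"] > 170]
```

Correcting known errors would be one more step of the same shape: a mapping from each known-bad value to its correction, applied row by row.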