Data Warehouse Fundamentals - ayaohsu/Personal-Resources GitHub Wiki
Data Warehousing Concepts
Data warehouse: a warehouse filled with data!
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. It is designed to support business intelligence activities.
So the database is the platform, and the data warehouse is one use of that platform: a data warehouse is built on top of a database.
Concepts by Bill Inmon (~1990)
- Integrated: Data is from a number of sources
- Subject oriented: Regardless of sources, it should be organized by subject
- Time variant: It contains historical data
- Non-volatile: The data warehouse remains stable between refreshes
Reasons for data warehousing
- Making data-driven decisions
- One stop shopping (since data is scattered all over the place)
Data Warehouse vs Data Lake
A data warehouse is built upon an RDBMS, whereas a data lake is built upon a big data environment. Some people view data lakes as the next-generation data warehouse.
- Volume: data lake contains much more volume
- Velocity: data lake supports more rapid data intake
- Variety: data warehouse supports structured data, but big data supports both structured and unstructured data
Data Warehousing vs Data Virtualization
Data virtualization is essentially a read-only distributed DBMS. It accesses data in place, unlike a data warehouse.
End-to-End DW Environment
data sources ---> ETL (Extract/Transform/Load) ---> data warehouse ---> ETL ---> Data Marts
An analogy: Suppliers -> Wholesaler -> Retailers
Data Warehousing Architecture
Centralized Data Warehouse
- Single database
- One stop shopping
One challenge is that such a highly centralized architecture requires a high degree of cross-organization cooperation and data governance.
Component-Based Data Warehouse
- Decomposition
- Mix-and-match technology
- Bolt together components
- Overcome org. challenges
Challenges: data is often inconsistent across components and difficult to cross-integrate
"Cube"
Cube = Multidimensional database (MDBMS)
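A cube can be pictured as facts indexed by several dimensions, which can then be aggregated ("rolled up") along any dimension. A minimal sketch in plain Python; the dimensions and figures are invented for illustration:

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (product, region, quarter) dimensions
sales = {
    ("widget", "NA", "Q1"): 100,
    ("widget", "EU", "Q1"): 80,
    ("gadget", "NA", "Q1"): 50,
    ("widget", "NA", "Q2"): 120,
}

def rollup(cube, dim_index):
    """Aggregate the cube along one dimension (a basic OLAP roll-up)."""
    totals = defaultdict(int)
    for dims, value in cube.items():
        key = dims[:dim_index] + dims[dim_index + 1:]  # drop one dimension
        totals[key] += value
    return dict(totals)

# Roll up over quarters: total sales per (product, region)
by_product_region = rollup(sales, 2)
```

A real MDBMS precomputes and stores such aggregations so they can be queried instantly.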
Operational Data Store
An ODS integrates data from multiple sources, but its emphasis is on current operational data, and it is often fed in (near) real time.
"Tell me what is happening right now"
Strategic decision making vs operational decision making
Two options: parallel ETL pipeline or treat ODS as a stage (and feed to DW)
However, the ODS is less popular now, since modern data warehouses are faster and data lakes exist.
Best Value
Business Intelligence + {Data Warehousing and/or Data Lakes and/or Data Virtualization and/or Operational Data Store} = BEST VALUE
Staging Layer
- Staging Layer: The extraction part (1-to-1 mapping from the source)
  - Non-persistent: empty -> extract -> load (to the user access layer) -> empty
  - Persistent: data remains in staging after loading to the user access layer
- User Access Layer (UAL): the layer users query for BI and analytics
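The difference between the two staging styles can be sketched in a few lines of Python (all names are illustrative):

```python
# Sketch of non-persistent vs persistent staging; names are hypothetical.
staging = []          # staging layer (landing zone, 1-to-1 with the source)
user_access = []      # user access layer (UAL)

def extract(source_rows):
    staging.extend(source_rows)      # raw, untransformed copy of the source

def load(persistent=False):
    user_access.extend(staging)      # move data into the UAL
    if not persistent:
        staging.clear()              # non-persistent: emptied after each load

extract([{"id": 1}, {"id": 2}])
load(persistent=False)               # staging is empty again afterwards
```

With `persistent=True`, the staging copy survives each refresh, which keeps a raw history at the cost of extra storage.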
Bring Data Into Your Data Warehouse
Extract:
- Quickly pull data from source applications
- Traditionally done in "batches"
- Raw data
- Land in staging layer
Transform:
- Apples to apples: make data from different sources comparable
- Prepare for uniform data in user access layer
- Can be very complex
Load:
- Store uniform data in user access layer
Challenges: significant business analysis/data modeling BEFORE storing data
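The three steps above can be sketched as one small batch pipeline; the source rows and transformation rules below are invented for illustration:

```python
# Minimal batch ETL sketch; source data and rules are hypothetical.

def extract():
    # Extract: pull raw rows from a source application (hard-coded here)
    return [
        {"name": "alice", "height": "5.9ft"},
        {"name": "BOB",   "height": "175cm"},
    ]

def transform(rows):
    # Transform: "apples to apples" -- unify case and convert heights to cm
    out = []
    for row in rows:
        h = row["height"]
        cm = float(h[:-2]) * 30.48 if h.endswith("ft") else float(h[:-2])
        out.append({"name": row["name"].lower(), "height_cm": round(cm, 1)})
    return out

def load(rows, warehouse):
    # Load: store the uniform rows in the user access layer
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

In practice each step runs against real source systems and a real database, but the shape of the pipeline is the same.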
ELT (compared with ETL)
- Blast data into big data environment
- Use big data environment computing power to transform when needed
Initial Load ETL
- Normally one time only, right before the data warehouse goes live
- Bring in all relevant data necessary for BI and analytics:
  - data definitely needed for BI and analytics
  - data probably needed for BI and analytics
  - historical data
Incremental ETL
- Incrementally "refreshes" the data warehouse
- New data: employees, etc.
- Modified data: employee promotions, etc.
- Special handling for deleted data: customer drops from a subscription plan, etc.
Four major incremental ETL patterns:
- Append
- In-place update
- Complete replacement
- Rolling append
Incremental ETL today mostly uses append and in-place update.
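Those two common patterns can be sketched as operations on a keyed table (the employee schema and data are invented):

```python
# Append vs in-place update (upsert) on a keyed employee table; data invented.
warehouse = {}   # emp_id -> row

def upsert(rows):
    """New keys are appended; existing keys are updated in place."""
    for row in rows:
        warehouse[row["emp_id"]] = row

upsert([{"emp_id": 1, "title": "engineer"}])
upsert([{"emp_id": 1, "title": "senior engineer"},   # modified data (promotion)
        {"emp_id": 2, "title": "analyst"}])          # new data (new employee)
```

Complete replacement would instead rebuild the whole table each refresh, and rolling append would append while dropping the oldest rows past a retention window.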
Role of Data Transformation
- Uniformity
- Restructuring
Common data transformation models:
- Data value unification (ex: cm, ft -> cm)
- Data type and size unification (e.g., unify char(3) and char(20) columns into one type and size)
- De-duplication (ensure no double counting)
- Vertical slicing (dropping columns)
- Horizontal slicing (value-based row filtering)
- Correcting known errors (correcting invalid values)
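Several of these models can be shown in one short sketch; the sample rows and conversion rules are invented for illustration:

```python
# Illustrates value unification, de-duplication, slicing, and error handling.
raw = [
    {"id": 1, "height": "5.9ft", "note": "x"},
    {"id": 1, "height": "5.9ft", "note": "x"},   # duplicate row
    {"id": 2, "height": "175cm", "note": "y"},
    {"id": 3, "height": "???",   "note": "z"},   # known-bad value
]

def to_cm(value):
    # Data value unification: express every height in centimeters
    if value.endswith("ft"):
        return round(float(value[:-2]) * 30.48, 1)
    if value.endswith("cm"):
        return float(value[:-2])
    return None   # correcting known errors: flag invalid values

seen, clean = set(), []
for row in raw:
    if row["id"] in seen:
        continue                  # de-duplication: no double counting
    seen.add(row["id"])
    cm = to_cm(row["height"])
    if cm is None:
        continue                  # horizontal slicing: drop bad rows
    clean.append({"id": row["id"], "height_cm": cm})  # vertical slicing: no "note"
```

Real transformations run at scale inside the ETL/ELT engine, but each rule reduces to a small mapping or filter like these.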