data warehousing concepts - sungov/data-engineering-bible GitHub Wiki
Data Warehousing Concepts
A data warehouse serves as the foundation for analytical systems, enabling organizations to store, integrate, and analyze large volumes of data for business decision-making. This section provides a comprehensive understanding of data warehousing, including its architecture, data modeling schemas, layers like staging, ODS, and data marts, normalization concepts, Kimball vs. Inmon approaches, and a comparison with modern Medallion architecture.
What is a Data Warehouse?
Definition
A data warehouse is a centralized repository designed for querying, reporting, and analysis. Unlike transactional systems (OLTP), data warehouses (OLAP) are optimized for analytical queries, aggregating data from multiple heterogeneous sources.
Key Characteristics
- Subject-Oriented: Organized around key business subjects (e.g., sales, customers).
- Integrated: Combines data from various sources into a consistent format.
- Non-Volatile: Once data is stored, it is not updated or overwritten.
- Time-Variant: Stores historical data, enabling trend and time-series analysis.
Why Use a Data Warehouse?
-
Centralized Data Source:
- Eliminates silos by integrating data from multiple systems.
- Provides a "single source of truth" for business intelligence.
-
Optimized for Analytics:
- Designed for complex, read-heavy queries.
- Supports advanced analytics, dashboards, and reporting tools.
-
Historical Analysis:
- Enables trend analysis by retaining historical data.
-
Compliance and Governance:
- Helps meet regulatory requirements for data storage and access.
Data Warehouse Layers
A typical data warehouse architecture is organized into multiple layers to streamline data flow from raw ingestion to analytical consumption. These layers include:
1. Staging Area
- Definition: A temporary space where raw data from source systems is loaded before further processing.
- Purpose:
- Acts as a buffer between source systems and the warehouse.
- Allows for data cleansing, deduplication, and schema standardization.
- Characteristics:
- Holds raw, unprocessed data.
- Data is often purged after transformation.
- Naming Conventions:
- Often referred to as the "RAW" layer (e.g.,
staging_sales
). - Prefixes like
stg_
are common (e.g.,stg_customers
).
- Often referred to as the "RAW" layer (e.g.,
- Comparison with Medallion Architecture:
- Similar to the Bronze Layer in Medallion, where raw data is stored.
2. Operational Data Store (ODS)
- Definition: A real-time or near-real-time data store that integrates and cleanses data from operational systems for reporting purposes.
- Purpose:
- Supports operational reporting and lightweight analytics.
- Serves as an intermediate layer for data before it enters the warehouse.
- Characteristics:
- Contains recent, short-term data.
- Data may be updated frequently (unlike the warehouse).
- Naming Conventions:
- Prefixes like
ods_
ortemp_
(e.g.,ods_inventory
).
- Prefixes like
Example Flow:
Source Systems → Staging Area → ODS → Data Warehouse
3. Data Warehouse Core
- Definition: The centralized storage layer containing processed, historical, and integrated data.
- Purpose:
- Optimized for complex analytical queries.
- Stores clean and aggregated data.
- Structure:
- Fact and dimension tables organized into schemas (e.g., Star, Snowflake).
4. Data Marts
- Definition: Subsets of the data warehouse tailored for specific business domains or teams.
- Purpose:
- Provide focused data for specific use cases (e.g., marketing, sales).
- Reduce query complexity for end-users.
- Characteristics:
- Denormalized for performance.
- Typically limited in scope compared to the warehouse.
- Naming Conventions:
- Prefixes like
dm_
(e.g.,dm_sales
,dm_hr
).
- Prefixes like
Comparison with Medallion Architecture:
- Data Marts are analogous to the Gold Layer in Medallion, containing curated and ready-to-consume data.
Data Modeling in Data Warehouses
1. Schemas
Schemas define how data is organized within a warehouse.
Star Schema
- Central fact table surrounded by dimension tables.
- Optimized for performance and simplicity.
- Suitable for BI tools and OLAP queries.
Snowflake Schema
- Dimensions are further normalized into related tables.
- Reduces redundancy but increases query complexity.
Galaxy Schema (Fact Constellation)
- Multiple fact tables sharing common dimensions.
- Used in enterprise-level warehouses with multiple business processes.
Dimensions and Fact Tables
1. Dimensions
- Provide descriptive attributes for business entities.
- Examples: Customer, Product, Time.
2. Fact Tables
- Store measurable business metrics (e.g., sales, revenue).
- Types:
- Additive Facts: Can be aggregated across all dimensions.
- Semi-Additive Facts: Can be aggregated across some dimensions.
- Non-Additive Facts: Cannot be aggregated (e.g., ratios).
Factless Fact Tables
- Contain only foreign keys, no measurable facts.
- Example: Tracking student attendance (Date, Student_ID, Class_ID).
Data Normalization and Denormalization
1. Normalization
- Organizes data to reduce redundancy.
- Common in transactional systems (OLTP).
- Normal Forms:
- 1NF: Eliminates duplicate rows; ensures atomic values.
- 2NF: Ensures all non-key attributes depend on the primary key.
- 3NF: Removes transitive dependencies.
2. Denormalization
- Combines tables for better performance in analytical queries.
- Used in data warehouses (OLAP).
Slowly Changing Dimensions (SCD)
Types of SCD
- Type 0: No changes tracked.
- Type 1: Overwrites old values.
- Type 2: Tracks history by creating new rows.
- Type 3: Tracks limited history using new columns.
Kimball vs. Inmon Approaches
1. Kimball Approach
- Bottom-Up: Focuses on building data marts first, integrating them into a warehouse.
- Advantages:
- Quicker implementation.
- Easy to align with business needs.
- Disadvantages:
- Difficult to scale for large enterprises.
2. Inmon Approach
- Top-Down: Builds an enterprise data warehouse first, then creates data marts.
- Advantages:
- Better for large-scale organizations.
- Ensures consistency across data marts.
- Disadvantages:
- Longer implementation time.
Comparison with Medallion Architecture
Feature | Traditional DW Layers | Medallion Architecture |
---|---|---|
Raw Data | Staging | Bronze |
Processed Data | Warehouse Core | Silver |
Curated Data | Data Marts | Gold |
Flexibility | Limited by schema | High due to Delta Lake |
Real-Time Support | Limited | Strong |
Flowchart Description
- Staging Area:
- Raw data is ingested and cleaned.
- Example: Raw Sales Data → Cleaned Sales Data.
- ODS:
- Provides operational reporting capabilities.
- Example: Recent Sales Orders.
- Data Warehouse:
- Integrates historical data for analysis.
- Example: Fact and Dimension Tables.
- Data Marts:
- Domain-specific analytical datasets.
- Example: Marketing Data Mart for Campaign Analysis.
Conclusion
Data warehousing concepts form the backbone of enterprise analytics. From designing schemas to managing dimensions, understanding these concepts ensures the creation of efficient and scalable data warehouses. Modern approaches like Medallion architecture build on these principles, enhancing flexibility and real-time capabilities.
Next Steps
Explore the next chapter: Data Lakes and Lakehouse Architectures