data warehousing concepts - sungov/data-engineering-bible GitHub Wiki

Data Warehousing Concepts

A data warehouse serves as the foundation for analytical systems, enabling organizations to store, integrate, and analyze large volumes of data for business decision-making. This section provides a comprehensive understanding of data warehousing, including its architecture, data modeling schemas, layers like staging, ODS, and data marts, normalization concepts, Kimball vs. Inmon approaches, and a comparison with modern Medallion architecture.

What is a Data Warehouse?

Definition

A data warehouse is a centralized repository designed for querying, reporting, and analysis. Unlike transactional systems (OLTP), data warehouses (OLAP) are optimized for analytical queries, aggregating data from multiple heterogeneous sources.

Key Characteristics

Subject-Oriented: Organized around key business subjects (e.g., sales, customers).
Integrated: Combines data from various sources into a consistent format.
Non-Volatile: Once data is stored, it is not updated or overwritten.
Time-Variant: Stores historical data, enabling trend and time-series analysis.

Why Use a Data Warehouse?

Centralized Data Source:
- Eliminates silos by integrating data from multiple systems.
- Provides a "single source of truth" for business intelligence.
Optimized for Analytics:
- Designed for complex, read-heavy queries.
- Supports advanced analytics, dashboards, and reporting tools.
Historical Analysis:
- Enables trend analysis by retaining historical data.
Compliance and Governance:
- Helps meet regulatory requirements for data storage and access.

Data Warehouse Layers

A typical data warehouse architecture is organized into multiple layers to streamline data flow from raw ingestion to analytical consumption. These layers include:

1. Staging Area

Definition: A temporary space where raw data from source systems is loaded before further processing.
Purpose:
- Acts as a buffer between source systems and the warehouse.
- Allows for data cleansing, deduplication, and schema standardization.
Characteristics:
- Holds raw, unprocessed data.
- Data is often purged after transformation.
Naming Conventions:
- Often referred to as the "RAW" layer (e.g., staging_sales).
- Prefixes like stg_ are common (e.g., stg_customers).
Comparison with Medallion Architecture:
- Similar to the Bronze Layer in Medallion, where raw data is stored.

2. Operational Data Store (ODS)

Definition: A real-time or near-real-time data store that integrates and cleanses data from operational systems for reporting purposes.
Purpose:
- Supports operational reporting and lightweight analytics.
- Serves as an intermediate layer for data before it enters the warehouse.
Characteristics:
- Contains recent, short-term data.
- Data may be updated frequently (unlike the warehouse).
Naming Conventions:
- Prefixes like ods_ or temp_ (e.g., ods_inventory).

Example Flow:

Source Systems → Staging Area → ODS → Data Warehouse

3. Data Warehouse Core

Definition: The centralized storage layer containing processed, historical, and integrated data.
Purpose:
- Optimized for complex analytical queries.
- Stores clean and aggregated data.
Structure:
- Fact and dimension tables organized into schemas (e.g., Star, Snowflake).

4. Data Marts

Definition: Subsets of the data warehouse tailored for specific business domains or teams.
Purpose:
- Provide focused data for specific use cases (e.g., marketing, sales).
- Reduce query complexity for end-users.
Characteristics:
- Denormalized for performance.
- Typically limited in scope compared to the warehouse.
Naming Conventions:
- Prefixes like dm_ (e.g., dm_sales, dm_hr).

Comparison with Medallion Architecture:

Data Marts are analogous to the Gold Layer in Medallion, containing curated and ready-to-consume data.

Data Modeling in Data Warehouses

1. Schemas

Schemas define how data is organized within a warehouse.

Star Schema

Central fact table surrounded by dimension tables.
Optimized for performance and simplicity.
Suitable for BI tools and OLAP queries.

Snowflake Schema

Dimensions are further normalized into related tables.
Reduces redundancy but increases query complexity.

Galaxy Schema (Fact Constellation)

Multiple fact tables sharing common dimensions.
Used in enterprise-level warehouses with multiple business processes.

Dimensions and Fact Tables

1. Dimensions

Provide descriptive attributes for business entities.
Examples: Customer, Product, Time.

2. Fact Tables

Store measurable business metrics (e.g., sales, revenue).
Types:
- Additive Facts: Can be aggregated across all dimensions.
- Semi-Additive Facts: Can be aggregated across some dimensions.
- Non-Additive Facts: Cannot be aggregated (e.g., ratios).

Factless Fact Tables

Contain only foreign keys, no measurable facts.
Example: Tracking student attendance (Date, Student_ID, Class_ID).

Data Normalization and Denormalization

1. Normalization

Organizes data to reduce redundancy.
Common in transactional systems (OLTP).
Normal Forms:
- 1NF: Eliminates duplicate rows; ensures atomic values.
- 2NF: Ensures all non-key attributes depend on the primary key.
- 3NF: Removes transitive dependencies.

2. Denormalization

Combines tables for better performance in analytical queries.
Used in data warehouses (OLAP).

Slowly Changing Dimensions (SCD)

Types of SCD

Type 0: No changes tracked.
Type 1: Overwrites old values.
Type 2: Tracks history by creating new rows.
Type 3: Tracks limited history using new columns.

Kimball vs. Inmon Approaches

1. Kimball Approach

Bottom-Up: Focuses on building data marts first, integrating them into a warehouse.
Advantages:
- Quicker implementation.
- Easy to align with business needs.
Disadvantages:
- Difficult to scale for large enterprises.

2. Inmon Approach

Top-Down: Builds an enterprise data warehouse first, then creates data marts.
Advantages:
- Better for large-scale organizations.
- Ensures consistency across data marts.
Disadvantages:
- Longer implementation time.

Comparison with Medallion Architecture

Feature	Traditional DW Layers	Medallion Architecture
Raw Data	Staging	Bronze
Processed Data	Warehouse Core	Silver
Curated Data	Data Marts	Gold
Flexibility	Limited by schema	High due to Delta Lake
Real-Time Support	Limited	Strong

Flowchart Description

Staging Area:
- Raw data is ingested and cleaned.
- Example: Raw Sales Data → Cleaned Sales Data.
ODS:
- Provides operational reporting capabilities.
- Example: Recent Sales Orders.
Data Warehouse:
- Integrates historical data for analysis.
- Example: Fact and Dimension Tables.
Data Marts:
- Domain-specific analytical datasets.
- Example: Marketing Data Mart for Campaign Analysis.

Conclusion

Data warehousing concepts form the backbone of enterprise analytics. From designing schemas to managing dimensions, understanding these concepts ensures the creation of efficient and scalable data warehouses. Modern approaches like Medallion architecture build on these principles, enhancing flexibility and real-time capabilities.

Next Steps

Explore the next chapter: Data Lakes and Lakehouse Architectures