Data Modeling ‐ Complex Data Types and Cumulation - sachit914/datawarehouse GitHub Wiki

Reference: https://www.youtube.com/watch?v=5U-BbZ9G_xU&t=1244s

Introduction to Dimensional Data Modeling

  1. Complex Data Types:
  • Array: Think of it as a list in a column.
  • Struct: Similar to a table within a table.
  • Combining these (e.g., arrays of structs) can make datasets more compact.
  2. Dimensions:
  • Dimensions are attributes of an entity (e.g., birthday, favorite food, city).
  • Identifier Dimensions: Uniquely identify an entity (e.g., user ID, social security number, device ID).
  • Attributes: Provide additional details but don't identify the entity (e.g., favorite_food, city).
  3. Types of Dimensions:
  • Slowly Changing Dimensions: Attributes that evolve over time (e.g., a person's home airport in India).
  • Fixed Dimensions: Attributes that don't change (e.g., device ID, the manufacturer of a phone).
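The list-in-a-column idea above can be sketched in plain Python, where a dataclass plays the role of a struct and a list plays the role of an array. The names (`DailyActivity`, `UserRow`, `page_views`) are made up for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DailyActivity:               # the "struct": a typed record inside a column
    date: str
    page_views: int

@dataclass
class UserRow:
    user_id: int                   # identifier dimension
    favorite_food: str             # attribute dimension
    activity: List[DailyActivity]  # array of structs: a list in a column

row = UserRow(
    user_id=42,
    favorite_food="pizza",
    activity=[
        DailyActivity("2024-01-01", 10),
        DailyActivity("2024-01-02", 7),
    ],
)

# User-level attributes are stored once instead of being repeated on
# every daily record, which is what makes the layout more compact.
total_views = sum(a.page_views for a in row.activity)
print(total_views)  # 17
```

In a warehouse the same shape would be a table whose `activity` column has type ARRAY<STRUCT<...>>.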

Continuum of Data Modeling


  • The Continuum of Data Modeling described here is a framework used in data engineering to understand how data evolves through layers, from its raw, transactional form to high-level metrics.

1. Transactional Data

  • This is the raw data generated by the operational systems of an application
  • Stored in normalized tables.
  • Designed for efficient write operations and quick lookups.
  • Often messy, fragmented across multiple tables, and challenging to query directly for insights.
  • Using transactional data directly for analytics often requires joining numerous tables, which can be complex, slow, and error-prone.
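A minimal sketch of why analytics on normalized transactional data needs joins. The tables and column names (`users`, `orders`, `amount`) are hypothetical; even a simple question like "total spend per city" requires stitching tables together:

```python
# Two normalized "tables": user attributes in one, orders in another.
users = [
    {"user_id": 1, "city": "Delhi"},
    {"user_id": 2, "city": "Mumbai"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 25.0},
    {"order_id": 11, "user_id": 1, "amount": 15.0},
    {"order_id": 12, "user_id": 2, "amount": 40.0},
]

# Join on user_id, then aggregate: total spend per city.
city_by_user = {u["user_id"]: u["city"] for u in users}
spend_by_city = {}
for o in orders:
    city = city_by_user[o["user_id"]]
    spend_by_city[city] = spend_by_city.get(city, 0.0) + o["amount"]

print(spend_by_city)  # {'Delhi': 40.0, 'Mumbai': 40.0}
```

With dozens of tables instead of two, every query repeats this join logic, which is where master data comes in.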

2. Master Data

  • The goal is to create a unified, clean, and well-organized representation of transactional data by consolidating multiple raw data sources into a single source of truth.
  • Example: combining 40+ transactional datasets at Airbnb into one master dataset for understanding pricing and availability.
  • Serves as an intermediary between raw data and analytics-ready datasets.
  • Makes it easier for analysts to perform queries without needing to understand the intricacies of the underlying transactional data.

3. OLAP Cube

  • Stands for Online Analytical Processing. It flattens and denormalizes the master data, structuring it for quick aggregation and multi-dimensional analysis.
  • Key Features:
    • Supports slicing and dicing, allowing users to explore data along various dimensions.
  • Primary Uses:
    • Group-bys.
    • Aggregations.
    • Trend analyses.
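Slicing and dicing can be sketched with plain Python over flattened fact rows like those an OLAP cube exposes. The dimensions (`country`, `device`) and measure (`listings`) are illustrative:

```python
from collections import defaultdict

# Hypothetical flattened fact rows: each row carries its dimensions
# (country, device) and a measure (listings).
facts = [
    {"country": "US", "device": "web",    "listings": 100},
    {"country": "US", "device": "mobile", "listings": 60},
    {"country": "IN", "device": "web",    "listings": 80},
    {"country": "IN", "device": "mobile", "listings": 40},
]

def roll_up(rows, dim):
    """Group-by + aggregate along one dimension (slicing the cube)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim]] += r["listings"]
    return dict(totals)

print(roll_up(facts, "country"))  # {'US': 160, 'IN': 120}
print(roll_up(facts, "device"))   # {'web': 180, 'mobile': 100}
```

Because the rows are already denormalized, every group-by is a single pass with no joins.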

4. Metrics

  • Metrics are summaries of complex data, turned into simple, clear insights that help people make decisions.
  • Characteristics:
    • Derived from OLAP cubes.
    • Often pre-computed and stored for quick retrieval (e.g., dashboards, reports).
    • Designed for decision-making rather than exploration.

Summary

  • Transactional Data: Messy, raw, numerous tables.
  • Master Data: Cleaned, unified, structured (one table, or far fewer tables).
  • OLAP Cube: Flattened, multi-dimensional, ready for exploration.
  • Metrics: Simple, distilled, decision-oriented, pre-computed data.

Cumulative Table Design

  • A table that accumulates data over time to provide a historical record.
  • Example: A table tracking daily website traffic with a running total.
  • Importance:
    • Enables tracking trends and changes over time.
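A sketch of the cumulative pattern: yesterday's cumulative table is merged with today's snapshot (the equivalent of a FULL OUTER JOIN on `user_id`), carrying history forward as an array. All names and numbers here are illustrative:

```python
# Yesterday's cumulative table: user_id -> daily page-view history so far.
yesterday = {
    1: [10, 7],
    2: [5],
}
# Today's snapshot: user 2 was inactive, user 3 is brand new.
today = {1: 3, 3: 9}

cumulative = {}
for user_id in yesterday.keys() | today.keys():   # full outer join on user_id
    history = list(yesterday.get(user_id, []))    # keep prior history
    history.append(today.get(user_id, 0))         # 0 marks an inactive day
    cumulative[user_id] = history

print(cumulative)  # {1: [10, 7, 3], 2: [5, 0], 3: [9]}
```

Running this once per day keeps the full history in one row per user, so trend questions need no scan over daily tables.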

Compactness vs. Usability Trade-off

Compactness:

  • Reducing data size using advanced structures (arrays, structs, compression).
  • Example: Shrinking datasets using arrays of structs.

Usability:

  • Ensuring data is easy to query and analyze.
  • Compact data often sacrifices usability due to its complexity.

Temporal Cardinality Explosion

Issue:

  • Happens when time-based data modeling creates excessively large datasets.
  • Example: Recording a value for every second of a day can lead to massive data growth.

Solutions:

  • Aggregate data at higher levels (e.g., hourly or daily instead of second-by-second).
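The aggregation fix can be sketched as a simple bucketing pass: per-second readings (86,400 rows per day) collapse into at most 24 hourly rows. The sample readings below are made up; timestamps are seconds since midnight:

```python
from collections import defaultdict

# Sparse sample of per-second readings: timestamp (s) -> value.
per_second = {0: 1.0, 1: 2.0, 3600: 4.0, 3601: 6.0, 7200: 3.0}

hourly_sum = defaultdict(float)
for ts, value in per_second.items():
    hourly_sum[ts // 3600] += value   # bucket each reading by hour

print(dict(hourly_sum))  # {0: 3.0, 1: 10.0, 2: 3.0}
```

The cardinality drops from one row per second to one row per hour, at the cost of losing sub-hour detail.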

Run-Length Encoding

  • A compression technique that stores consecutive identical values as a single value with a count.
  • Example: Instead of storing [A, A, A, B, B], store [(A, 3), (B, 2)].

Use Case:

  • Works well with columnar storage formats like Parquet.
  • Significantly reduces storage requirements for repetitive data.
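The example above can be implemented in a few lines; this is a toy version of the idea, not Parquet's actual encoder:

```python
def rle_encode(values):
    """Run-length encode: [A, A, A, B, B] -> [(A, 3), (B, 2)]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, count in runs for _ in range(count)]

data = ["A", "A", "A", "B", "B"]
encoded = rle_encode(data)
print(encoded)  # [('A', 3), ('B', 2)]
assert rle_decode(encoded) == data
```

This is why sorting columns to group identical values together (e.g., after a shuffle-free write) matters: longer runs compress better.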