Data Modeling ‐ Complex Data Types and Cumulation - sachit914/datawarehouse GitHub Wiki

Reference: https://www.youtube.com/watch?v=5U-BbZ9G_xU&t=1244s

Introduction to Dimensional Data Modeling

  1. Complex Data Types:
  • Array: Think of it as a list in a column.
  • Struct: Similar to a table within a table.
  • Combining these (e.g., arrays of structs) can make datasets more compact.
  2. Dimensions:
  • Dimensions are attributes of an entity (e.g., birthday, favorite food, city).
  • Identifier Dimensions: Uniquely identify an entity (e.g., user ID, social security number, device ID).
  • Attributes: Provide additional details but don't identify the entity (e.g., favorite_food, city).
  3. Types of Dimensions:
  • Slowly Changing Dimensions: Attributes that evolve over time (e.g., a person's home airport in India).
  • Fixed Dimensions: Attributes that don't change (e.g., device ID, the manufacturer of a phone).
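The list-in-a-column idea above can be sketched in plain Python, where a dataclass plays the role of a struct and a list plays the role of an array. The names (`DailyActivity`, `UserRow`, `page_views`) are made up for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DailyActivity:               # the "struct": a typed record inside a column
    date: str
    page_views: int

@dataclass
class UserRow:
    user_id: int                   # identifier dimension
    favorite_food: str             # attribute dimension
    activity: List[DailyActivity]  # array of structs: a list in a column

row = UserRow(
    user_id=42,
    favorite_food="pizza",
    activity=[
        DailyActivity("2024-01-01", 10),
        DailyActivity("2024-01-02", 7),
    ],
)

# User-level attributes are stored once instead of being repeated on
# every daily record, which is what makes the layout more compact.
total_views = sum(a.page_views for a in row.activity)
print(total_views)  # 17
```

In a warehouse the same shape would be a table whose `activity` column has type ARRAY<STRUCT<...>>.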

Continuum of Data Modeling


  • The Continuum of Data Modeling described here is a framework used in data engineering to understand how data evolves through layers, from its raw, transactional form to high-level metrics.

1. Transactional Data

  • This is the raw data generated by the operational systems of an application
  • Stored in normalized tables.
  • Designed for efficient write operations and quick lookups.
  • Often messy, fragmented across multiple tables, and challenging to query directly for insights.
  • Using transactional data directly for analytics often requires joining numerous tables, which can be complex, slow, and error-prone.
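A minimal sketch of why analytics on normalized transactional data needs joins. The tables and column names (`users`, `orders`, `amount`) are hypothetical; even a simple question like "total spend per city" requires stitching tables together:

```python
# Two normalized "tables": user attributes in one, orders in another.
users = [
    {"user_id": 1, "city": "Delhi"},
    {"user_id": 2, "city": "Mumbai"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 25.0},
    {"order_id": 11, "user_id": 1, "amount": 15.0},
    {"order_id": 12, "user_id": 2, "amount": 40.0},
]

# Join on user_id, then aggregate: total spend per city.
city_by_user = {u["user_id"]: u["city"] for u in users}
spend_by_city = {}
for o in orders:
    city = city_by_user[o["user_id"]]
    spend_by_city[city] = spend_by_city.get(city, 0.0) + o["amount"]

print(spend_by_city)  # {'Delhi': 40.0, 'Mumbai': 40.0}
```

With dozens of tables instead of two, every query repeats this join logic, which is where master data comes in.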

2. Master Data

  • The goal is to create a unified, clean, and well-organized representation of transactional data by consolidating multiple raw data sources into a single source of truth.
  • Example: combining 40+ transactional datasets at Airbnb into one master dataset for understanding pricing and availability.
  • Serves as an intermediary between raw data and analytics-ready datasets.
  • Makes it easier for analysts to perform queries without needing to understand the intricacies of the underlying transactional data.

3. OLAP Cube

  • Stands for Online Analytical Processing. It flattens and denormalizes the master data, structuring it for quick aggregation and multi-dimensional analysis.
  • Key Features:
    • Supports slicing and dicing, allowing users to explore data along various dimensions.
  • Primary Uses:
    • Group-bys.
    • Aggregations.
    • Trend analyses.
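Slicing and dicing can be sketched with plain Python over flattened fact rows like those an OLAP cube exposes. The dimensions (`country`, `device`) and measure (`listings`) are illustrative:

```python
from collections import defaultdict

# Hypothetical flattened fact rows: each row carries its dimensions
# (country, device) and a measure (listings).
facts = [
    {"country": "US", "device": "web",    "listings": 100},
    {"country": "US", "device": "mobile", "listings": 60},
    {"country": "IN", "device": "web",    "listings": 80},
    {"country": "IN", "device": "mobile", "listings": 40},
]

def roll_up(rows, dim):
    """Group-by + aggregate along one dimension (slicing the cube)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim]] += r["listings"]
    return dict(totals)

print(roll_up(facts, "country"))  # {'US': 160, 'IN': 120}
print(roll_up(facts, "device"))   # {'web': 180, 'mobile': 100}
```

Because the rows are already denormalized, every group-by is a single pass with no joins.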

4. Metrics

  • Metrics are summaries of complex data, turned into simple, clear insights that help people make decisions.
  • Characteristics:
    • Derived from OLAP cubes.
    • Often pre-computed and stored for quick retrieval (e.g., dashboards, reports).
    • Designed for decision-making rather than exploration.

Summary

  • Transactional Data: Messy, raw, numerous tables.
  • Master Data: Cleaned, unified, structured (one table, or far fewer tables).
  • OLAP Cube: Flattened, multi-dimensional, ready for exploration.
  • Metrics: Simple, distilled, decision-oriented, pre-computed data.

Cumulative Table Design

  • A table that accumulates data over time to provide a historical record.
  • Example: A table tracking daily website traffic with a running total.
  • Importance:
    • Enables tracking trends and changes over time.
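A sketch of the cumulative pattern: yesterday's cumulative table is merged with today's snapshot (the equivalent of a FULL OUTER JOIN on `user_id`), carrying history forward as an array. All names and numbers here are illustrative:

```python
# Yesterday's cumulative table: user_id -> daily page-view history so far.
yesterday = {
    1: [10, 7],
    2: [5],
}
# Today's snapshot: user 2 was inactive, user 3 is brand new.
today = {1: 3, 3: 9}

cumulative = {}
for user_id in yesterday.keys() | today.keys():   # full outer join on user_id
    history = list(yesterday.get(user_id, []))    # keep prior history
    history.append(today.get(user_id, 0))         # 0 marks an inactive day
    cumulative[user_id] = history

print(cumulative)  # {1: [10, 7, 3], 2: [5, 0], 3: [9]}
```

Running this once per day keeps the full history in one row per user, so trend questions need no scan over daily tables.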

Compactness vs. Usability Trade-off

Compactness:

  • Reducing data size using advanced structures (arrays, structs, compression).
  • Example: Shrinking datasets using arrays of structs.

Usability:

  • Ensuring data is easy to query and analyze.
  • Compact data often sacrifices usability due to its complexity.

Temporal Cardinality Explosion

Issue:

  • Happens when time-based data modeling creates excessively large datasets.
  • Example: Recording a value for every second of a day can lead to massive data growth.

Solutions:

  • Aggregate data at higher levels (e.g., hourly or daily instead of second-by-second).
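The aggregation fix can be sketched as a simple bucketing pass: per-second readings (86,400 rows per day) collapse into at most 24 hourly rows. The sample readings below are made up; timestamps are seconds since midnight:

```python
from collections import defaultdict

# Sparse sample of per-second readings: timestamp (s) -> value.
per_second = {0: 1.0, 1: 2.0, 3600: 4.0, 3601: 6.0, 7200: 3.0}

hourly_sum = defaultdict(float)
for ts, value in per_second.items():
    hourly_sum[ts // 3600] += value   # bucket each reading by hour

print(dict(hourly_sum))  # {0: 3.0, 1: 10.0, 2: 3.0}
```

The cardinality drops from one row per second to one row per hour, at the cost of losing sub-hour detail.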

Run-Length Encoding

  • A compression technique that stores consecutive identical values as a single value with a count.
  • Example: Instead of storing [A, A, A, B, B], store [(A, 3), (B, 2)].

Use Case:

  • Works well with columnar storage formats like Parquet.
  • Significantly reduces storage requirements for repetitive data.
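The example above can be implemented in a few lines; this is a toy version of the idea, not Parquet's actual encoder:

```python
def rle_encode(values):
    """Run-length encode: [A, A, A, B, B] -> [(A, 3), (B, 2)]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, count in runs for _ in range(count)]

data = ["A", "A", "A", "B", "B"]
encoded = rle_encode(data)
print(encoded)  # [('A', 3), ('B', 2)]
assert rle_decode(encoded) == data
```

This is why sorting columns to group identical values together (e.g., after a shuffle-free write) matters: longer runs compress better.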