Data Modeling ‐ Complex Data Types and Cumulation - sachit914/datawarehouse GitHub Wiki
refernce: https://www.youtube.com/watch?v=5U-BbZ9G_xU&t=1244s
Introduction to Dimensional Data Modeling
- Complex Data Types:
- Array: Think of it as a list in a column.
- Struct: Similar to a table within a table.
- Combining these (e.g., arrays of structs) can make data sets more compact
- Dimensions:
- Dimensions are attributes of an entity (e.g., birthday, favorite food, city).
- Identifier Dimensions: Uniquely identify an entity (e.g., user ID, social security number, device ID).
- Attributes: Provide additional details but don't identify the entity (favorite_food,city).
- Types of Dimensions:
- Slowly Changing Dimensions: Attributes that evolve over time (e.g., airport in india ).
- Fixed Dimensions: Attributes that don’t change (e.g., device id , manufacturer of a phone).
Continuum of Data Modeling
- The Continuum of Data Modeling described here is a framework used in data engineering to understand how data evolves through layers, from its raw, transactional form to high-level metrics.
1. Transactional Data
- This is the raw data generated by the operational systems of an application
- Stored in normalized tables.
- Designed for efficient write operations and quick lookups.
- Often messy, fragmented across multiple tables, and challenging to query directly for insights.
- Using transactional data directly for analytics often requires joining numerous tables, which can be complex, slow, and error-prone.
2. Master Data
- To create a unified, clean, and well-organized representation of transactional data. It involves consolidating multiple raw data sources into a single source of truth.
- Combining 40+ transactional datasets at Airbnb to create one master dataset for understanding pricing and availability.
- Serves as an intermediary between raw data and analytics-ready datasets.
- Makes it easier for analysts to perform queries without needing to understand the intricacies of the underlying transactional data.
3. OLAP Cube
- Stands for Online Analytical Processing. It flattens and denormalizes the master data, structuring it for quick aggregation and multi-dimensional analysis.
- Key Features:
- Supports slicing and dicing, allowing users to explore data along various dimensions.
- Primary Users:
- Group-bys.
- Aggregations.
- Trend analyses.
4. Metrics
- Metrics are summaries of complex data, turned into simple, clear insights that help people make decisions.
- Characteristics:
- Derived from OLAP cubes.
- Often pre-computed and stored for quick retrieval (e.g., dashboards, reports).
- Designed for decision-making rather than exploration.
summary
- Transactional Data: Messy, raw, numerous tables.
- Master Data: Cleaned, unified, structured (one table or fewer).
- OLAP Cube: Flattened, multi-dimensional, ready for exploration.
- Metrics: Simple, distilled, decision-oriented,pre-computed data.
Cumulative Table Design
- A table that accumulates data over time to provide a historical record.
- example: A table tracking daily website traffic with a running total.
- Importance:
- Enables tracking trends and changes over time.
Compactness vs. Usability Trade-off
Compactness:
- Reducing data size using advanced structures (arrays, structs, compression).
- Example: Shrinking datasets using arrays of structs. Usability:
- Ensuring data is easy to query and analyze.
- Compact data often sacrifices usability due to its complexity.
Temporal Cardinality Explosion
Issue:
- Happens when time-based data modeling creates excessively large datasets.
- Example: Recording a value for every second of a day can lead to massive data growth. Solutions:
- Aggregate data at higher levels (e.g., hourly or daily instead of second-by-second).
Run-Length Encoding
- A compression technique that stores consecutive identical values as a single value with a count.
- Example: Instead of storing [A, A, A, B, B], store [(A, 3), (B, 2)]. Use Case:
- Works well with columnar storage formats like Parquet.
- Significantly reduces storage requirements for repetitive data.