Data Modeling ‐ Slowly Changing Dimensions and Idempotency - sachit914/datawarehouse GitHub Wiki
Data Modeling - Slowly Changing Dimensions and Idempotency
refernce https://www.youtube.com/watch?v=emQM9gYh0Io
Slowly Changing Dimensions (SCDs)
- SCDs are attributes in a dimension table that change over time.
- phone number of user wont change frequently it changes over time , similarly address also does not changes frequently
Idempotent Pipelines
-
A system is idempotent if running the same operation multiple times yields the same result.
-
Pipelines should produce the same results
- Regardless of the day you run it
- Regardless of how many times you run it
- Regardless of the hour that you run it
-
In a non-idempotent pipeline, running the same pipeline multiple times can lead to inconsistent data. For example, if the pipeline is rerun (due to a failure or a scheduled backfill), it might produce different results, or duplicate records. This happens because when the pipeline fails, some data may have already been processed and added, and when it’s rerun, the same data might be processed again, leading to duplicates
Common Pitfalls in Non-Idempotent Pipelines:
1 Insert Without Truncate
- Problem: If you insert data into a table without clearing previous data, each pipeline run will add duplicate records. For instance, if you're processing daily sales data and using an INSERT INTO without cleaning the table, every time the pipeline runs, new rows will be added without removing the old data, resulting in multiple entries for the same record.
- Solution: Use MERGE or INSERT OVERWRITE strategies.
- MERGE: This merges new data with existing data, checking if records already exist and avoiding duplicates.
- INSERT OVERWRITE: Overwrites the data in specific partitions, ensuring consistency across all pipeline runs.
2 Improper Use of Date Filters
- Problem: Using filters like WHERE date > today without a corresponding end_date causes your pipeline to process an ever-growing set of data every time it runs. This results in missing or duplicated data and can even cause out-of-memory errors during backfilling.
- how :When you use a filter like WHERE date > today, the pipeline only looks for data that's after the current day. However, if you don't set an end date (a point where the data selection stops), the filter will keep picking up new data each time the pipeline runs, even data that was already processed before.
- Solution: Always define a bounded time window. For example, use a filter like WHERE date > 'start_date' AND date <= 'end_date' to ensure you're processing only the intended time window of data.
3 Cumulative Pipelines and Sequential Processing
- Problem: For pipelines that depend on cumulative data (like daily totals), trying to run processes in parallel without considering the correct sequence can cause inconsistencies. For instance, if you're aggregating daily sales but the pipeline processes today’s data before yesterday’s, you might lose important intermediate totals.
- Solution: Use sequential processing to ensure each day's data is processed in order. For example, a cumulative sales total should process yesterday’s data before today’s data to ensure the accuracy of daily aggregates.
What is a Slowly Changing Dimension (SCD)?
- An SCD is an attribute (dimension) in a dataset that changes over time but not rapidly or frequently
Approaches to Modeling Slowly Changing Dimensions
1. Latest Snapshot
- What it is: Only the current value of the dimension is stored.
- Advantages: Simple and storage-efficient.
- Disadvantages: Loses historical context. Backfilling can lead to inaccurate records.
- Example: A customer’s current country of residence is stored, or overwritten to their old value so their past countries are lost.
2. Daily Snapshot
- What it is: A snapshot is taken daily, storing the value of the dimension for each day.
- Advantages: Maintains full historical data.
- Disadvantages: Higher storage costs. If dimensions rarely change, this approach can lead to redundant rows.
- Example: Storing a customer’s country every day, even if it hasn’t changed for years.
3. Slowly Changing Dimension (SCD)
- What it is: Stores only the periods when a dimension changes, reducing redundancy.
- Advantages: Efficient storage. Maintains historical context without storing unnecessary rows.
- Disadvantages: Requires logic to identify and collapse unchanged rows.
- Example: If a customer lived in the US from 2010–2015 and Canada from 2015 onward, you would store:
- US (2010–2015)
- Canada (2015–present)
Slowly Changing Dimensions (SCD) Types
- SCD Type 0: Immutable dimensions (values never change).
- SCD Type 1: Only keep the latest value. Overwrites previous records, losing historical information.
- SCD Type 2: Maintain complete history with start and end dates. The gold standard for analytics.
- SCD Type 3: Maintain only the original and latest values, losing intermediate changes.
SCD Type 0 (Fixed Dimensions)
- Definition: Values are immutable and will never change.
- Use Case: Static attributes like “Date of Birth” or “Country of Origin.”
SCD Type 1: Latest-Only Dimensions
- Definition: Store only the current value, overwriting old ones.
- Use Case: Transactional systems (OLTP) where historical data isn’t critical.
- Why it’s problematic:
- Historical accuracy is lost.
- Pipelines lose idempotency, making backfill operations unreliable.
- Unsuitable for analytical contexts where historical trends matter.
Type 2: Complete History
- Definition: Each change creates a new record with start_date and end_date to capture the validity range of that version.
- Use Case: Analytical systems where maintaining a history of changes is crucial (e.g., tracking customer preferences over time).
- Advantages:
- Maintains historical accuracy.
- Supports time-based queries (e.g., "What was the state on this date?").
- Allows backfill without corrupting old data.
- Implementation Considerations:
- Use far-future dates (e.g., 9999-12-31) or NULL for end_date of the current record.
- Add a Boolean is_current column for easier identification of active records.
Type 3: Limited History
- Definition: Only store the original and current values, along with optional timestamps.
- Use Case: Situations where tracking intermediate changes isn’t necessary, and simplicity is preferred.
- Advantages:
- Simple table structure (single row per entity).
- Easier querying compared to Type 2.
- Disadvantages:
- Fails when dimensions change multiple times.
- Loses detailed historical context.
- Not idempotent for backfill operations.
Loading SCD Tables
Full Reprocessing:
- Definition: Process all historical data at once, overwriting the table each time.
- Advantages:
- Easier to implement for small datasets.
- Ensures consistency across the entire dataset.
- Drawbacks:
- Computationally expensive for large datasets.
Incremental Updates:
- Definition: Process only new or changed data since the last update.
- Advantages:
- More efficient for large datasets.
- Reduces processing time and computational resources.
- Challenges:
- Requires robust change detection mechanisms.