Data Modeling ‐ Slowly Changing Dimensions and Idempotency - sachit914/datawarehouse GitHub Wiki

SCDs are attributes in a dimension table that change over time.
phone number of user wont change frequently it changes over time , similarly address also does not changes frequently

Idempotent Pipelines

A system is idempotent if running the same operation multiple times yields the same result.
Pipelines should produce the same results
- Regardless of the day you run it
- Regardless of how many times you run it
- Regardless of the hour that you run it
In a non-idempotent pipeline, running the same pipeline multiple times can lead to inconsistent data. For example, if the pipeline is rerun (due to a failure or a scheduled backfill), it might produce different results, or duplicate records. This happens because when the pipeline fails, some data may have already been processed and added, and when it’s rerun, the same data might be processed again, leading to duplicates

Problem: If you insert data into a table without clearing previous data, each pipeline run will add duplicate records. For instance, if you're processing daily sales data and using an INSERT INTO without cleaning the table, every time the pipeline runs, new rows will be added without removing the old data, resulting in multiple entries for the same record.
Solution: Use MERGE or INSERT OVERWRITE strategies.
- MERGE: This merges new data with existing data, checking if records already exist and avoiding duplicates.
- INSERT OVERWRITE: Overwrites the data in specific partitions, ensuring consistency across all pipeline runs.

Problem: Using filters like WHERE date > today without a corresponding end_date causes your pipeline to process an ever-growing set of data every time it runs. This results in missing or duplicated data and can even cause out-of-memory errors during backfilling.
- how :When you use a filter like WHERE date > today, the pipeline only looks for data that's after the current day. However, if you don't set an end date (a point where the data selection stops), the filter will keep picking up new data each time the pipeline runs, even data that was already processed before.
Solution: Always define a bounded time window. For example, use a filter like WHERE date > 'start_date' AND date <= 'end_date' to ensure you're processing only the intended time window of data.

Problem: For pipelines that depend on cumulative data (like daily totals), trying to run processes in parallel without considering the correct sequence can cause inconsistencies. For instance, if you're aggregating daily sales but the pipeline processes today’s data before yesterday’s, you might lose important intermediate totals.
Solution: Use sequential processing to ensure each day's data is processed in order. For example, a cumulative sales total should process yesterday’s data before today’s data to ensure the accuracy of daily aggregates.

An SCD is an attribute (dimension) in a dataset that changes over time but not rapidly or frequently

What it is: Only the current value of the dimension is stored.
Advantages: Simple and storage-efficient.
Disadvantages: Loses historical context. Backfilling can lead to inaccurate records.
Example: A customer’s current country of residence is stored, or overwritten to their old value so their past countries are lost.

What it is: A snapshot is taken daily, storing the value of the dimension for each day.
Advantages: Maintains full historical data.
Disadvantages: Higher storage costs. If dimensions rarely change, this approach can lead to redundant rows.
Example: Storing a customer’s country every day, even if it hasn’t changed for years.

What it is: Stores only the periods when a dimension changes, reducing redundancy.
Advantages: Efficient storage. Maintains historical context without storing unnecessary rows.
Disadvantages: Requires logic to identify and collapse unchanged rows.
Example: If a customer lived in the US from 2010–2015 and Canada from 2015 onward, you would store:
- US (2010–2015)
- Canada (2015–present)

SCD Type 0: Immutable dimensions (values never change).
SCD Type 1: Only keep the latest value. Overwrites previous records, losing historical information.
SCD Type 2: Maintain complete history with start and end dates. The gold standard for analytics.
SCD Type 3: Maintain only the original and latest values, losing intermediate changes.

Definition: Store only the current value, overwriting old ones.
Use Case: Transactional systems (OLTP) where historical data isn’t critical.
Why it’s problematic:
- Historical accuracy is lost.
- Pipelines lose idempotency, making backfill operations unreliable.
- Unsuitable for analytical contexts where historical trends matter.

Definition: Each change creates a new record with start_date and end_date to capture the validity range of that version.
Use Case: Analytical systems where maintaining a history of changes is crucial (e.g., tracking customer preferences over time).
Advantages:
- Maintains historical accuracy.
- Supports time-based queries (e.g., "What was the state on this date?").
- Allows backfill without corrupting old data.
Implementation Considerations:
- Use far-future dates (e.g., 9999-12-31) or NULL for end_date of the current record.
- Add a Boolean is_current column for easier identification of active records.

Definition: Only store the original and current values, along with optional timestamps.
Use Case: Situations where tracking intermediate changes isn’t necessary, and simplicity is preferred.
Advantages:
- Simple table structure (single row per entity).
- Easier querying compared to Type 2.
Disadvantages:
- Fails when dimensions change multiple times.
- Loses detailed historical context.
- Not idempotent for backfill operations.

Definition: Process all historical data at once, overwriting the table each time.
Advantages:
- Easier to implement for small datasets.
- Ensures consistency across the entire dataset.
Drawbacks:
- Computationally expensive for large datasets.

Definition: Process only new or changed data since the last update.
Advantages:
- More efficient for large datasets.
- Reduces processing time and computational resources.
Challenges:
- Requires robust change detection mechanisms.