dbt Snapshots - iff133/first GitHub Wiki

Track data changes with dbt snapshots

What are Snapshots?

  • Snapshots implement type-2 Slowly Changing Dimensions (SCDs) over mutable source tables. SCDs identify how a row in a table changes over time.
  • Snapshots make it possible to generate a stream of record changes from a source table, turning mutable data into an immutable change history.
  • dbt creates this Snapshot table by copying the structure of your source table, and then adding some helpful metadata fields.
    • The dbt_valid_from and dbt_valid_to columns indicate the historical state for a given record.
    • The current value for a row is represented with a null value for dbt_valid_to.
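Because the current version of a record is the one with a null dbt_valid_to, querying the latest state of a snapshot is straightforward. A sketch, using the orders_snapshot table referenced later in this page:

```sql
-- Current state of each record: the open-ended validity window
select *
from analytics.snapshots.orders_snapshot
where dbt_valid_to is null
```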

Snapshot meta-fields

Snapshot tables will be created as a clone of your source dataset, plus some additional meta-fields:

  • dbt_valid_from — the timestamp when this version of the record became valid
  • dbt_valid_to — the timestamp when this version stopped being valid (null for the current version)
  • dbt_scd_id — a unique key generated for each snapshotted record, used internally by dbt
  • dbt_updated_at — the updated_at timestamp of the source record when this snapshot row was inserted

How do dbt snapshots work?

  • Snapshots are simple select statements which define a dataset
  • Every time you run the dbt snapshot command, dbt will run your select statement and check if the dataset has changed compared to its last known state
  • If the records have changed, then dbt will create a new entry in the snapshot table for the new state of the record. If there are net-new records in the source dataset, then dbt will create an initial entry for the record.
  • Behind the scenes, dbt runs all the appropriate DDL and DML to insert, update, and merge records into your snapshot table. If new columns appear in your source query, then dbt will add these columns to your snapshot table.
  • The particular mechanism for tracking changes that dbt uses is often referred to as Type 2 Slowly Changing Dimensions.
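Conceptually, the DML dbt generates on each snapshot run resembles the following sketch. This is heavily simplified — the real statements are adapter-specific and generated by dbt — and the table, column, and query names here are placeholders:

```sql
-- 1. Close out the current version of records whose source row has changed
update snapshot_table as s
set dbt_valid_to = current_timestamp
where s.dbt_valid_to is null
  and exists (
    select 1
    from source_query as q
    where q.id = s.id
      and q.updated_at > s.dbt_updated_at
  );

-- 2. Insert a new open-ended version for changed and net-new records
--    (the real statement also populates dbt_scd_id and dbt_updated_at)
insert into snapshot_table
select q.*,
       current_timestamp as dbt_valid_from,
       null              as dbt_valid_to
from source_query as q
left join snapshot_table as s
  on q.id = s.id
 and s.dbt_valid_to is null
where s.id is null
   or q.updated_at > s.dbt_updated_at;
```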

What should I snapshot?

  • Snapshots, like models, are defined as select queries.

  • You can use source and ref in these queries like you would in any model query.

  • With this flexibility you can snapshot effectively any part of your dbt DAG.

  • Snapshots should almost always be run against source tables

    • Important: snapshot your source data in its raw form and use downstream models to clean up the data.
    • Your models should then select from these snapshots, using the ref function.


  • Why run snapshots against source tables? Why not snapshot the results of models?
    • Snapshot on Sources:

      • By snapshotting your sources, you can maximize the amount of data that you track, and in turn, maximize your modeling optionality.
      • Snapshots, by their very nature, are not idempotent.
      • The results of a snapshot operation will vary depending on whether you run dbt snapshot once per hour or once per day.
      • Further, there’s no way to go back in time and re-snapshot historical data.
      • Once a source record has been mutated, the previous state of that record is effectively lost forever.
    • Snapshot on Models:

      • A snapshot that runs on top of a model will record changes to “transformed” data.
      • If the logic in a model upstream of a snapshot changes, then the snapshot operation will record different data.
      • In some cases this can be desirable, but for typical modeling use cases, it presents an opportunity for data loss.
      • Logic errors in data models are inevitable, but dbt works hard to make sure that you can fix these bugs and rerun your dbt project. This isn’t the case if you have a snapshot in the middle of your DAG: that bug will likely result in data loss.
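A downstream model can then select from the snapshot using ref and apply its transformations there. A sketch, assuming a hypothetical orders_snapshot snapshot with id, status, and updated_at columns:

```sql
-- models/stg_orders.sql: clean up the raw snapshot downstream,
-- keeping only the current version of each record
select
    id as order_id,
    status,
    updated_at
from {{ ref('orders_snapshot') }}
where dbt_valid_to is null
```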

Getting started with snapshots

  • The usefulness of snapshots relies on them being run regularly and reliably
  • Once you’ve added snapshots to your project, make sure you have a way to run the dbt snapshot command on a schedule (it's a good idea to schedule this job separately from your dbt run).
  • Further, ensure that you have monitoring and alerting set up to find out quickly if something goes wrong. We use dbt Cloud to schedule our snapshots and leverage Cloud’s built-in email and Slack notifications for alerting.

How do I add snapshots to my dbt project?

  • Snapshots are defined in .sql files using snapshot blocks.
  • dbt will look for snapshot blocks in the paths defined by the snapshot-paths config in your dbt_project.yml file.
    • An example snapshot block might look like this:

[image: example snapshot block]
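A sketch of such a block, assuming a hypothetical jaffle_shop source with an orders table that has id and updated_at columns:

```sql
{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
    )
}}

-- Snapshot the source in its raw form; clean up in downstream models
select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
```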

  • Snapshot blocks make it possible to snapshot a query. In general, it is most useful to snapshot a select statement over a single table. If you're thinking about snapshotting a join or a union, consider instead snapshotting each table separately, then re-combining them in a model.

  • The snapshot table generated by dbt will be created with the target_database, target_schema, and name defined in the snapshot block. The snapshot block shown above (orders_snapshot) will be rendered into a table called analytics.snapshots.orders_snapshot.

Tracking data changes

dbt ships with two different strategies for determining if rows have changed in a table.

In both cases, dbt relies on a provided unique_key to match rows in your snapshot query to rows in your snapshot table. Each strategy is examined in further detail below.

  • Timestamp (preferred)
    • The timestamp strategy uses an updated_at field to determine if a row has changed.
    • When you run the dbt snapshot command, dbt checks whether the updated_at column for a row is more recent than the last time the snapshot ran.
      • If it is, then dbt will invalidate the old record and insert the new one.
      • If the timestamps are unchanged, then dbt will not take any action, as the row has presumably not changed.
      • Example usage:

[image: timestamp strategy example]
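A sketch of a timestamp-strategy config inside a snapshot block (the unique key and timestamp column names here are hypothetical):

```sql
{{
    config(
      strategy='timestamp',
      unique_key='id',
      updated_at='updated_at',
    )
}}
```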

  • Check
    • Some data sources do not include a reliable timestamp that indicates when a record has changed.
    • If this is the case for your data, you can instead use the check strategy.
    • This strategy works by comparing the values of a list of columns between their current and snapshotted values.
    • If any of these column values have changed, then dbt will invalidate the old record and insert the new one. If the column values are identical, then dbt will not take any action.
    • Example usage:

[image: check strategy example]
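A sketch of a check-strategy snapshot block, assuming a hypothetical orders source whose status and is_cancelled columns should be compared. check_cols can also be set to the string all to compare every column:

```sql
{% snapshot orders_snapshot_check %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='check',
      check_cols=['status', 'is_cancelled'],
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
```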

Configuring snapshots

Configuration reference

The key configurations for snapshots are:

  • unique_key (required) — a column (or expression) that uniquely identifies rows in the source query
  • strategy (required) — the change-tracking strategy to use: timestamp or check
  • updated_at — the timestamp column to compare; required for the timestamp strategy
  • check_cols — the list of columns to compare, or all; required for the check strategy
  • target_schema (required) — the schema dbt should build the snapshot table in
  • target_database — optionally, the database for the snapshot table

Configuring snapshots in dbt_project.yml

  • The snapshots in your dbt project can be configured using the snapshots: key in your dbt_project.yml file. This configuration is analogous to the models: config described in the Projects documentation.
  • Example usage:

[image: snapshot configuration in dbt_project.yml]
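A sketch of project-level snapshot configuration (the project name my_project is hypothetical, and per-snapshot configs in the .sql files override these):

```yaml
# dbt_project.yml
snapshots:
  my_project:
    +target_schema: snapshots
    +strategy: timestamp
    +unique_key: id
    +updated_at: updated_at
```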

Configuration best practices

Use the timestamp strategy where possible

This strategy handles column additions and deletions better than the check strategy.

Ensure your unique key is really unique

The unique key is used by dbt to match rows up, so it's extremely important to make sure this key is actually unique! If you're snapshotting a source, I'd recommend adding a uniqueness test to your source (example).
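A uniqueness test on a source can be declared in the source's properties file. A sketch, with hypothetical source and column names:

```yaml
# models/sources.yml
version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
```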

Use a target_schema that is separate from your analytics schema

  • Snapshots cannot be rebuilt.
  • As such, it's a good idea to put snapshots in a separate schema so end users know they are special.
  • From there, you may want to set different privileges on your snapshots compared to your models, and even run them as a different user (or role, depending on your warehouse) to make it very difficult to drop a snapshot unless you really want to.

Writing snapshot queries

Snapshot query best practices

With regards to the specific query to write in your snapshot, we recommend that you:

  • Snapshot source data. Your models should then select from these snapshots, treating them like regular data sources. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data
  • Ensure your unique key is really unique.
  • Use the source function in your query. This helps when understanding data lineage in your project.
  • Include as many columns as possible. In fact, go for select * if performance permits! Even if a column doesn't feel useful at the moment, it might be better to snapshot it in case it becomes useful – after all, you won't be able to recreate the column later.
  • Avoid joins in your snapshot query. Joins can make it difficult to build a reliable updated_at timestamp. Instead, snapshot the two tables separately, and join them in downstream models.
  • Limit the amount of transformation in your query. If you apply business logic in a snapshot query, and this logic changes in the future, it can be impossible (or, at least, very difficult) to apply the change in logic to your snapshots.

Basically – keep your query as simple as possible! Some reasonable exceptions to these recommendations include:

  • Selecting specific columns if the table is wide.
  • Doing light transformation to get data into a reasonable shape, for example, unpacking a JSON blob to flatten your source data into columns.
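For example, unpacking a JSON payload inside the snapshot query might look like the following sketch. The extraction syntax shown is Postgres-style (->>) and the source and column names are hypothetical; adjust for your warehouse:

```sql
-- Light transformation only: flatten the JSON blob into columns,
-- leave business logic to downstream models
select
    id,
    payload ->> 'status'      as status,
    payload ->> 'customer_id' as customer_id,
    updated_at
from {{ source('jaffle_shop', 'orders') }}
```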

Running snapshots

  • Snapshots can be run using the dbt snapshot command. This command will find and run all of the snapshots defined in your project.
  • To run a subset of snapshots, use the --select flag, supplying a selector. For more information on resource selectors, consult the selection syntax.
  • Example usage:

[image: dbt snapshot command example]
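For example (assumes dbt is installed; the snapshot name is hypothetical):

```shell
# Run every snapshot in the project
dbt snapshot

# Run only one snapshot, using a selector
dbt snapshot --select orders_snapshot
```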

Scheduling your snapshots

  • Snapshots are a batch-based approach to change data capture.
  • The dbt snapshot command must be run on a schedule to ensure that changes to tables are actually recorded!
  • While individual use-cases may vary, snapshots are intended to be run between hourly and daily.
  • If you find yourself snapshotting more frequently than that, consider whether there is a more appropriate way to capture changes in your source data tables.
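If you are not using a scheduler like dbt Cloud, a plain cron entry is one option. A sketch, assuming dbt is on the cron user's PATH and the project lives at a hypothetical path:

```shell
# crontab entry: run snapshots at the top of every hour
0 * * * * cd /path/to/dbt_project && dbt snapshot >> /var/log/dbt_snapshot.log 2>&1
```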

Schema changes

  • When the schema of your source query changes, dbt will attempt to reconcile the schema change in the destination snapshot table. dbt does this by:

    1. Creating new columns from the source query in the destination table
    2. Expanding the size of string types where necessary (e.g. varchars on Redshift)
  • dbt will not delete columns in the destination snapshot table if they are removed from the source query. It will also not change the type of a column beyond expanding the size of varchar columns. That is, if a string column is changed to a date column in the snapshot source query, dbt will not attempt to change the type of the column in the destination table.

Treat snapshots like sources

  • dbt is built around the idea that all of your data transformations should be idempotent and completely rebuildable from scratch.
  • Snapshots break this paradigm due to the nature of the problem that they solve. Because snapshots capture changes in source tables, they need to be running constantly in order to record changes to mutable tables as they occur.
  • As such, it's typical to only have one snapshot table per data source for all dbt users, rather than one snapshot per user. In this way, snapshot tables are more similar to source tables than they are to proper dbt models.

Links:

Good overview diagram

[image: overview diagram]