DBT Model - ghdrako/doc_snipets GitHub Wiki

In dbt, models are a fundamental concept. A model, in this context, is a file composed of a SQL statement that processes data, applies transformations, and yields a resulting dataset.

Following a typical DBT data transformation flow, there will be three layers in our model’s directory: staging, intermediate, and marts.

  • staging layer: our initial modular building blocks. In this layer, we have the models responsible for extracting the data from our sources. Typically here, we have already some data cleaning, data standardization, and minor transformations;
  • intermediate layer: models between the staging layer and the mart’s layer. Built on top of our staging models, and are used to have extensive data transformations, as well as data consolidation from multiple sources, creating varied intermediate tables that will serve distinct purposes;
  • marts layer: Depending on your data modeling technique, marts brings together all modular pieces to give a broader vision of the entities your company cares about. If, for example, we choose a dimensional modeling technique, inside marts layers is where your fact - occurrences that keep happening over time, such as orders, page clicks, or inventory changes, with their respective measures - and dimensions - attributes, such as customers, products, and geography, that can describe those facts - tables are. Marts can be described as subsets of data inside your data platform, oriented to specific domains or departments, such as finance, marketing, logistics, customer service, etc.

dbt’s convention is to create a subfolder for the staging, intermediate, and marts layers in models folder.

Defining sources

dbt run
dbt run --full-refresh # need in BigQuery

The four different model types:

  • Table
  • View (default)
  • Ephemeral
  • Incremental

Ephemeral models are essentially just CTEs (common table expressions) in SQL. They do not exist anywhere on the database and can only be referenced within dbt.

Incremental models materialize as a table; however, they are configured to allow dbt to insert or update records since the last time dbt ran. This is useful when you are working with large volumes of data and you don’t want to have to reload it every time. You are still limited to only writing SELECT statements; however, dbt compiles that into the commands that need to be run. In other systems, you would need to write the UPSERT or MERGE logic, but dbt makes it simple for you.

By default, DBT materializes your models, inside your data platform, as views. The same applies to models that are materialized as views. Just change table to view. It might be redundant, but it allows you to give transparency and visibility of the actual intentions. Example 2-12. Materialization config inside the model file. Example stg_jaffle_shop_customers model.sql.

{{ config(
materialized='table'
) }}
SELECT
id as customer_id,
first_name,
last_name,
current_timestamp() as dp_load_date
FROM `dbt-tutorial.jaffle_shop.customers`

Example dbt_project.yml of how to materialize models as views and as tables.

models:
  dbt_analytics_engineer_book:
    staging:
      jaffle_shop:
        +materialized: view
      stripe:
        +materialized: table

change our dbt_project.yml to set all staging models to be materialized as views

models:
  dbt_analytics_engineer_book:
    staging:
      +materialized: view

Snapshots

Snapshots are sort of their own thing in dbt, but similar in ways to the materializations we just reviewed. Snapshots are materialized as a table and used when you need to track history over time. Essentially, these transformations implement a Type-2 slowly changing dimension (SCD). That means whenever a change is detected between the source and the target, a new row is created in the target rather than just overriding the existing value. As seen in the project structure overview earlier in this chapter, snapshots live separately from everything else since they have some unique qualities about them.