dbt Best Practices - iff133/first GitHub Wiki

Best practice workflows

Version Control

All dbt projects should be managed in version control. Git branches should be created to manage development of new features and bug fixes. All code changes should be reviewed by a colleague (or yourself) in a Pull Request prior to merging into master.

Use separate development and production environments

dbt makes it easy to maintain separate production and development environments through the use of target within a profile.

We recommend using a dev target when running dbt from your command line, and only running against a prod target when running from a production deployment. You can read more about managing environments.

Use a style guide and code conventions for your project

SQL styles, field naming conventions, and other rules for your dbt project should be codified, especially on projects where multiple dbt users are writing code.

Model configuration
dbt conventions
Testing
Naming and field conventions
CTEs
SQL style guide
YAML style guide
Jinja style guide

https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md

Best practices in dbt projects

Use the ref function

The ref function is what makes dbt so powerful! Using the ref function allows dbt to infer dependencies, ensuring that models are built in the correct order.

It also ensures that your current model selects from upstream tables and views in the same environment that you're working in.

Always use the ref function when selecting from another model, rather than using the direct relation reference (e.g. my_schema.my_table).

Limit references to raw data

Your dbt project will depend on raw data stored in your database. Since this data is normally loaded by third parties, the structure of it can change over time – tables and columns may be added, removed, or renamed. When this happens, it is easier to update models if raw data is only referenced in one place.

Using sources for raw data references

As of v0.13.0, we recommend defining your raw data as sources, and selecting from the source rather than using the direct relation reference. Our dbt projects no longer contain any direct relation references in any models.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources/

Rename and recast fields once (part of the of the base model - 1st layer of tranformations)

Raw data is generally stored in a source-conformed structure, that is, following the schema and naming conventions that the source defines. Not only will this structure differ between different sources, it is also likely to differ from the naming conventions you wish to use for analytics.

The first layer of transformations in a dbt project should:

Select from only one source
Rename fields and tables to fit the conventions you wish to use within your project, for example, ensuring all timestamps are named <event>_at. These conventions should be declared in your project coding conventions (see above).
Recast fields into the correct data type, for example, changing dates into UTC and prices into dollar amounts.

All subsequent data models should be built on top of these models, reducing the amount of duplicated code.

Break complex models up into smaller pieces

Group your models in directories

Add tests to your models

Consider the information architecture of your data warehouse

Choose your materializations wisely

Materializations determine the way models are built through configuration. As a general rules:

Views are faster to build, but slower to query compared to tables.
Incremental models provide the same query performance as tables, are faster to build compared to the table materialization, however they introduce complexity into a project.
https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations/

We often:

Use views by default
Use ephemeral models for lightweight transformations that shouldn't be exposed to end-users
Use tables for models that are queried by BI tools
Use tables for models that have multiple descendants
Use incremental models when the build time for table models exceeds an acceptable threshold

Pro-tips for workflows

Use the model selection syntax when running locally

Pro-tips for dbt Projects

Limit the data processed when in development

Use hooks to manage privileges on objects that dbt creates

Use grant statements from hooks to ensure that permissions are applied to the objects created by dbt. By codifying these grant statements in hooks, you can version control and repeatably apply these permissions.

https://discourse.getdbt.com/t/the-exact-grant-statements-we-use-in-a-dbt-project/430

https://docs.getdbt.com/docs/building-a-dbt-project/hooks/

Separate source-centric and business-centric transformations

When modeling data, we frequently find there are two stages:

Source-centric transformations to transform data from different sources into a consistent structure, for example, re-aliasing and recasting columns, or unioning, joining or deduplicating source data to ensure your model has the correct grain; and
Business-centric transformations that transform data into models that represent entities and processes relevant to your business, or implement business definitions in SQL.

We find it most useful to separate these two types of transformations into different models, to make the distinction between source-centric and business-centric logic clear.

Managing whitespace generated by Jinja

Detailed info for each section

https://docs.getdbt.com/docs/guides/best-practices/#best-practice-workflows