dbt Best Practices - iff133/first GitHub Wiki
All dbt projects should be managed in version control. Git branches should be created to manage development of new features and bug fixes. All code changes should be reviewed by a colleague (or yourself) in a Pull Request prior to merging into master.
dbt makes it easy to maintain separate production and development environments through the use of target within a profile.
We recommend using a dev target when running dbt from your command line, and only running against a prod target when running from a production deployment. You can read more about managing environments.
SQL styles, field naming conventions, and other rules for your dbt project should be codified, especially on projects where multiple dbt users are writing code.
- Model configuration
- dbt conventions
- Testing
- Naming and field conventions
- CTEs
- SQL style guide
- YAML style guide
- Jinja style guide
https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md
The ref function is what makes dbt so powerful! Using the ref function allows dbt to infer dependencies, ensuring that models are built in the correct order.
It also ensures that your current model selects from upstream tables and views in the same environment that you're working in.
Always use the ref function when selecting from another model, rather than using the direct relation reference (e.g. my_schema.my_table).
Your dbt project will depend on raw data stored in your database. Since this data is normally loaded by third parties, the structure of it can change over time – tables and columns may be added, removed, or renamed. When this happens, it is easier to update models if raw data is only referenced in one place.
Using sources for raw data references
- As of v0.13.0, we recommend defining your raw data as sources, and selecting from the source rather than using the direct relation reference. Our dbt projects no longer contain any direct relation references in any models.
- https://docs.getdbt.com/docs/building-a-dbt-project/using-sources/
Raw data is generally stored in a source-conformed structure, that is, following the schema and naming conventions that the source defines. Not only will this structure differ between different sources, it is also likely to differ from the naming conventions you wish to use for analytics.
The first layer of transformations in a dbt project should:
- Select from only one source
- Rename fields and tables to fit the conventions you wish to use within your project, for example, ensuring all timestamps are named
<event>_at
. These conventions should be declared in your project coding conventions (see above). - Recast fields into the correct data type, for example, changing dates into UTC and prices into dollar amounts.
All subsequent data models should be built on top of these models, reducing the amount of duplicated code.
Materializations determine the way models are built through configuration. As a general rules:
- Views are faster to build, but slower to query compared to tables.
- Incremental models provide the same query performance as tables, are faster to build compared to the table materialization, however they introduce complexity into a project.
- https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations/
We often:
- Use views by default
- Use ephemeral models for lightweight transformations that shouldn't be exposed to end-users
- Use tables for models that are queried by BI tools
- Use tables for models that have multiple descendants
- Use incremental models when the build time for table models exceeds an acceptable threshold
Use grant statements from hooks to ensure that permissions are applied to the objects created by dbt. By codifying these grant statements in hooks, you can version control and repeatably apply these permissions.
https://discourse.getdbt.com/t/the-exact-grant-statements-we-use-in-a-dbt-project/430
https://docs.getdbt.com/docs/building-a-dbt-project/hooks/
When modeling data, we frequently find there are two stages:
-
Source-centric transformations to transform data from different sources into a consistent structure, for example, re-aliasing and recasting columns, or unioning, joining or deduplicating source data to ensure your model has the correct grain; and
-
Business-centric transformations that transform data into models that represent entities and processes relevant to your business, or implement business definitions in SQL.
We find it most useful to separate these two types of transformations into different models, to make the distinction between source-centric and business-centric logic clear.
https://docs.getdbt.com/docs/guides/best-practices/#best-practice-workflows