dbt General Information
The data in any of our projects has three distinct checkpoints:
- Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what Vanquis provided us with), loaded by pgloader into PostgreSQL.
- Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
- Mart models: Models that represent business processes and entities, abstracted from the data sources that they are based on.
Examples of what this looks like:
- Sources: Vanquis data loaded into PostgreSQL using pgloader.
- Staging models: Data are recast into a consistent shape, with consistent column names.
- Mart models: For example, a monthly recurring revenue (MRR) model that classifies revenue per customer per month as new revenue, upgrades, downgrades, and churn, to understand how a business is performing over time.
Important: the first two, sources and staging models, are source-centric, whereas mart models are business-centric (business logic, i.e. the ALP allocation logic).
- The goal of the staging layer is to create staging models
- Staging models take raw data, and clean and prepare them for further analysis. For a user querying the data warehouse, a relation with a stg_ prefix indicates that:
- Fields have been renamed and recast in a consistent way.
- Datatypes, such as timezones, are consistent.
- Light cleansing, such as replacing empty strings with NULL values, has occurred.
- If useful, flattening of objects might have occurred.
- There is a primary key that is both unique and not null (and tested).
Staging models can have joins in them to field additional columns for context or enrichment; add rows through unions and remove them through filters; deduplicate a natural key or hash together a surrogate one.
For ALP we will work with multiple data sources, so in our staging directory, we create one directory per source.
Inside the staging directory we will have a directory for each CSV received: allcalls_2, applicationdata, codaledger, i2bsbasesegment2, i2bsdaily, i2gtgeneratedtransactions, i2ptpostedtransactions, I2SBstatementbasesegment, ifrs9 and rptcd1750.
Example:
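A sketch of how this might look on disk, taking applicationdata and codaledger as examples (the object names after the source prefix are illustrative, not the real ALP objects):

```
models/
└── staging/
    ├── applicationdata/
    │   ├── src_applicationdata.yml
    │   ├── stg_applicationdata.yml
    │   └── stg_applicationdata_applications.sql
    ├── codaledger/
    │   ├── src_codaledger.yml
    │   ├── stg_codaledger.yml
    │   └── stg_codaledger_ledger_entries.sql
    └── ...
```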
Each staging directory contains at a minimum:
- One staging model for each object that is useful for analytics:
  - Named stg_<source>_<object>
  - Generally materialized as a view (unless performance requires it as a table).
- A src_<source>.yml file which contains:
  - Source definitions, tests, and documentation (https://docs.getdbt.com/docs/building-a-dbt-project/using-sources/)
- A stg_<source>.yml file which contains:
  - Tests and documentation for models in the same directory (https://docs.getdbt.com/docs/building-a-dbt-project/testing-and-documentation/); a sketch of both files follows the note below
Note: Some dbt users prefer to have one .yml file per model (e.g. stg_braintree__customers.yml). This is a completely reasonable choice, and we recommend implementing it if your .yml files start to become unwieldy.
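As a rough sketch of the src_<source>.yml and stg_<source>.yml files above, using a hypothetical applicationdata source and illustrative table and column names (not the real ALP schema):

```yaml
# src_applicationdata.yml -- source definitions, tests, and documentation
version: 2

sources:
  - name: applicationdata
    description: Raw application data received from Vanquis, loaded by pgloader.
    tables:
      - name: applications
        description: One row per credit application, as provided.
        columns:
          - name: applicationid
            tests:
              - unique
              - not_null
```

and the matching stg_<source>.yml:

```yaml
# stg_applicationdata.yml -- tests and documentation for the staging models
version: 2

models:
  - name: stg_applicationdata_applications
    description: Applications renamed and recast into a consistent format.
    columns:
      - name: application_id
        tests:
          - unique
          - not_null
```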
Earlier versions of the dbt documentation recommended implementing “base models” as the first layer of transformation.
We realized that while the reasons behind this convention were valid, the naming was an opinion, so in our recent update to the best practices, we took the mention of base models out. Instead, we replaced it with the principles of “renaming and recasting once” and “limiting the dependencies on raw data”.
That being said, in our dbt projects every source flows through exactly one model of the following form:
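A minimal sketch of that form, assuming a hypothetical applicationdata source with illustrative column names:

```sql
with source as (

    -- limit the dependency on raw data to this one select
    select * from {{ source('applicationdata', 'applications') }}

),

renamed as (

    select
        -- rename and recast once, into a consistent format
        applicationid as application_id,
        cast(applicationdate as date) as application_date,

        -- light cleansing: replace empty strings with nulls
        nullif(trim(applicationstatus), '') as application_status

    from source

)

select * from renamed
```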
This is still a base transformation; if your source data is in good shape, this transformation may be all that’s required to build a staging model, and the staging model simply is this SQL.
However, building a staging model may warrant several models’ worth of cleaning, correcting, and categorizing, or may require a join or union to another source. So multiple base transformations might be required.
In our dbt projects, we place these base models in a nested base subdirectory.
In our projects, base models:
- Often use the ephemeral materialization, so they are not exposed to end users querying our warehouse (see the sketch below).
- Are tested in a base.yml file within the same directory as the base models.
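As a sketch, a base model might opt into the ephemeral materialization in its own config block (the file name is illustrative):

```sql
-- models/staging/applicationdata/base/base_applicationdata_applications.sql (illustrative)
-- ephemeral models are interpolated as CTEs into downstream models,
-- so nothing is built in the warehouse for end users to query
{{ config(materialized='ephemeral') }}

select * from {{ source('applicationdata', 'applications') }}
```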
If we need additional transformations between base and staging models, we create a nested staging/<source>/intermediate directory for them.
- Marts are stores of models that describe business entities and processes.
- They are often grouped by business unit: marketing, finance, product. Models that are shared across an entire business are grouped in a core directory.
- While these are very niche and company-specific, the general goal is to build fact and dimension models that are abstracted from the source data they rely upon.
- fct_<verb>:
  - Tall/narrow table
  - Representing real-world processes that have occurred or are occurring
  - Immutable event stream (e.g. sessions, transactions, orders, stories, votes)
- dim_<noun>:
  - Wide/short table
  - Each row is a person, place, or thing
  - The ultimate source of truth when identifying and describing entities of the organization
  - Mutable but slowly changing (e.g. customers, products, candidates, buildings, employees)
- While the work of staging models is limited to cleaning and preparing, fact tables are the product of substantive data transformation: choosing (and reducing) dimensions, date-spining, executing business logic, and making informed, confident decisions.
- This layer of modeling is considerably more complex than creating staging models.
- The models we design are highly tailored to the analytical needs of an organization.
- However, some patterns have been found:
- fct_ and dim_ models should be materialized as tables within a warehouse to improve query performance. Default: table materialization; however, where performance requires it, incremental materialization should be used (see the config sketch after this list).
- Intermediate transformations required to get to a fact or dimension model are placed in a nested marts/<mart>/intermediate directory. They are named <useful_name>__<transformation_in_past_tense>.sql.
- The lack of a prefix and the use of double underscores indicate that these are intermediate models, not to be trusted; it may also be worth hiding these in a different schema.
- Models are tested and documented in a <dir_name>.yml file in the same directory as the models.
- Any extra documentation in a docs block is placed in a <dir_name>.md file in the same directory.
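A sketch of how the table-by-default materialization for marts might be set in dbt_project.yml (the project and directory names are illustrative):

```yaml
# dbt_project.yml excerpt (illustrative project/directory names)
models:
  alp:
    staging:
      +materialized: view
    marts:
      # fct_ and dim_ models default to tables for query performance
      +materialized: table
```

Where performance requires it, an individual model can still override this with {{ config(materialized='incremental') }} at the top of its SQL file.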
A marts directory may therefore end up looking like:
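For example (the business units and model names here are illustrative):

```
models/marts/
├── core/
│   ├── core.yml
│   ├── core.md
│   ├── dim_customers.sql
│   ├── fct_orders.sql
│   └── intermediate/
│       └── customers__unioned.sql
├── finance/
└── marketing/
```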
This entire project results in a DAG that runs from sources, through base and staging models, to marts models.
There are other kinds of SQL files that find their way into robust dbt projects. In addition to staging and marts, we find ourselves with model directories such as:
- utils: An all_days table. This is useful everywhere, though it never forms the basis for analysis/reporting (see the date spine sketch after this list).
- lookups: A user-mapping table, a zipcode-country table, etc. These are as likely to be CSV seeds as tables in a production database. You may reference them at several unpredictable points throughout modeling, and maybe even in a BI tool.
- admin: Audit logs, warehouse operations, Redshift maintenance, and incremental records of the miscellaneous DDL you run to make your project run smoothly.
- metrics: Precisely defined measurements taken from fact tables, directly conducive to time-series reporting, and tightly structured so as to allow one-to-one comparison with goals and forecasting. A metrics table lives downstream of dimension and fact tables in your DAG, and it deserves special status.
- Packages: While not a model folder within your main project, packages that include models (like our snowplow package) can be configured into custom schema and materialization patterns from dbt_project.yml.
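For example, the all_days model mentioned under utils is commonly generated with a date spine; a minimal sketch, assuming the dbt_utils package is installed and an arbitrary start date:

```sql
-- models/utils/all_days.sql (illustrative)
-- one row per calendar day between the start date and today
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2015-01-01' as date)",
    end_date="current_date"
) }}
```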
In projects where we find ourselves with these additional models, we often leverage custom schemas as directories in our warehouse, to logically group the models, choosing a schema name that matches the directory name in our dbt project.
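A sketch of that configuration in dbt_project.yml (directory names are illustrative; note that by default dbt prefixes the custom schema with the target schema):

```yaml
# dbt_project.yml excerpt (illustrative)
models:
  alp:
    utils:
      +schema: utils
    lookups:
      +schema: lookups
    metrics:
      +schema: metrics
```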
In this article, building the DAG for a dbt project has been described left to right: starting at sources, and ending with marts models.
However, it’s worth noting that in reality we often first think through a modeling problem from right to left:
- we start with an idea of the dashboard or report we want to build
- then whiteboard the structure of the marts model we need in our warehouse to power this dashboard.
- On the same whiteboard, we’ll often then work backwards until we reach our source, before we start writing any actual SQL.
- I’ve found that it’s only once I’ve solved a modeling problem a few times that I get an intuition for how to build a DAG from left to right. In other words: we tend to think about our destination before we start our modeling journey.
https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355