DBT

dbt is only a transformation tool; it does not handle data movement outside the data store it connects to. So when you consider extract-transform-load (ETL) and extract-load-transform (ELT) designs, dbt covers only the transform component of these patterns.

  • DBT Cloud, a hosted offering from dbt Labs.
  • DBT Core, an open source CLI (Command Line Interface) tool that you can set up in your own managed environments.

Structure of a DBT Project

root/
├─ analyses/
├─ dbt_packages/
├─ logs/
├─ macros/
├─ models/
│  ├─ example/
│  │  ├─ schema.yml
│  │  ├─ my_second_dbt_model.sql
│  │  └─ my_first_dbt_model.sql
├─ seeds/
├─ snapshots/
├─ target/
├─ tests/
├─ .gitignore
├─ dbt_project.yml
└─ README.md

  • Analyses folder is a place to store queries that need to be saved but aren’t necessarily part of your data model. dbt compiles everything stored in this directory, but does not execute it as part of your run or build. View this as a sandbox area where analysts can archive, create, and share queries used outside the core model while still taking advantage of source control and the ability to reference other models.
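
For example, an ad hoc query kept under analyses/ can use ref() like any model and is turned into runnable SQL by dbt compile, without ever being executed by dbt run. The file name and the fct_orders model referenced below are illustrative only.

-- analyses/top_customers.sql (illustrative)
select
    customer_id,
    sum(amount) as total_amount
from {{ ref('fct_orders') }}
group by customer_id
order by total_amount desc
limit 10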

  • dbt_packages folder is where installed packages materialize in your project. You add an entry to your packages.yml file to include a package and then run the dbt deps command to materialize it. Once you do that, the models and macros built into the package become usable in your project. There are packages that contain macros (a.k.a. functions) to simplify transformations that may require complicated SQL, and there are packages built for specific datasets that model the data for you. You can look for them on GitHub or in the dbt Slack community, or check https://hub.getdbt.com/ for the list acknowledged by dbt Labs (see the packages.yml section below).

  • Logs folder is a directory within your dbt project used to store log files. Log files contain information about the progress and results of your dbt runs, including any errors or warnings, and can be helpful when troubleshooting failures.

  • Macros folder stores macros. Macros are pieces of code that can be reused multiple times throughout your project. They are analogous to functions in other programming languages and are written using Jinja.
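
As a minimal sketch, a macro is defined in a .sql file under macros/ and then called from models with Jinja. The macro name and logic below are illustrative (not from any package), and the ::numeric cast assumes a warehouse that supports that syntax, such as Postgres or Snowflake.

-- macros/cents_to_dollars.sql (illustrative)
{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}

-- called from a model:
-- select {{ cents_to_dollars('amount_cents') }} as amount_dollars from {{ ref('stg_payments') }}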

  • Models folder is the main location for your data transformations in your dbt project and is where you will spend most of your time. Inside this folder, you create individual SQL files, each containing the logic for one step of the process that transforms raw data into your final data state. You will usually have several subdirectories within this folder for parts of the transformation process, such as staging, intermediate, and marts.
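
A model is just a select statement saved as a .sql file; dbt materializes it as a view or table, and ref() wires models together into a dependency graph. The file, source, and column names below are illustrative, and the source() call assumes a jaffle_shop source has been declared in a sources YAML file.

-- models/staging/stg_customers.sql (illustrative)
select
    id as customer_id,
    first_name,
    last_name
from {{ source('jaffle_shop', 'customers') }}

-- models/marts/dim_customers.sql (illustrative)
select
    customer_id,
    first_name || ' ' || last_name as full_name
from {{ ref('stg_customers') }}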

  • Seeds folder is a directory within your dbt project that contains seed data. Seed data is data that can be loaded into your target database (with the dbt seed command) before dbt runs any of your models. It is typically used to populate lookup tables or to provide reference data for your analytics, and should only be used for small sets of static data that may not live in a table in a source database but still need to be referenced. For example, you may have a CSV file containing a list of countries and their country codes that you need to reference throughout your models. This list is small and unlikely to change often, so you could store it in a seed file rather than in your source database.
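
As an illustration of that example, the seed could be a small CSV stored as seeds/country_codes.csv (the file name and contents here are made up):

country_name,country_code
Poland,PL
Germany,DE
France,FR

After dbt seed loads it into the warehouse, any model can reference it like a regular model with select * from {{ ref('country_codes') }}.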

  • Snapshots folder is a directory within your dbt project used to store snapshots. Because of their uniqueness, snapshots are the only type of dbt transformation broken out into their own folder. A snapshot is a copy of the data in your source database at a specific point in time. Snapshots are a feature of dbt used to track changes to your data over time and to ensure that your models are based on accurate and up-to-date data. They very closely mirror a Type-2 slowly changing dimension, with valid-from and valid-to dates that let you retain the history of records and capture point-in-time data values.
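
A minimal snapshot sketch, using the timestamp strategy against an assumed orders source that has id and updated_at columns; dbt adds dbt_valid_from and dbt_valid_to columns that implement the Type-2 behavior described above. It would live in snapshots/orders_snapshot.sql and be built with the dbt snapshot command.

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}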

  • Target folder serves a few purposes, but the main two are to store metadata files and compiled SQL. With nearly every command execution, dbt generates and saves one or more artifacts in this directory. Several of these are JSON files (manifest.json, catalog.json, run_results.json) that serve many different purposes, including generating documentation, performing state comparisons, change tracking, project monitoring, and more.

  • Tests folder is a directory within your dbt project used to store test files. Test files contain SQL queries that check the accuracy and integrity of your model results. dbt ships with several common tests that only require you to reference them in your YAML files; however, you can also build custom tests to support your specific business case.
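
As a sketch, the built-in generic tests are attached to columns in a YAML file, while a singular test is a SQL file under tests/ that fails if it returns any rows. The model and column names below are illustrative.

models/schema.yml (generic tests):

version: 2
models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

tests/assert_no_negative_order_totals.sql (singular test, passes when it returns zero rows):

select order_id, total_amount
from {{ ref('fct_orders') }}
where total_amount < 0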

dbt_project.yml

This file must be in the root of the project, and it is the main configuration file for your project, containing the information dbt needs to operate properly. Project default configurations are stored here, and all objects inherit from them unless overridden at the model level.

  • name: [Mandatory] the name of the DBT project. We recommend changing this configuration to your own project name; remember to change it in the models section of dbt_project.yml as well.
  • version: [Mandatory] the version of your project;
  • config-version: [Mandatory] version 2 is the currently available version;
  • profile: [Mandatory] the profile DBT uses to connect to your data platform;
  • [folder]-paths: [Optional] where [folder] is one of the folders in the DBT project, such as model, seed, test, analysis, macro, snapshot, or log. For example, model-paths states the directory of your models and sources, macro-paths is where your macro code lives, and so on;
  • target-path: [Optional] the path where compiled SQL files will be stored;
  • clean-targets: [Optional] list of directories containing artifacts to be removed by the dbt clean command;
  • models: [Optional] default configurations for your models, for example:
models:
  dbt_analytics_engineer_book:
    staging:
      materialized: view
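
Putting these keys together, a minimal dbt_project.yml might look like the following sketch; the project and profile names are placeholders.

name: dbt_analytics_engineer_book
version: '1.0.0'
config-version: 2

profile: dbt_analytics_engineer_book

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  dbt_analytics_engineer_book:
    staging:
      materialized: view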

packages.yml

Packages are standalone DBT projects that tackle specific problems and can be reused and shared across organizations. They are projects with models and macros; by adding them to your project, those models and macros become part of it. To access those packages, you first need to define them in the packages.yml file. The detailed steps are as follows:

  • First, you must ensure that the packages.yml file is in your DBT project. If not, please create it at the same level as your dbt_project.yml file;
  • Inside the DBT packages.yml file, define the packages you want to have available for use inside your DBT project. You can install packages from sources like DBT hub, Git repositories, such as GitHub or GitLab, or even packages you have stored locally.
  • Finally, run dbt deps to install the defined packages. Unless you configure it differently, those packages are installed by default in the dbt_packages directory. The following packages.yml shows the syntax used to install a package from the DBT hub, from Git, or locally.
packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 0.9.2
  - local: /opt/dbt/bigquery

DBT test

DBT Tools

Package repository for dbt: https://hub.getdbt.com/

dbt Cloud