dbt Tests - iff133/first GitHub Wiki

Testing

  • dbt models are only useful if they're correct
  • dbt makes it easy to define automated tests on your models
  • After writing these tests once, you can quickly run them when developing locally, or in production on an ongoing basis. By testing your models, you can build confidence in their correctness and avoid costly regressions down the line.
  • dbt provides two different mechanisms for data validation: schema tests and custom data tests

Schema Tests

  • Schema tests are assertions that a model's schema adheres to basic rules: referential integrity, uniqueness, and nullity, for instance
  • Modern data warehouses don't enforce these rules, but they're still incredibly useful tools for reasoning about and validating data in a database

Writing Schema Tests

Built-in tests

Not null
  • This test validates that there are no null values present in a column
version: 2
models:
  - name: people
    columns:
      - name: id
        tests:
          - not_null
              
      - name: account_id
        tests:
          - not_null
              
      - name: status
        tests:
          - not_null
Unique
  • This test validates that there are no duplicate values present in a field
version: 2
models:
  - name: people
    columns:
      - name: id
        tests:
          - unique
Relationships
  • This test validates that all of the records in a child table have a corresponding record in a parent table. This property is referred to as "referential integrity"
  • The following example tests that every person's account_id maps back to a valid account
version: 2
models:
  - name: people
    columns:
      - name: account_id
        tests:
          - relationships:
              to: ref('accounts')
              field: id
  • The to argument accepts a Relation – this means you can pass it a ref to a model (e.g. ref('accounts')), or a source (e.g. source('payroll', 'accounts'))
Accepted values
  • This test validates that all of the values in a column are present in a supplied list of values. If any values other than those provided in the list are present, then the test will fail
  • The accepted_values test supports an optional quote parameter which by default will single-quote the list of accepted values in the test query. To test non-strings (like integers or boolean values) explicitly set the quote config to false
version: 2
models:
  - name: people
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'cancelled']
              
      - name: status_id
        tests:
          - accepted_values:
              values: [1, 2]
              quote: false
Putting it all together
  • The tests above were shown in separate files for clarity, but in practice, they should all be combined into a single file. This combined schema.yml file would look like this:
version: 2
models:
  - name: people
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: account_id
        tests:
          - not_null:
              severity: warn
          - relationships:
              to: ref('accounts')
              field: id
              severity: error
            
      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ['active', 'cancelled']
              severity: warn

Testing complicated logic

Testing Expressions
  • Sometimes, a constraint only holds for a combination of columns. For example, maybe the combination of user_id and date is unique, but neither column is unique in isolation
  • To test expressions like this, specify the tests directive at the model level, and use a column_name key to define the expression, as shown below.
version: 2
models:
  - name: people
    tests:
      - unique:
          column_name: "concat(date_day, user_id)"
      
      - not_null:
          column_name: "coalesce(status, new_status)"
      
      - accepted_values:
          column_name: "coalesce(status, new_status)"
          values: ['active', 'cancelled']
        
      - relationships:
          column_name: "coalesce(account_id, parent_account_id)"
          to: ref('accounts')
          field: id
  • It is recommended that users specify tests for as many constraints as can be reasonably identified in their database. This may result in a large number of total tests, but schema.yml makes it fast to create and modify these tests. The presence of tests like these can significantly increase the confidence in underlying data consistency in a database.

Custom data tests

  • Not all error conditions can be expressed in a schema test. For this reason, dbt provides a mechanism for testing arbitrary assertions about your data.
  • These data tests are sql SELECT statements that return 0 rows on success, or more than 0 rows on failure.
  • A typical data test might look like:
tests/assert_lt_5_pct_cookie_ids_are_null.sql
-- If >= 5% of cookie_ids are null, then the test returns 1 row (failure).
-- If < 5% of cookie_ids are null, then the test returns 0 rows (success)
with calc as (
    select
        case
            when cookie_id is null then 1
            else 0
        end as cookie_is_null
  
    from {{ ref('events') }}
  
),
agg as (
  
    select
        sum(cookie_is_null)::float / nullif(count(*), 0) as pct_null
    from calc
)
select *
from agg
where pct_null >= 0.05
  • Another example would be asserting that a particular column on a certain row is what you expect:
tests/assert_payment_has_expected_amount.sql
with data as (
  
  select *
  from {{ ref('payment_items') }}
  where payment_id = 777
  
),
validation as (
  select *
  from data
  where amount != 120
)
select * from validation
  • Custom data tests and model selection syntax To run custom data tests for a particular model, use the model selection syntax: $ dbt test --data --models my_model
  • Note that dependencies between custom data tests and models are defined by using the ref function in your data test, e.g. from {{ ref('my_model') }}
  • Enabling custom data tests: To enable data tests, add the test-paths config to your dbt_project.yml file:
dbt_project.yml
test-paths: ["tests"]        # look for *.sql files in the "tests" directory

Test Severity

  • The "severity" of a test can be configured by supplying the severity configuration option in the test specification. The severity option can be one of warn or error
  • If warn is supplied, then dbt will log a warning for any failing tests, but the test will still be considered passing. This configuration is useful for tests in which a failure does not imply that action is required
  • If a severity level is not provided, then tests are run with the error severity level. The severity config can be applied to any test type, including custom schema tests
Schema test severity
  • The following example exercises both warn and error level severities in a schema.yml file
models/schema.yml
version: 2
models:
  - name: people
    columns:
      - name: email
        tests:
          - not_null:
              severity: warn
          - unique:
              severity: error
Custom data test severity
  • The severity for a custom data test can be configured using the config function:
tests/my_test.sql
{{ config(severity='warn') }}
select *
from ....

Running Tests

  • Schema tests and custom data tests can be run with the dbt test command as shown below
dbt test           # run schema + data tests
dbt test --schema  # run only schema tests
dbt test --data    # run only data tests

Custom Schema Tests

https://discourse.getdbt.com/t/examples-of-custom-schema-tests/181

References

https://docs.getdbt.com/docs/building-a-dbt-project/testing-and-documentation/testing/
https://medium.com/@josh.temple/automated-testing-in-the-modern-data-warehouse-d5a251a866af