Standard Hub - ScalefreeCOM/datavault4dbt GitHub Wiki
This macro creates a standard Hub entity based on one or more stage models. It requires source models shaped like the output of the stage macro, so by default the stage models are used as the source models for Hubs. If a Hub is loaded from multiple sources, each source must have the same number of business key columns. Additionally, a multi-source Hub needs a rsrc_static attribute defined for each source.
Features:
- Loadable by multiple sources
- Supports multiple updates per batch and therefore initial loading
- Using a dynamic high-water-mark to optimize loading performance of multiple loads
- Allows source mappings for deviations between source column names and hub column names
Parameter | Data Type | Required | Default Value | Explanation |
---|---|---|---|---|
hashkey | string | mandatory | - | Name of the hashkey column inside the stage that should be used as the PK of the Hub. |
business_keys | string \| list of strings | mandatory | - | Name(s) of the business key columns that should be loaded into the Hub and are the input of the hashkey column. They need to be available inside the stage model. If the names differ between sources, define here what the business keys should be called in the final Hub model; the actual input column names are then defined per source inside the 'source_models' parameter. |
source_models | string \| list of dictionaries \| dictionary | mandatory | - | For a single source, just a string holding the name of the stage model is required. For multi-source Hubs, a list of dictionaries with information about each source is required. For more information see this wiki page! The inner dictionaries need to have 'name' as a key, and optionally the keys 'rsrc_static', 'hk_column' and 'bk_columns'. For further information about the rsrc_static attribute, please visit the following wiki page: rsrc_static Attribute |
disable_hwm | boolean | optional | False | Whether the automatic application of a High-Water Mark (HWM) should be disabled or not. |
src_ldts | string | optional | datavault4dbt.ldts_alias | Name of the ldts column inside the source models. Needs to be the same column name as the alias defined inside the staging model. |
src_rsrc | string | optional | datavault4dbt.rsrc_alias | Name of the rsrc column inside the source models. If not set, the global variable 'datavault4dbt.rsrc_alias' is used. Needs to be the same column name as the alias defined inside the staging model. |
```jinja
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hk_account_h'
business_keys:
    - account_key
    - account_number
source_models: stage_account
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.hub(hashkey=metadata_dict.get('hashkey'),
                     business_keys=metadata_dict.get('business_keys'),
                     source_models=metadata_dict.get('source_models')
                     ) }}
```
- hashkey: This hashkey column was created beforehand inside the corresponding staging area, using the stage macro.
- business_keys: This Hub has two business keys, which are both defined here. They need to match the input columns of the hashkey column.
- source_models: This creates a Hub loaded from only one source, which is not uncommon. It uses the model 'stage_account', and since no 'bk_columns' are specified, the same columns as defined in 'business_keys' are selected from the source.
    - The 'rsrc_static' attribute is not set, because it is not required for single-source entities. For more information see rsrc_static Attribute.
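
The optional parameters 'src_ldts' and 'src_rsrc' from the parameter table can be passed the same way as the mandatory ones. The following is a minimal sketch, assuming a stage model whose load date and record source columns were aliased as 'load_ts' and 'record_source'; both column names are hypothetical and only serve as an illustration.

```jinja
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hk_account_h'
business_keys:
    - account_key
    - account_number
source_models: stage_account
# Hypothetical column names, assumed to match the aliases used in the stage model
src_ldts: 'load_ts'
src_rsrc: 'record_source'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.hub(hashkey=metadata_dict.get('hashkey'),
                     business_keys=metadata_dict.get('business_keys'),
                     source_models=metadata_dict.get('source_models'),
                     src_ldts=metadata_dict.get('src_ldts'),
                     src_rsrc=metadata_dict.get('src_rsrc')
                     ) }}
```

If the stage model uses the globally configured aliases, both parameters can simply be left out, as in the example above.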
```jinja
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hk_account_h'
business_keys:
    - account_key
    - account_number
source_models:
    - name: stage_account
      rsrc_static: '*/SAP/Accounts/*'
    - name: stage_partner
      hk_column: 'hk_partner_h'
      bk_columns:
          - partner_key
          - partner_number
      rsrc_static: '*/SALESFORCE/Partners/*'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.hub(hashkey=metadata_dict.get('hashkey'),
                     business_keys=metadata_dict.get('business_keys'),
                     source_models=metadata_dict.get('source_models')
                     ) }}
```
- hashkey: This hashkey column was created beforehand inside the corresponding staging area, using the stage macro.
- business_keys: This Hub has two business keys, which are both defined here. They need to match the input columns of the hashkey column.
- source_models: This creates a Hub loaded from two sources, which is also not uncommon. It uses the stage model 'stage_account', and since the parameter 'bk_columns' is not set for it, the value of the upper-level parameter 'business_keys' is used. Additionally, the model 'stage_partner' is used, with the assumption that both sources share the same definition of an account, just under different names. Therefore different business key columns are defined under 'bk_columns'. The number of business key columns must be the same across all sources, which is the case here. The hashkey column inside this stage is called 'hk_partner_h' and is therefore defined under 'hk_column'. If it were not defined, the macro would look for a column named like the 'hashkey' parameter defined one level above.
    - The static part of the record source column inside 'stage_partner' is '/SALESFORCE/Partners/'. For further information about the rsrc_static attribute, please visit the following wiki page: rsrc_static Attribute
The High-Water Mark (HWM) can be disabled safely, but doing so typically decreases loading performance.
We recommend experimenting to find out what works best in your environment. You basically have three options:
- Keep the HWM activated. For multi-source Hubs this requires rsrc_static to be defined for each source. For single-source Hubs nothing needs to be done; the HWM is activated automatically.
- Disable the HWM entirely. For multi-source Hubs you just need to not specify the rsrc_static attribute. For single-source Hubs you need to add the parameter disable_hwm=true to your Hub macro call.
- Move the HWM to a previous layer. First, apply the previous option to disable the HWM in the Hubs. Then implement a mechanism in previous dbt layers that ensures only records newer than what you have already processed are available there. This can be especially effective when combined with different materializations of these previous layers.
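
The second option can be sketched for a single-source Hub as follows. This is a minimal sketch that reuses the single-source example from above and only adds the 'disable_hwm' parameter to the macro call; it is an illustration, not a recommendation for every environment.

```jinja
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hk_account_h'
business_keys:
    - account_key
    - account_number
source_models: stage_account
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{# disable_hwm=true switches off the automatic High-Water Mark for this Hub #}
{{ datavault4dbt.hub(hashkey=metadata_dict.get('hashkey'),
                     business_keys=metadata_dict.get('business_keys'),
                     source_models=metadata_dict.get('source_models'),
                     disable_hwm=true
                     ) }}
```

For a multi-source Hub, the same effect is described above as simply leaving out 'rsrc_static' in every entry of 'source_models'.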