Big List of Standards - OptimalBI/optimal-data-engine-mssql GitHub Wiki

The sections below describe most of standards we have for Data Vault objects in the Optimal Data Engine. Whatever is on this web page is our current version.

Databases

ODE Data Vault solution will have at least three databases, configuration, stage and vault itself. Semantic layer, i.e. Data Marts, could be implemented in a separate database.

ODE databases names:

  • ODE_Config for configuration database
  • ODE_Vault, ODE_Data_Vault or ODE_Business_Vault for the Data Vault
  • For staging database, replace word “vault” with “stage” in the ODE Data Vault database name, e.g. ODE_Stage, ODE_Business_Stage etc.

Naming Conventions

Raw and Business Vaults objects get separated by the naming convention.

Raw Vault Business Vault
Hub schema RawHub (hubs that don’t represent a business concept on its own, but data is used in calculations) Hub
Hub name <source_system>__<source_schema>__<source_table_name>. E.g. CRM__Sales__Order_Header Aligned with the business concept, pascal case. Make it short. E.g. Customer, Product.
Hub key Aligned with the source column name Two fields. First one is source system identifier, char(4). Second on is a unique business key (concatenated in case of multipart key), varchar(128). Name same as the hub name.
Satellite schema RawSat BusSat
Satellite name <source_system>__<source_schema>__<source_table_name> Start with the hub name, then meaningful name for the group of attributes represented by satellite. E.g. Customer_DeliveryAddress.
Satellite fields Aligned with the source field names Meaningful names, pascal case. E.g. FirstName.
Link schema RawLnk (Links in Raw Vault are not actually required) Lnk
Link name Source system first, then source tables and relationship between them. E.g. CRM__Order_Header__Customer_Payer Meaningful name describing the relationship, ideally noun + verb + noun separated by underscore, e.g. Customer_Make_Order, Order_Include_Product
Link fields Aligned with hub keys Aligned with hub keys. In case of SAL or HAL, use adjectives, e.g. MasterCustomerKey, BaseCustomerKey

Release Number

Each configuration record must have a release key. Release number is composed from the Sprint start date (YYYYMMDD) and the next available sequence number. E.g. 2016082201, 2016082202. Release could include different types of configuration records. Make release as small as possible. By the end of the Sprint all the finished releases should be pushed to the next environment, e.g. from DEV to TEST. Unfinished releases could be backed out or renamed with the new Sprint sequence number.
Full release number is Sprint start date and 00 as a sequence number or without any sequence number to make it outstanding among incremental release numbers, e.g. 2018031900 or 20180304. Full releases also start from DEV environment for release keys consistency. Full release process is described here.

Change Management

Although any ETL tool is acceptable, generally business logic is implemented via the stored procedures in Stage area. Don't use views. Stage procedure has the same name as satellite with the "usp_" prefix. Stored procedure populates the stage table of the same name. Stored procedures code is versioned by Git.

Typically stored procedure includes:

  • Drop the stage table code.
  • Common table expressions with all the source tables for the query.
  • Select statement that writes data into the stage table. All the column transformations happen here.
  • Stage table could have one input parameter in the case where the same procedure supports Full and Delta load.
  • Convert all source fields to the required data type explicitly.
Business stage table includes only those source fields which are required to be in the Data Vault. All the stage tables have a field dv_stage_datetime defaulted to the stage table load timestamp. This field is required for the audit purpose only and is excluded from the configuration.

If business rule has been changed significantly, stored procedure and staging table get versioned within the database. New copy of the staging object gets the version number suffix, e.g. "_v001". Dependent Vault objects are switched to the new version of stage table. In case if objects require to be switched off, flag it as "retired" in configuration and change the stored procedure to stop populating the object with data.

Scheduling the Load

Stage tables should be scheduled to update the Data Vault. Scheduled tables are loaded in parallel. Most of the tables are supposed to run in the queue "001"; queue "002" is for heavy tables with complicate business logic. However, SSIS packages could only be executed in queue called "Agent001". There could be a one or multiple schedules. For example, load all the Raw tables in the first schedule and start applying business rules after that. However, schedule hierarchy guarantees the correct load order within the load.

Schedule itself should be initiated by SQL Agent or any external trigger. SSIS packages could only be executed by SQL Agent.

Miscellaneous

Always convert hub key field to a string data type.

All the satellites should be configured to have a columnstore index. However, this feature is only available for SQL Server Enterprise Edition.

Every link has a satellite. Satellite on a link contain hub keys only, no context is allowed.

Good old date dimension is represented by the Date reference ensemble. Hub key is date (not an integer).
⚠️ **GitHub.com Fallback** ⚠️