example dw processing

:warning: IN PROGRESS (Not working yet)

Prerequisites

To run this example locally, you need VDK installed in a Python virtualenv (see the Execute section below).

Overview

In this example, we will use the Versatile Data Kit (VDK) to develop a processing Data Job (read the User Guide for an explanation of the different types of data jobs). The job will read data from the Super Collider Data Lake, process it (the 'transform' phase in ETL terminology), and write the result to the Super Collider Data Warehouse.

You can follow along and run this example Data Job on your computer to get first-hand experience with Data Jobs, or you can use the code as a template and extend it to build a Data Job that processes data of your own.

We will take a look at a Data Job named example-process-in-star-schema.

Below, we describe the overwrite (scd1) template strategy. For other template strategies for updating warehouse tables, see **Data Pipelines - Example use of templates**.

We will process data ingested by the Ingest from REST API job. That job reads user objects from a public REST service and ingests them into the Data Lake. Our processing job will simplify the data by selecting only the properties we need (id, name, username, email) and write the simplified users to the Data Warehouse.

Source

The table in the Data Lake that we will process is sc_code_samples__users and contains data in the following format:

sc_code_samples__users

| id | name | username | email | address_street | address_suite | address_city | address_zipcode | address_geo_lat | address_geo_lng | phone | website | company_name | company_catchPhrase | company_bs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Leanne Graham | Bret | Sincere@april.biz | Kulas Light | Apt. 556 | Gwenborough | 92998-3874 | -37.3159 | 81.1496 | 1-770-736-8031 x56442 | hildegard.org | Romaguera-Crona | Multi-layered client-server neural-net | harness real-time e-markets |
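
If you want to inspect the raw ingested data yourself, a query along the following lines should work, assuming you have query access to the Data Lake (the history_staging schema name is taken from the view definition in the Setup section below):

-- Illustrative query: preview a few raw user records from the Data Lake
SELECT id, name, username, email
FROM history_staging.sc_code_samples__users
LIMIT 5;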

Destination

The resulting simplified user data will be recorded in the dim_users table in the Data Warehouse:

| id | name | username | email |
|---|---|---|---|
| 1 | Leanne Graham | Bret | Sincere@april.biz |
| 2 | Ervin Howell | Antonette | Shanna@melissa.tv |
| 3 | Clementine Bauch | Samantha | Nathan@yesenia.net |

File Structure

The data job consists of the following elements:

example-process-in-star-schema
├── 10_create_processing_view.sql
├── 20_create_target_dimension.sql
├── 30_execute_template.py
└── config.ini
example-process-in-star-schema.keytab
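
The SQL and Python files are the job's steps; VDK executes them in alphabetical order, which is why they carry the 10_/20_/30_ prefixes. config.ini holds the job's configuration. A minimal sketch of what it might contain follows (the keys reflect current VDK conventions; the values are placeholders, not taken from this example):

; Illustrative config.ini - team and schedule values are placeholders
[owner]
team = my-team

[job]
schedule_cron = 0 4 * * *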

Setup

We will create a view that does the actual processing work: in this case, it simply selects the subset of the users columns that we need.

10_create_processing_view.sql

CREATE VIEW IF NOT EXISTS super_collider.vw_dim_users AS
SELECT id, name, username, email FROM history_staging.sc_code_samples__users;
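
After the view is created (either by running the job or by executing the statement manually), you can sanity-check it with a simple count (an illustrative query):

-- Confirm the view exists and returns rows
SELECT COUNT(*) FROM super_collider.vw_dim_users;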

We will then create the table in the Data Warehouse where we will record the simplified objects.

20_create_target_dimension.sql

CREATE TABLE IF NOT EXISTS super_collider.dim_users (
    id STRING,
    name STRING,
    username STRING,
    email STRING
) STORED AS PARQUET;

Processing

Finally, we will execute the template that loads the data into the Data Warehouse. The template uses Slowly Changing Dimension Type 1, in which new data overwrites the old data (see https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_1:_overwrite).

30_execute_template.py

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Describe where the template reads from and writes to.
    template_args = {'target_schema': 'super_collider',
                     'target_table': 'dim_users',
                     'source_schema': 'super_collider',
                     'source_view': 'vw_dim_users'}

    # Overwrite super_collider.dim_users with the current contents of the view.
    job_input.execute_template(template_name='scd1', template_args=template_args)
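
The exact implementation of the scd1 template is warehouse-specific, but conceptually the overwrite strategy boils down to something like the following Impala/Hive-style statement (an illustrative sketch, not the template's actual code):

-- Replace the entire contents of the dimension table with the view's output
INSERT OVERWRITE TABLE super_collider.dim_users
SELECT id, name, username, email
FROM super_collider.vw_dim_users;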

For a full list of prerequisites and available template parameters, please consult the load.dimension.scd1 template documentation.

Execute

Data Jobs are executed with the 'vdk' command (you should have first installed VDK in your Python virtualenv):

vdk run example-process-in-star-schema

Upon successful local execution of the job, you will see output similar to this one:

2019-10-17 10:48:11,110=1571298491[VDK] example-process-in-star-schema [INFO ] vacloud.vdk.command_run              command_run.py:88   run             [OpId:1571298370.144615]- Execution of example-process-in-star-schema completed successfully. Result is:
{
  "dataset_name": "example-process-in-star-schema",
  "execution_id": "1571298370.144615",
  "start_time": "2019-10-17T07:46:10Z",
  "end_time": "2019-10-17T07:48:11Z",
  "steps_list": [
    {
      "name": "10_execute_template.py",
      "type": "python",
      "start_time": "2019-10-17T07:46:10Z",
      "end_time": "2019-10-17T07:48:11Z",
      "status": "success",
      "details": null
    }
  ]
}
2019-10-17 10:48:11,110=1571298491[VDK] example-process-in-star-schema [INFO ] vacloud.vdk.command_run              command_run.py:91   run             [OpId:1571298370.144615]- Data Job execution finished successfully.

Result

After the execution of this data job, we will have the simplified users from the external REST API written to the Data Warehouse. Since we are using Slowly Changing Dimension Type 1, each run's new records replace the old ones.
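
You can verify the result with a simple query against the Data Warehouse (illustrative):

-- Inspect the loaded dimension table
SELECT id, name, username, email
FROM super_collider.dim_users
ORDER BY id
LIMIT 10;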

Source Code

The complete source of the data job can be seen in the data jobs repository: https://todo/tree/master/example-process-in-star-schema.