Data Model for Data Lakes
- Background
- General Flow
- Overview
- Data Model Guidelines
- Data Modelling Technique
- General Principles
- Surrogate keys vs natural keys
- Code tables
- Delta Lake
- What is Delta Lake
- Other aspects
- Traditionally we have been dealing with (Enterprise-wide) Data Warehouses and Data Marts. These are usually table oriented and support batch-style ingestion
- The data in them is populated mainly from flat files and relational databases
- In the data lake realm the three Vs (volume, velocity, variety) are prominent. We will be getting data at high velocity: it will be commonplace to get data from APIs and message queues on a near real-time basis. The volume of data will be huge, so the data structures chosen have to strike the right balance between storage and performance (with performance ideally taking precedence)
- Variety is again commonplace in the data lake world, so the data model approach has to provide agility
- Usage of data in a data lake is widespread, with multiple engines operating on the same data: Hive provides SQL infrastructure on top of data stored in HDFS plus metadata stored in the Hive metastore, while Spark manipulates the data directly from HDFS either through its core data integration API or ML libraries. Most file formats encapsulate the schema (ORC encapsulates the data types, while Parquet even carries column names and data types). This means the schema and data types should be kept quite simple and standard so that all these tools can inter-operate
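To make the interoperability point concrete, here is a minimal sketch (path and column names are hypothetical): Spark writes Parquet files whose schema travels with the data, and the same files can immediately be queried through SQL without any separate type mapping.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-in-file-format-demo").getOrCreate()

# Hypothetical sample feed
orders = spark.createDataFrame(
    [(1001, "C001", 250.50), (1002, "C002", 99.99)],
    ["order_id", "customer_code", "order_amount"],
)

# Parquet embeds the column names and data types in the files themselves
orders.write.mode("overwrite").parquet("/tmp/lake/src_orders")

# Any engine that reads Parquet recovers the schema without a side channel
reloaded = spark.read.parquet("/tmp/lake/src_orders")
reloaded.printSchema()

# The same files can then be queried through SQL-oriented tools
reloaded.createOrReplaceTempView("src_orders")
spark.sql(
    "SELECT customer_code, SUM(order_amount) AS total_amount "
    "FROM src_orders GROUP BY customer_code"
).show()
```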
Source system/DB/files -> Source image layer (tables should ideally have a prefix to identify them, such as sdl, stg, sri or src_xxx tables) -> Integration and reusable data layer (any processed tables / aggregate tables / fact-dim tables) -> Use-case layer (any further de-normalised tables)
- All three layers should keep data as Delta tables
- To populate the tables in the "Reusable or Integration" layer and the Use-case layer we can use the Processing module. It's not mandatory to have tables in both the "Reusable or Integration" layer and the Use-case layer, or to go in that sequence; we can always bypass the reusable data set / integration layer
- Now, to publish the data to SQL Server, I don't prefer to use external tables; rather, use the Ingestion module to write the data to native SQL Server tables directly and use those for the Power BI (PBI) import
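As a rough sketch of the publish step, assuming a cluster with Delta Lake and the SQL Server JDBC driver available (the path, connection string, credentials and table names below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("publish-to-sql-server").getOrCreate()

# Read a use-case layer table kept as a Delta table (hypothetical path)
usecase_df = spark.read.format("delta").load("/lake/usecase/sales_summary")

# Write to a native SQL Server table over JDBC; Power BI then imports from this table
(usecase_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=reporting")
    .option("dbtable", "dbo.sales_summary")
    .option("user", "publish_user")
    .option("password", "********")
    .mode("overwrite")
    .save())
```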
- Data model design is a vast topic; essentially it is the heart of the whole solution
The approach should be simple:
- No need to force a dimensional model; if one is not required then let's not have one. For example, if you are getting data like sales orders, order lines, order headers etc., let them remain so (see the sketch after this list)
- If the data
- Don't enforce referential integrity as much as possible
- You can't avoid
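A minimal sketch of these guidelines, assuming Delta Lake is available on the cluster (table and column names are hypothetical): the integration layer simply mirrors the source entities as Delta tables, no star schema or foreign-key constraints are imposed, and relationships are resolved only at query time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keep-entities-as-is").getOrCreate()

# Integration-layer tables mirror the source entities; no star schema is imposed
spark.sql("""
    CREATE TABLE IF NOT EXISTS int_order_header (
        order_id      STRING,
        customer_code STRING,
        order_date    DATE
    ) USING DELTA
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS int_order_line (
        order_id     STRING,
        line_number  BIGINT,
        product_code STRING,
        line_amount  DECIMAL(30,10)
    ) USING DELTA
""")

# No referential integrity is enforced; the relationship is applied at query time
spark.sql("""
    SELECT h.customer_code, SUM(l.line_amount) AS order_total
    FROM int_order_header h
    JOIN int_order_line l ON h.order_id = l.order_id
    GROUP BY h.customer_code
""").show()
```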
- Natural keys
- We can have concatenated keys (integration keys), which concatenate the natural keys to keep the joins simpler, but then they have to be used consistently; in fact we should only have the concatenated keys and not the original natural keys (a combined sketch of these techniques follows this list)
- Use mini dimensions instead of a generic code lookup
- Don't have effective-dated or SCD2 tables; rather, do a simple daily snapshot where history tracking is required. However, if there is a requirement to know how many times the address changed for a customer in the last one year, then we can implement some effective-dated tables, but that is to serve a specific business requirement and not the general pattern
- Don't use smart date or time keys (like yyyymmdd); just use simple date and timestamp columns
- The data model can have actual data types as per the data values, but essentially we use:
a. Use "string" and avoid varchar etc., including varchar(1) for flags
b. Use decimal(30,10), bigint for large integers and int for integers. More details at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types. Keep away from struct, array and other composite data types
- Partitioning: we have to approach this differently for each layer, such as the staging, foundation and use-case layers
- This has wider impact and m
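A combined sketch of several of the techniques above: a concatenated integration key, plain date/timestamp columns, string / decimal(30,10) types, and a daily snapshot partitioned by snapshot date instead of SCD2. All table, column and path names are hypothetical, and Delta Lake is assumed to be configured on the cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("modelling-techniques-sketch").getOrCreate()

# Hypothetical customer feed with natural keys coming from the source system
customers = spark.createDataFrame(
    [("ACME", "C001", "12 High St", 1500.75), ("ACME", "C002", "3 Low Rd", 220.00)],
    ["source_system", "customer_code", "address", "credit_limit"],
)

snapshot = (customers
    # Integration key: a consistent concatenation of the natural keys
    .withColumn("customer_key", F.concat_ws("|", "source_system", "customer_code"))
    # Plain date/timestamp columns, not smart yyyymmdd keys
    .withColumn("snapshot_date", F.current_date())
    .withColumn("load_ts", F.current_timestamp())
    # Simple, standard types: string for text, decimal(30,10) for amounts
    .withColumn("credit_limit", F.col("credit_limit").cast("decimal(30,10)"))
    # Expose only the integration key downstream, not the original natural keys
    .select("customer_key", "address", "credit_limit", "snapshot_date", "load_ts"))

# Daily snapshot instead of SCD2: each day's run appends a new partition
(snapshot.write
    .format("delta")
    .mode("append")
    .partitionBy("snapshot_date")
    .save("/lake/integration/customer_daily_snapshot"))
```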
- Usage of surrogate keys has helped keep the final consumption simple; however, people start using the natural keys instead
- To avoid this, don't keep the natural keys anywhere other than the "Dimension" and "Dimension Key" tables
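A hedged sketch of this rule (all names are hypothetical): the natural keys live only in the dimension-key (key-map) and dimension tables, while fact tables and downstream consumption carry nothing but the surrogate key.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("surrogate-key-sketch").getOrCreate()

# Dimension-key (key-map) table: the only place the natural keys live
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim_key (
        customer_sk   BIGINT,   -- surrogate key handed to facts and marts
        source_system STRING,   -- natural keys stay here only
        customer_code STRING
    ) USING DELTA
""")

# Dimension table keyed by the surrogate key
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim (
        customer_sk   BIGINT,
        customer_name STRING,
        address       STRING
    ) USING DELTA
""")

# Fact tables carry only the surrogate key, so consumers never join on natural keys
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        customer_sk BIGINT,
        sale_date   DATE,
        amount      DECIMAL(30,10)
    ) USING DELTA
""")
```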
- The decision to use a generic code table has always been a debate. The good thing about having them is that they help
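For contrast, a minimal sketch of the two options (hypothetical names): a single generic code table keyed by code type versus the per-domain mini dimension recommended earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("code-table-vs-mini-dim").getOrCreate()

# Generic code table: one table holds every code domain, distinguished by code_type
spark.sql("""
    CREATE TABLE IF NOT EXISTS ref_codes (
        code_type   STRING,   -- e.g. 'ORDER_STATUS', 'PAYMENT_TYPE'
        code        STRING,
        description STRING
    ) USING DELTA
""")

# Mini dimension: one small table per code domain, simpler to join and reason about
spark.sql("""
    CREATE TABLE IF NOT EXISTS order_status_dim (
        order_status_code STRING,
        description       STRING
    ) USING DELTA
""")
```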
- Full guide available here: https://docs.databricks.com/delta/index.html
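As a quick illustration of what Delta Lake adds on top of plain Parquet (ACID MERGE upserts and time travel), here is a minimal sketch assuming the delta-spark package or a Databricks runtime is configured; the path and columns are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-lake-sketch").getOrCreate()

path = "/lake/integration/customer"

# Initial load as a Delta table
initial = spark.createDataFrame(
    [("C001", "12 High St"), ("C002", "3 Low Rd")],
    ["customer_code", "address"],
)
initial.write.format("delta").mode("overwrite").save(path)

# Upsert (MERGE) new or changed rows - an ACID operation on the table
updates = spark.createDataFrame(
    [("C002", "5 New Rd"), ("C003", "9 Hill Ln")],
    ["customer_code", "address"],
)
target = DeltaTable.forPath(spark, path)
(target.alias("t")
    .merge(updates.alias("s"), "t.customer_code = s.customer_code")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```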