Data Model for Data Lakes

Background

  1. Traditionally we have been dealing with (Enterprise-wide) Data Warehouses and Data Marts. These are usually table-oriented and support batch-style ingestion
  2. The data in them is populated mainly from flat files and relational data sources
  3. In the data lake realm the three Vs are prominent. Velocity: it will be commonplace to get data from APIs and message queues on a near real-time basis. Volume: the volume of data will be huge, so the data structures chosen have to strike the right balance between storage and performance (with performance ideally taking precedence)
  4. Variety is again commonplace in the data lake world, so the data modelling approach has to provide agility
  5. Usage of data in the data lake is widespread - multiple engines operate on it: Hive provides a SQL layer on top of data stored in HDFS plus metadata stored in the metastore, while Spark manipulates the data directly from HDFS, either through its core data integration APIs or through ML. Most file formats encapsulate the schema - ORC encapsulates the data types, while Parquet even carries column names and data types. This pushes us to keep the schema and data types simple and standard so that all of these tools inter-operate

General Flow

Source system/db/files -> Source image layer (tables should ideally have a prefix identifying them as sdl, stg, sri or src_xxx tables) -> Integration and reusable data layer (any processed tables, aggregate tables, fact/dim tables) -> Use-case layer (any further de-normalised tables)
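As a rough, non-authoritative sketch of this flow on Spark with Delta Lake (table names such as src_orders, int_orders and uc_daily_sales are illustrative only, not a prescribed convention):

```python
# Sketch of the three-layer flow: source image -> integration/reusable -> use-case.
# Assumes a Delta-enabled Spark session (e.g. Databricks); paths and names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layer-flow-sketch").getOrCreate()

# Source image layer: land the raw extract as-is into a Delta table.
raw = spark.read.option("header", "true").csv("/landing/orders/")
raw.write.format("delta").mode("append").saveAsTable("src_orders")

# Integration / reusable layer: cleansed, conformed version of the same data.
integrated = (spark.table("src_orders")
              .dropDuplicates(["order_id"])
              .withColumn("order_date", F.to_date("order_date"))
              .withColumn("order_amount", F.col("order_amount").cast("decimal(30,10)")))
integrated.write.format("delta").mode("overwrite").saveAsTable("int_orders")

# Use-case layer: further de-normalised table for a specific consumer.
summary = (spark.table("int_orders")
           .groupBy("order_date")
           .agg(F.sum("order_amount").alias("total_amount")))
summary.write.format("delta").mode("overwrite").saveAsTable("uc_daily_sales")
```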

  1. All three layers should keep their data as Delta tables

  2. To populate the tables in the "Reusable or Integration" layer and the Use-case layer we can use the Processing module. It is not mandatory to have tables in both the "Reusable or Integration" layer and the Use-case layer, or to go in that sequence; we can always bypass the reusable data set / integration layer

  3. Now, to publish the data to SQL Server, I don't prefer to use external tables; instead, use the Ingestion module to write the data to native SQL Server tables directly and use those for the PBI Import (see the sketch below)
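A hedged example of that publish step using Spark's generic JDBC writer (connection details, driver availability and the target table dbo.uc_daily_sales are assumptions; Guzzle's Ingestion module may do this differently under the hood):

```python
# Publish a use-case table to a native SQL Server table for Power BI Import.
# Assumes the SQL Server JDBC driver is available on the cluster; all
# connection values below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

publish_df = spark.table("uc_daily_sales")

(publish_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.uc_daily_sales")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("overwrite")   # full refresh; switch to append for incremental loads
    .save())
```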

Overview

  1. Data model design is a vast topic - essentially it is the heart of the whole solution

Data Model Guidelines

The approach should be simple:

Data Modelling Technique

  1. No need to force a dimensional model - if one is not required then let's not have one. For example, if you are getting data like sales orders, order lines, order headers etc., let them remain so.

  2. If the data

  3. Avoid enforcing referential integrity as much as possible

  4. you can't avoid

General Principles

  1. Natural keys
  2. We can have concat keys or integration keys, which concatenate the natural keys to keep the joins simpler - but then it has to be consistent, and in fact we should only have the concat keys and not the original natural keys
  3. Use mini dimensions instead of having a generic code lookup table
  4. Don't use effective-dated or SCD2 tables - rather do a simple daily snapshot where history tracking is required. However, if there is a requirement to know, for example, how many times the address changed for a customer in the last one year, then we can implement some effective dating - but then it is to serve specific business logic and is not the general pattern
  5. Don't use smart date or time keys (like yyyymmdd) - just use simple date and timestamp columns
  6. The data model can have actual data types as per the data values, but essentially we use:
    a. "string" - avoid varchar etc., including varchar(1) for flags
    b. decimal(30,10), bigint for large integers and int for regular integers. More details at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
  7. Keep away from struct, array and other composite data types
  8. Partitioning - we have to approach this differently for each layer (staging, foundation and use-case layers); a sketch covering several of these principles follows this list
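As a hedged illustration of principles 2, 4, 5, 6 and 8 (the table name int_customer_snapshot and its columns are invented for this example, not part of the guideline):

```python
# Sketch of a table that applies several of the principles above: a concat /
# integration key, plain date and timestamp columns, standardised data types
# (string, decimal(30,10), bigint, int) and a simple daily-snapshot partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
CREATE TABLE IF NOT EXISTS int_customer_snapshot (
    customer_key    STRING,          -- concat/integration key, e.g. source_system || '~' || customer_id
    customer_name   STRING,          -- plain string, no varchar(n)
    is_active_flag  STRING,          -- flags as string, not varchar(1)
    credit_limit    DECIMAL(30,10),  -- standard decimal precision
    order_count     INT,
    lifetime_value  BIGINT,
    first_order_dt  DATE,            -- plain date, no smart yyyymmdd key
    load_ts         TIMESTAMP,
    snapshot_date   DATE             -- one row per customer per day
)
USING DELTA
PARTITIONED BY (snapshot_date)
""")
```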

Surrogate keys vs natural keys

  1. This has wider impact and m
  2. Usage of surrogate keys has helped keep the final consumption simple - however, people start using the natural keys
  3. To avoid this, don't keep natural keys anywhere other than the "Dimension" and "Dimension Key" tables (see the sketch below)
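A minimal, hypothetical sketch of that separation in Spark (dim_customer_key, fct_orders and the hash-based surrogate key are assumptions for illustration, not a prescribed approach):

```python
# Keep the natural key (customer_id) only in the dimension-key table;
# downstream fact tables carry just the surrogate key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Dimension-key table: natural key + surrogate key, nothing else.
dim_key = (spark.table("int_customer")
           .select("customer_id")                          # natural key
           .distinct()
           .withColumn("customer_sk",                      # surrogate key as a stable hash
                       F.sha2(F.col("customer_id").cast("string"), 256)))
dim_key.write.format("delta").mode("overwrite").saveAsTable("dim_customer_key")

# Fact table exposes only the surrogate key to consumers.
fct = (spark.table("int_orders")
       .join(dim_key, "customer_id")
       .drop("customer_id"))
fct.write.format("delta").mode("overwrite").saveAsTable("fct_orders")
```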

Code tables

  1. The decision to use a generic code table has always been a debate. The good thing about having them is that they help

Delta Lake

What is Delta Lake

  1. Full guide available here: https://docs.databricks.com/delta/index.html
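The linked guide is the authoritative reference; as a rough, hedged illustration only (table and column names invented), the core Delta operations - creating a table, merging/upserting changes and inspecting version history - look roughly like this:

```python
# Minimal Delta Lake sketch: ACID writes, MERGE-based upserts and version
# history. Assumes a Delta-enabled Spark session (e.g. Databricks or the
# delta-spark package); table names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Delta table from a small DataFrame.
initial = spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["customer_id", "name"])
initial.write.format("delta").mode("overwrite").saveAsTable("src_customers")

# Upsert later changes into the same table.
changes = spark.createDataFrame(
    [(2, "bob smith"), (3, "carol")], ["customer_id", "name"])

(DeltaTable.forName(spark, "src_customers").alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Every write creates a new version; inspect the transaction history.
DeltaTable.forName(spark, "src_customers").history() \
    .select("version", "operation", "timestamp").show()
```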

Other aspects
