Data Model for Data Lakes
- Background
- General Flow
- Overview
- Data Model Guidelines
- Data Modelling Technique
- General Principles
- Surrogate keys vs natural keys
- Code tables
- Delta Lake
- What is Delta Lake
- Other aspects
- Traditionally we have been dealing with (Enterprise-wide) Data Warehouses and Data Marts. These are usually table oriented and support batch-style ingestion
- The data in them is populated mainly from flat files and relational databases
- In the data lake realm the three Vs (volume, velocity, variety) are prominent. We will be getting data at high velocity: it will be commonplace to get data from APIs and message queues on a near real-time basis. The volume of data will be huge, so the data structures chosen have to strike the right balance between storage and performance (with performance ideally taking precedence)
- Variety is again commonplace in the data lake world, so the data model approach has to provide agility
- Usage of data in a data lake is widespread, with multiple engines operating on the same data: Hive provides SQL infrastructure on top of data stored in HDFS plus metadata stored in the Hive metastore, while Spark manipulates the data directly from HDFS either through its core data integration API or ML libraries. Most file formats encapsulate the schema (ORC encapsulates the data types, while Parquet even carries column names and data types). This means the schema and data types should be kept quite simple and standard so that all these tools can inter-operate
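To make the interoperability point concrete, here is a minimal sketch (path and column names are hypothetical): Spark writes Parquet files whose schema travels with the data, and the same files can immediately be queried through SQL without any separate type mapping.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-in-file-format-demo").getOrCreate()

# Hypothetical sample feed
orders = spark.createDataFrame(
    [(1001, "C001", 250.50), (1002, "C002", 99.99)],
    ["order_id", "customer_code", "order_amount"],
)

# Parquet embeds the column names and data types in the files themselves
orders.write.mode("overwrite").parquet("/tmp/lake/src_orders")

# Any engine that reads Parquet recovers the schema without a side channel
reloaded = spark.read.parquet("/tmp/lake/src_orders")
reloaded.printSchema()

# The same files can then be queried through SQL-oriented tools
reloaded.createOrReplaceTempView("src_orders")
spark.sql(
    "SELECT customer_code, SUM(order_amount) AS total_amount "
    "FROM src_orders GROUP BY customer_code"
).show()
```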
Source system/DB/files -> Source image layer (tables should ideally have a prefix to identify them, such as sdl, stg, sri or src_xxx tables) -> Integration and reusable data layer (any processed tables / aggregate tables / fact-dim tables) -> Use-case layer (any further de-normalised tables)
- All three layers should keep data as Delta tables
- To populate the tables in the "Reusable or Integration" layer and the Use-case layer we can use the Processing module. It's not mandatory to have tables in both the "Reusable or Integration" layer and the Use-case layer, or to go in that sequence; we can always bypass the reusable data set / integration layer
- Now, to publish the data to SQL Server, I don't prefer to use external tables; rather, use the Ingestion module to write the data to native SQL Server tables directly and use those for the Power BI (PBI) import
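As a rough sketch of the publish step, assuming a cluster with Delta Lake and the SQL Server JDBC driver available (the path, connection string, credentials and table names below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("publish-to-sql-server").getOrCreate()

# Read a use-case layer table kept as a Delta table (hypothetical path)
usecase_df = spark.read.format("delta").load("/lake/usecase/sales_summary")

# Write to a native SQL Server table over JDBC; Power BI then imports from this table
(usecase_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=reporting")
    .option("dbtable", "dbo.sales_summary")
    .option("user", "publish_user")
    .option("password", "********")
    .mode("overwrite")
    .save())
```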
- Data model design is a vast topic; essentially it is the heart of the whole solution
The approach should be simple:
- No need to force a dimensional model; if one is not required then let's not have one. For example, if you are getting data like sales orders, order lines, order headers etc., let them remain so (see the sketch after this list)
- If the data
- Don't enforce referential integrity as much as possible
- You can't avoid
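A minimal sketch of these guidelines, assuming Delta Lake is available on the cluster (table and column names are hypothetical): the integration layer simply mirrors the source entities as Delta tables, no star schema or foreign-key constraints are imposed, and relationships are resolved only at query time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keep-entities-as-is").getOrCreate()

# Integration-layer tables mirror the source entities; no star schema is imposed
spark.sql("""
    CREATE TABLE IF NOT EXISTS int_order_header (
        order_id      STRING,
        customer_code STRING,
        order_date    DATE
    ) USING DELTA
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS int_order_line (
        order_id     STRING,
        line_number  BIGINT,
        product_code STRING,
        line_amount  DECIMAL(30,10)
    ) USING DELTA
""")

# No referential integrity is enforced; the relationship is applied at query time
spark.sql("""
    SELECT h.customer_code, SUM(l.line_amount) AS order_total
    FROM int_order_header h
    JOIN int_order_line l ON h.order_id = l.order_id
    GROUP BY h.customer_code
""").show()
```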
- Natural keys
- We can have concatenated keys (integration keys), which concatenate the natural keys to keep the joins simpler, but then they have to be used consistently; in fact we should only have the concatenated keys and not the original natural keys (a combined sketch of these techniques follows this list)
- Use mini dimensions instead of a generic code lookup
- Don't have effective-dated or SCD2 tables; rather, do a simple daily snapshot where history tracking is required. However, if there is a requirement to know how many times the address changed for a customer in the last one year, then we can implement some effective-dated tables, but that is to serve a specific business requirement and not the general pattern
- Don't use smart date or time keys (like yyyymmdd); just use simple date and timestamp columns
- The data model can have actual data types as per the data values, but essentially we use:
a. Use "string" and avoid varchar etc., including varchar(1) for flags
b. Use decimal(30,10), bigint for large integers and int for integers. More details at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types. Keep away from struct, array and other composite data types
- Partitioning: we have to approach this differently for each layer, such as the staging, foundation and use-case layers
- This has wider impact and m
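A combined sketch of several of the techniques above: a concatenated integration key, plain date/timestamp columns, string / decimal(30,10) types, and a daily snapshot partitioned by snapshot date instead of SCD2. All table, column and path names are hypothetical, and Delta Lake is assumed to be configured on the cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("modelling-techniques-sketch").getOrCreate()

# Hypothetical customer feed with natural keys coming from the source system
customers = spark.createDataFrame(
    [("ACME", "C001", "12 High St", 1500.75), ("ACME", "C002", "3 Low Rd", 220.00)],
    ["source_system", "customer_code", "address", "credit_limit"],
)

snapshot = (customers
    # Integration key: a consistent concatenation of the natural keys
    .withColumn("customer_key", F.concat_ws("|", "source_system", "customer_code"))
    # Plain date/timestamp columns, not smart yyyymmdd keys
    .withColumn("snapshot_date", F.current_date())
    .withColumn("load_ts", F.current_timestamp())
    # Simple, standard types: string for text, decimal(30,10) for amounts
    .withColumn("credit_limit", F.col("credit_limit").cast("decimal(30,10)"))
    # Expose only the integration key downstream, not the original natural keys
    .select("customer_key", "address", "credit_limit", "snapshot_date", "load_ts"))

# Daily snapshot instead of SCD2: each day's run appends a new partition
(snapshot.write
    .format("delta")
    .mode("append")
    .partitionBy("snapshot_date")
    .save("/lake/integration/customer_daily_snapshot"))
```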
- Usage of surrogate keys has helped keep the final consumption simple; however, people start using the natural keys instead
- To avoid this, don't keep the natural keys anywhere other than the "Dimension" and "Dimension Key" tables
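A hedged sketch of this rule (all names are hypothetical): the natural keys live only in the dimension-key (key-map) and dimension tables, while fact tables and downstream consumption carry nothing but the surrogate key.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("surrogate-key-sketch").getOrCreate()

# Dimension-key (key-map) table: the only place the natural keys live
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim_key (
        customer_sk   BIGINT,   -- surrogate key handed to facts and marts
        source_system STRING,   -- natural keys stay here only
        customer_code STRING
    ) USING DELTA
""")

# Dimension table keyed by the surrogate key
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim (
        customer_sk   BIGINT,
        customer_name STRING,
        address       STRING
    ) USING DELTA
""")

# Fact tables carry only the surrogate key, so consumers never join on natural keys
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        customer_sk BIGINT,
        sale_date   DATE,
        amount      DECIMAL(30,10)
    ) USING DELTA
""")
```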
- The decision to use a generic code table has always been a debate. The good thing about having them is that they help
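For contrast, a minimal sketch of the two options (hypothetical names): a single generic code table keyed by code type versus the per-domain mini dimension recommended earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("code-table-vs-mini-dim").getOrCreate()

# Generic code table: one table holds every code domain, distinguished by code_type
spark.sql("""
    CREATE TABLE IF NOT EXISTS ref_codes (
        code_type   STRING,   -- e.g. 'ORDER_STATUS', 'PAYMENT_TYPE'
        code        STRING,
        description STRING
    ) USING DELTA
""")

# Mini dimension: one small table per code domain, simpler to join and reason about
spark.sql("""
    CREATE TABLE IF NOT EXISTS order_status_dim (
        order_status_code STRING,
        description       STRING
    ) USING DELTA
""")
```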
- Full guide available here: https://docs.databricks.com/delta/index.html
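As a quick illustration of what Delta Lake adds on top of plain Parquet (ACID MERGE upserts and time travel), here is a minimal sketch assuming the delta-spark package or a Databricks runtime is configured; the path and columns are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-lake-sketch").getOrCreate()

path = "/lake/integration/customer"

# Initial load as a Delta table
initial = spark.createDataFrame(
    [("C001", "12 High St"), ("C002", "3 Low Rd")],
    ["customer_code", "address"],
)
initial.write.format("delta").mode("overwrite").save(path)

# Upsert (MERGE) new or changed rows - an ACID operation on the table
updates = spark.createDataFrame(
    [("C002", "5 New Rd"), ("C003", "9 Hill Ln")],
    ["customer_code", "address"],
)
target = DeltaTable.forPath(spark, path)
(target.alias("t")
    .merge(updates.alias("s"), "t.customer_code = s.customer_code")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```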