Data Lake Concepts

Staging Area

This is the area into which you copy external data. Usually, you then convert the external data into assets. Here are some considerations (a conversion sketch follows the list):

  • If the external data has quality problems, you can investigate the issue by looking at the data in the staging area.
  • If your data ingestion pipeline has bugs that result in bad assets being generated, you can debug the pipeline by looking at the data in the staging area.
  • There may be cases where you only consume part of the raw data when producing assets; if you later change your pipeline to consume more of it, you can use the data in the staging area for backfill.
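
A minimal sketch of this flow, assuming PySpark; the paths, date partition, and column names below are illustrative, not DataManager's actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest_employee_punch").getOrCreate()

# Step 1: the external data has been copied into the staging area as-is
# (here we read a hypothetical raw CSV drop).
raw = spark.read.option("header", "true").csv("hdfs:///staging/hr/punch/2021-06-01/")

# Step 2: convert the staged raw data into an asset.
# Only the columns the pipeline currently needs are consumed; the full
# raw file stays in staging for debugging and future backfill.
asset = raw.select("employee_id", "punch_time", "punch_type")
asset.write.mode("overwrite").parquet("hdfs:///assets/employee_punch_fact/2021-06-01/")
```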

Challenges:

  • There may be cases where the external data changes and your staging area becomes stale. Re-staging the data and updating the downstream assets all the way to the reports is not easy. Also keep in mind that there should be a window within which you handle raw-data updates, and the window size should be reasonable: you are unlikely to deal with a raw-data change that happened a year ago (see the sketch below).
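
One way to bound this problem is to refuse re-staging requests that fall outside a fixed reprocessing window. A minimal sketch, assuming a hypothetical 90-day window:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: only honor raw-data updates from the last 90 days.
REPROCESS_WINDOW = timedelta(days=90)

def should_restage(raw_data_date: datetime) -> bool:
    """Return True if a raw-data change is recent enough to re-stage
    and propagate to downstream assets and reports."""
    return datetime.now(timezone.utc) - raw_data_date <= REPROCESS_WINDOW
```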

Dataset

A dataset is a collection of assets that serve the same purpose. For example, it could be an "employee check-in fact" for a company that tracks all employee check-in and check-out activities.

A dataset has a major version and a minor version. For a given major and minor version, the schema is the same for all assets the dataset owns, and schema changes between minor versions are backward compatible.

Here is an example of a dataset: name="employee_punch_fact", major_version="1.0", minor_version=1
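
As a sketch, a dataset's identity could be modeled like this; the class below is illustrative, not DataManager's actual model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetVersion:
    name: str           # e.g. "employee_punch_fact"
    major_version: str  # schema may change incompatibly across major versions
    minor_version: int  # schema changes between minor versions are backward compatible

punch_fact = DatasetVersion(name="employee_punch_fact", major_version="1.0", minor_version=1)
```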

Asset

An asset is the basic unit that carries data; every asset belongs to a dataset. An asset is stored in a format that can be loaded into an Apache Spark dataframe directly and efficiently, for example, Parquet.

An asset can be physically located in various kinds of storage; for example, it could be in HDFS, in an AWS S3 bucket, or in a table in an RDBMS.
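
Since an asset is typically Parquet, loading it into a Spark dataframe is direct regardless of where it lives. A sketch with hypothetical paths and connection details:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_asset").getOrCreate()

# The same Parquet asset could live in HDFS or S3; only the URI differs.
df_hdfs = spark.read.parquet("hdfs:///assets/employee_punch_fact/2021-06-01/")
df_s3 = spark.read.parquet("s3a://my-data-lake/assets/employee_punch_fact/2021-06-01/")

# An asset stored as an RDBMS table would be loaded via JDBC instead.
df_jdbc = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/warehouse")  # hypothetical connection
    .option("dbtable", "employee_punch_fact")
    .option("user", "reader")
    .option("password", "changeme")
    .load())
```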

An asset is immutable for a given revision; changes arrive as new revisions. It is common for a pipeline to generate a new revision of an asset when A) the pipeline had a bug that produced a bad asset and a re-run corrected it, or B) an upstream asset was updated, which triggered re-calculation of this asset.
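
One common way to realize immutable revisions is to write each revision to its own location and point readers at the latest one. A minimal sketch under that assumption; the path layout is illustrative:

```python
# Each revision of an asset gets its own immutable location; readers are
# pointed at the latest revision, and older revisions stay intact.
def revision_path(dataset: str, partition: str, revision: int) -> str:
    return f"hdfs:///assets/{dataset}/{partition}/r{revision}/"

# A re-run after a bug fix, or an upstream update, writes revision N+1
# instead of mutating revision N in place.
current = revision_path("employee_punch_fact", "2021-06-01", 1)
fixed = revision_path("employee_punch_fact", "2021-06-01", 2)
```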