Data Engineering Challenges - stonezhong/DataManager GitHub Wiki

Data Catalog

When you publish an asset, you need to describe it clearly; otherwise your data may go unused or be used in the wrong way. The data catalog should tell:

  • What is this dataset? A clear description of the dataset is needed.
  • What is the schema for this dataset?

A good catalog boosts the value of your data!
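As a rough illustration of the two questions above, a catalog entry can be modeled as a description plus a schema, with a check that refuses to publish an asset that lacks either. This is only a sketch: `Column`, `CatalogEntry`, `validate`, and the `daily_trades` dataset are hypothetical names chosen for the example, not part of DataManager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str = ""


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical catalog record: what the dataset is, and its schema.
    name: str
    description: str
    schema: tuple  # tuple of Column

    def validate(self):
        """Return a list of problems; an empty list means the entry is publishable."""
        problems = []
        if not self.description.strip():
            problems.append("missing description")
        if not self.schema:
            problems.append("missing schema")
        return problems


entry = CatalogEntry(
    name="daily_trades",
    description="One row per executed trade, loaded once a day.",
    schema=(Column("trade_id", "string"), Column("amount", "decimal(18,2)")),
)
print(entry.validate())
print(CatalogEntry(name="mystery", description=" ", schema=()).validate())
```

Running the check at publish time is one way to make sure an undocumented asset never reaches consumers.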

Self-service ETL pipelines

You need a UI that lets users create and schedule ETL pipelines. 95% of ETL pipelines can be written as SQL statements, so such a UI enables non-engineers to create ETL pipelines themselves, and data engineers are no longer the bottleneck.
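To make the idea of a SQL-only pipeline concrete, here is a minimal sketch in which a pipeline is nothing more than a name, an output table, and a SELECT statement, which is the kind of record a UI form could collect. It uses SQLite from the Python standard library as a stand-in engine; `SqlPipeline`, `run_pipeline`, and the `orders` table are assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SqlPipeline:
    # Hypothetical pipeline definition: everything a user has to supply.
    name: str
    output_table: str
    sql: str  # a SELECT statement that produces the output rows


def run_pipeline(conn, pipeline):
    """Materialize the pipeline's SELECT into its output table.

    Sketch only: the table name is interpolated directly, so it must
    come from a trusted source.
    """
    conn.execute(f"DROP TABLE IF EXISTS {pipeline.output_table}")
    conn.execute(f"CREATE TABLE {pipeline.output_table} AS {pipeline.sql}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

pipeline = SqlPipeline(
    name="customer_totals",
    output_table="customer_totals",
    sql="SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer",
)
run_pipeline(conn, pipeline)
rows = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
```

The user writes only the SELECT; creating the table and scheduling the run are left to the platform.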

Pipeline / Asset dependency

Your data platform should have a way to manage pipeline and asset dependencies, so that the scheduler can run a pipeline as soon as the assets it requires are ready.
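The scheduling rule described above can be sketched as a set comparison: a pipeline is runnable once every asset it requires is in the set of ready assets. The structure of `pipelines`, the function names, and the asset names below are hypothetical and only illustrate the idea.

```python
def runnable_pipelines(pipelines, ready_assets):
    """Names of pipelines whose required assets are all ready and
    whose output has not been produced yet."""
    return sorted(
        name
        for name, spec in pipelines.items()
        if spec["requires"] <= ready_assets
        and spec["produces"] not in ready_assets
    )


def run_all(pipelines, ready_assets):
    """Repeatedly run whatever is runnable; return the execution order."""
    ready = set(ready_assets)
    order = []
    while True:
        batch = runnable_pipelines(pipelines, ready)
        if not batch:
            break
        for name in batch:
            order.append(name)
            # A real scheduler would execute the pipeline here.
            ready.add(pipelines[name]["produces"])
    return order


pipelines = {
    "load_trades": {"requires": {"raw_trades"}, "produces": "trades"},
    "daily_report": {"requires": {"trades", "fx_rates"}, "produces": "report"},
}
print(run_all(pipelines, {"raw_trades", "fx_rates"}))
print(run_all(pipelines, {"raw_trades"}))
```

In the second call `daily_report` never runs, because `fx_rates` is not ready.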

Asset revision

Every asset should have a revision. Users should feel comfortable caching an asset, or caching the result of a query against a given revision of an asset, because an asset is immutable at a fixed revision.
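Because an asset never changes at a fixed revision, a cache keyed by asset, revision, and query needs no invalidation logic: a new revision simply produces a new key. A minimal sketch, with `RevisionCache` and its `get` signature being hypothetical:

```python
class RevisionCache:
    """Cache query results per (asset, revision, query).

    Safe only because an asset is immutable at a fixed revision.
    """

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get(self, asset, revision, query, compute):
        key = (asset, revision, query)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute()  # the expensive query runs only once
        return self._store[key]


cache = RevisionCache()
first = cache.get("table2", 1, "row_count", lambda: 100)
second = cache.get("table2", 1, "row_count", lambda: 999)  # served from cache
newer = cache.get("table2", 2, "row_count", lambda: 120)   # new revision, new key
print(first, second, newer, cache.misses)
```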

Asset dependency (data lineage)

Your data platform should manage asset dependencies. This helps in two ways:

  • You can tell whether an asset is stale by looking at its dependencies.
  • You know what the asset was built from, which helps you understand the asset better.

For example, table1(revision1) depends on table2(revision1). Later, someone finds a bug in the ETL pipeline and fixes table2, so:
(1) table2(revision1) is deprecated
(2) table2(revision2) is created

Once you have the dependency data, you can tell that table1(revision1) is stale and needs a refresh.
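The table1/table2 example can be sketched as a small lineage lookup: an asset revision is stale if anything it depends on is deprecated. The sketch also assumes that staleness propagates downstream, which the text above does not state, and it adds a hypothetical `table3` to show that propagation. `stale_assets` and the data layout are likewise assumptions.

```python
def stale_assets(depends_on, deprecated):
    """Return every (asset, revision) that depends, directly or
    transitively, on a deprecated (asset, revision)."""
    stale = set()
    changed = True
    while changed:  # repeat until no new stale asset is found
        changed = False
        for asset, upstream in depends_on.items():
            if asset in stale:
                continue
            if any(u in deprecated or u in stale for u in upstream):
                stale.add(asset)
                changed = True
    return stale


# Lineage: table1(revision1) was built from table2(revision1).
depends_on = {
    ("table1", 1): {("table2", 1)},
    ("table3", 1): {("table1", 1)},  # hypothetical downstream asset
}
deprecated = {("table2", 1)}  # replaced by table2(revision2) after the bug fix

print(sorted(stale_assets(depends_on, deprecated)))
```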
