Data Engineering Challenges - stonezhong/DataManager GitHub Wiki
When you publish your assets, you need a clear description of each asset; otherwise, your data cannot be used, or might be used in the wrong way. The data catalog should tell:
- What is this dataset? A clear description of the dataset is needed.
- What is the schema for this dataset?
Having a good catalog boosts the value of your data!
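As a rough illustration, a catalog entry can be as small as a name, a description and a schema. The sketch below is only one way to model it; `CatalogEntry`, `Column`, `is_documented` and the sample dataset are hypothetical names chosen for this example and are not part of DataManager's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Column:
    # Hypothetical column record: name, type and a human-readable description.
    name: str
    dtype: str
    description: str = ""


@dataclass
class CatalogEntry:
    # Hypothetical catalog record answering the two questions above:
    # what is this dataset, and what is its schema?
    name: str
    description: str
    schema: List[Column] = field(default_factory=list)

    def is_documented(self) -> bool:
        # An entry is usable only if it has a non-blank description and a schema.
        return bool(self.description.strip()) and len(self.schema) > 0


# Sample entry (made-up dataset, for illustration only).
entry = CatalogEntry(
    name="daily_orders",
    description="One row per order placed, partitioned by order date.",
    schema=[
        Column("order_id", "string", "Unique order identifier"),
        Column("amount", "double", "Order amount"),
    ],
)
print(entry.is_documented())
```

A check like `is_documented` could be run at publish time to reject assets that lack a description or schema.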
You need a UI that allows users to create and schedule ETL pipelines. 95% of ETL pipelines can be written as SQL statements. Such a UI enables non-engineers to create ETL pipelines, so data engineers are no longer the bottleneck.
Your data platform should have a way to manage pipeline/asset dependencies. This way, the scheduler can schedule your pipeline as soon as the required assets are ready.
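A minimal sketch of that scheduling rule, assuming each pipeline simply declares the names of the assets it requires. The function `runnable_pipelines` and the pipeline and asset names are hypothetical; a real scheduler would also track revisions, time windows and failures.

```python
def runnable_pipelines(pipelines, ready_assets):
    """Return the names of pipelines whose required assets are all ready.

    pipelines: dict mapping pipeline name -> list of required asset names
    ready_assets: iterable of asset names that are currently available
    """
    ready = set(ready_assets)
    return sorted(
        name for name, required in pipelines.items()
        if set(required) <= ready  # every required asset is ready
    )


# Hypothetical pipelines, for illustration only.
pipelines = {
    "build_daily_report": ["orders", "customers"],
    "build_orders_summary": ["orders"],
}

print(runnable_pipelines(pipelines, {"orders"}))
print(runnable_pipelines(pipelines, {"orders", "customers"}))
```

With only `orders` ready, just `build_orders_summary` can run; once `customers` arrives, `build_daily_report` becomes runnable too.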
Every asset should have revisions, and users should feel comfortable caching an asset, or caching the query result from an asset at a given revision, since an asset is immutable for a fixed revision.
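Because a fixed revision never changes, a query result can be cached under the key (asset, revision, query) without any invalidation logic. The sketch below assumes a caller-supplied `run_query` function; `RevisionCache` and `fake_run_query` are hypothetical names used only for illustration.

```python
class RevisionCache:
    """Hypothetical cache of query results keyed by (asset, revision, sql)."""

    def __init__(self):
        self._results = {}
        self.misses = 0  # counts how often the query actually had to run

    def query(self, asset, revision, sql, run_query):
        key = (asset, revision, sql)
        if key not in self._results:
            # Safe to keep forever: the asset is immutable at this revision.
            self.misses += 1
            self._results[key] = run_query(asset, revision, sql)
        return self._results[key]


def fake_run_query(asset, revision, sql):
    # Stand-in for a real query engine (assumption for this example).
    return f"result of [{sql}] on {asset}(revision{revision})"


cache = RevisionCache()
cache.query("table2", 1, "SELECT COUNT(*) FROM table2", fake_run_query)
cache.query("table2", 1, "SELECT COUNT(*) FROM table2", fake_run_query)  # cache hit
cache.query("table2", 2, "SELECT COUNT(*) FROM table2", fake_run_query)  # new revision
print(cache.misses)
```

A new revision produces a new key, so stale results are never served for it.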
Your data platform should manage asset dependencies. This helps in two ways:
- You know if your asset is stale by looking at its dependencies.
- You know what the asset was built from, which helps you understand the asset better.
For example, table1(revision1) depends on table2(revision1), and later, someone finds a bug in the ETL pipeline and fixes table2, so
(1) table2(revision1) is deprecated
(2) table2(revision2) is created
Once you have the dependency data, you can tell that table1(revision1) is stale and needs a refresh.
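The table1/table2 example can be sketched as a small staleness check. This assumes dependencies are recorded per (asset, revision) pair and that deprecated revisions are tracked in a set; the data layout and the `is_stale` function are hypothetical, not DataManager's actual model.

```python
# (asset, revision) -> list of (asset, revision) it was built from.
dependencies = {
    ("table1", 1): [("table2", 1)],
}

# Revisions that have been deprecated, e.g. after a pipeline bug fix.
deprecated = {("table2", 1)}


def is_stale(asset_rev, dependencies, deprecated, _seen=None):
    """Return True if asset_rev depends, directly or transitively,
    on a deprecated revision."""
    seen = set() if _seen is None else _seen
    if asset_rev in seen:
        return False  # already visited; guards against cycles
    seen.add(asset_rev)
    for dep in dependencies.get(asset_rev, []):
        if dep in deprecated or is_stale(dep, dependencies, deprecated, seen):
            return True
    return False


print(is_stale(("table1", 1), dependencies, deprecated))
print(is_stale(("table2", 2), dependencies, deprecated))
```

Here table1(revision1) is reported as stale because it was built from the deprecated table2(revision1), while table2(revision2) has no deprecated inputs.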