Overview - stonezhong/DataManager GitHub Wiki

Typical Lakehouse and Data Manager

Data Manager Features

Data Catalog: A data marketplace

Data Manager provides a data catalog, which makes it a marketplace that connects data publishers and data consumers: publishers publish datasets and assets, and consumers find the datasets they are interested in and explore those datasets and their assets. The detailed features are:

  • For data publishers, it provides a set of REST APIs through which:
    • A data publisher can publish a dataset and set its description, schema, sample data, per-column schema description, primary key, unique key, and foreign key.
    • A data publisher can set constraints, which help others understand the data.
    • A data publisher can manage schema evolution with a major version and a minor version:
      • Bump the major version when a breaking change is made to the schema.
      • Bump the minor version when a non-breaking change is made to the schema.
      • The data publisher should notify data consumers of a schema change ahead of time, and produce assets for both the old schema and the new schema for a period of time so that data consumers can update their pipelines.
  • For data consumers, it provides:
    • A set of REST APIs that resolve an abstract asset path to its physical location. (This is similar to an RDBMS: you do not need to remember which file stores a table, you just work with tables.)
    • A data consumer can browse all datasets.
    • A data consumer can search datasets by schema information, for example to find all datasets that have a given column.
    • A data consumer can search datasets by glossary. If a dataset is associated with a glossary term, it shows up in the search results. (This works because a publisher can refer to the glossary when describing a dataset or a column.)
    • A data consumer can view a dataset's schema and sample data, which also helps the consumer understand the data.
    • A data consumer can view dataset metadata, such as publisher and description, so the consumer can contact the data publisher with any questions.
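The major/minor versioning rule described above can be sketched in a few lines of Python. This is only an illustration, not Data Manager's actual API: `SchemaVersion` and `is_breaking` are hypothetical names, resetting the minor version to 0 on a major bump is an assumed convention, and treating a removed or retyped column as "breaking" is likewise an assumption, since the source does not define which changes are breaking.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SchemaVersion:
    """Hypothetical major.minor version of a dataset schema."""
    major: int
    minor: int

    def bump(self, breaking: bool) -> "SchemaVersion":
        # Breaking change: bump major (resetting minor to 0 is an assumed convention).
        if breaking:
            return SchemaVersion(self.major + 1, 0)
        # Non-breaking change: bump minor only.
        return SchemaVersion(self.major, self.minor + 1)


def is_breaking(old_columns: Dict[str, str], new_columns: Dict[str, str]) -> bool:
    """Assumed rule: removing a column or changing its type is breaking;
    adding a new column is not."""
    for name, col_type in old_columns.items():
        if new_columns.get(name) != col_type:
            return True
    return False
```

Under these assumptions, adding a column to a version 1.2 schema would yield 1.3, while dropping a column would yield 2.0.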

Data Application Catalog: a data application marketplace

Data Applications are required to build Data Pipelines. There are some built-in Data Applications, such as ExecuteSQL for pipelines that only need Spark-SQL to do the data transformation. Data engineers can also publish their own Data Applications, which can then be used when building Data Pipelines.

A data lake built with Data Manager is vendor agnostic. Because it uses spark-etl to build Data Applications, the way you build, deploy, and run a Data Application is the same across different Apache Spark vendors. For example, you can easily migrate your Lakehouse from AWS EMR to Microsoft Azure HDInsight.

A Data Pipeline Engine with self-serviceable Web UI

With Data Manager,

  • Users can build Data Pipelines via the Web UI.
    • Most of the time, users can build a data pipeline with Spark-SQL.
    • In rare cases, users can build a data pipeline with Data Applications.
    • Data Manager automatically creates the Airflow DAG that represents the pipeline.
  • Users can schedule data pipelines. Data Manager runs each user-defined Data Pipeline according to its user-defined schedule.
  • A data pipeline can consume datasets, produce assets, and publish datasets and/or assets to the data catalog.

Data Manager Architecture

Typical pipeline use case:

  • The user creates and deploys a Data Application, then goes to the Data Manager UI to publish it.
  • The user creates a Data Pipeline, then sets its scheduling information, including start time and frequency.
  • In Data Manager, a "Data Pipeline Scheduler" daemon (DPS for short) picks up the pipeline once it is due.
    • Because users can specify "required assets" for a pipeline, DPS does not trigger the pipeline while any required asset is missing.
    • Pipeline execution is handled by Airflow. Data Manager automatically creates the Airflow DAG on behalf of the user.
    • If the pipeline needs to read an asset, it looks up the Data Catalog to find the asset's physical location and then loads the asset into a dataframe from there.
    • A pipeline may produce assets. After it writes an asset to storage, it can publish the asset's physical location back to the data catalog service.
      • Once a dataset or asset is published, it becomes visible to all data consumers, and other pipelines that require the asset may be unblocked and kicked off by the DPS.
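The gating behaviour of the DPS can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `InMemoryCatalog`, `Pipeline`, and `ready_to_run` are made-up names that stand in for the real catalog service and scheduler, and the asset path string used in the usage note is only an example format. The sketch shows the idea that a pipeline runs only when it is due and every required asset can be resolved to a physical location.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


class InMemoryCatalog:
    """Hypothetical stand-in for the Data Catalog service."""

    def __init__(self) -> None:
        # Maps an abstract asset path to its physical location.
        self._locations: Dict[str, str] = {}

    def publish(self, asset_path: str, location: str) -> None:
        # Record where a published asset physically lives.
        self._locations[asset_path] = location

    def resolve(self, asset_path: str) -> Optional[str]:
        # Return the physical location, or None if the asset is not published.
        return self._locations.get(asset_path)


@dataclass
class Pipeline:
    """Hypothetical pipeline record holding only what the gate needs."""
    name: str
    due_at: datetime
    required_assets: List[str] = field(default_factory=list)


def ready_to_run(pipeline: Pipeline, catalog: InMemoryCatalog, now: datetime) -> bool:
    """A pipeline is triggered only if it is due and all required assets exist."""
    if now < pipeline.due_at:
        return False
    return all(catalog.resolve(path) is not None for path in pipeline.required_assets)
```

In this sketch, a pipeline that requires a made-up asset path such as `tradings:1.0:/2020-10-01` stays blocked until some other pipeline calls `publish` for that path, which mirrors how publishing an asset can unblock downstream pipelines.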

Glossary

Dataset

A Dataset is a collection of assets that share the same schema and serve the same purpose. For details, please see the definition.

Asset

An Asset carries data and can be loaded as a dataframe in Apache Spark. For details, please see the definition.

Data Publisher

A person who publishes datasets and assets to Data Manager.

Data Consumer

A person who needs to read data from datasets.

Data Application

A platform-agnostic Spark application built with the spark-etl package. Data Applications can be used when building Data Pipelines.

Data Pipeline

Represents a planned sequence of Data Application invocations with the dependencies between them specified.
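To make "sequence of invocations with dependencies" concrete, here is a minimal sketch that derives a valid run order from a dependency map using Python's standard `graphlib` module (Python 3.9+). The function name `execution_order` and the step names are hypothetical. In Data Manager itself the ordering is delegated to the generated Airflow DAG, so this only illustrates the concept.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which to invoke the steps.

    `dependencies` maps each step name to the set of steps that must
    finish before it starts. A cyclic dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: two loads feed a join, which feeds a report.
steps = {
    "load_trades": set(),
    "load_prices": set(),
    "join": {"load_trades", "load_prices"},
    "report": {"join"},
}
order = execution_order(steps)
```

Any order returned places both load steps before `join` and `join` before `report`; the relative order of the two independent loads is not significant.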