Data Manager Introduction - stonezhong/DataManager GitHub Wiki

Brief

Data Manager is a one-stop shop for managing a data lake built on Apache Spark. It covers the needs of a data team that uses Apache Spark for batch processing. Data visualization and notebooks (such as JupyterHub) are not included in Data Manager. For more information, see the Data Manager wiki pages.

Overall Working Model

Pipeline Execution Diagram

Supported Platforms

You set up your own Apache Spark cluster.
You use the PySpark package, which is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
You host your Spark cluster in Databricks.
You host your Spark cluster in Amazon AWS EMR.
You host your Spark cluster in Google Cloud.
You host your Spark cluster in Microsoft Azure HDInsight.
You host your Spark cluster in Oracle Cloud Infrastructure (Data Flow service).
You host your Spark cluster in IBM Cloud.

Features

Cloud Provider Agnostic

Data Manager supports many Apache Spark providers, including AWS EMR and Microsoft Azure HDInsight. For the complete list, see readme.md.

  • Pipelines and Data Applications sit on top of an abstraction layer that hides the cloud provider details.
  • Moving your Data Manager based data lake from one cloud provider to another needs only a configuration change; no code change is required.

We achieve this by introducing a “job deployer” and a “job submitter” via the Python package spark-etl.

  • By using “spark_etl.vendors.oracle.DataflowDeployer” and “spark_etl.vendors.oracle.DataflowJobSubmitter”, you can run Data Manager in OCI Data Flow; here is an example.
  • By using “spark_etl.vendors.local.LocalDeployer” and “spark_etl.vendors.local.PySparkJobSubmitter”, you can run Data Manager with PySpark on your laptop; here is an example.
  • By using “spark_etl.deployers.HDFSDeployer” and “spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter”, you can run Data Manager with your on-premise Spark cluster; here is an example.

Even if you use a public cloud for your production data lake, you can test your pipelines and data applications with PySpark on your laptop, using down-scaled sample data and exactly the same code.
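
The configuration-only switch between platforms can be sketched as follows. The class paths are the ones listed above; the `PLATFORM_CONFIGS` layout, the profile names, and the helper functions are assumptions made for illustration and are not taken from spark-etl or Data Manager.

```python
import importlib

# Hypothetical platform profiles: only the class paths differ between them.
PLATFORM_CONFIGS = {
    "local": {
        "deployer": "spark_etl.vendors.local.LocalDeployer",
        "job_submitter": "spark_etl.vendors.local.PySparkJobSubmitter",
    },
    "oci": {
        "deployer": "spark_etl.vendors.oracle.DataflowDeployer",
        "job_submitter": "spark_etl.vendors.oracle.DataflowJobSubmitter",
    },
    "on_premise": {
        "deployer": "spark_etl.deployers.HDFSDeployer",
        "job_submitter": "spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter",
    },
}


def split_class_path(class_path):
    """Split 'pkg.module.ClassName' into ('pkg.module', 'ClassName')."""
    module_name, _, class_name = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted class path: {class_path!r}")
    return module_name, class_name


def load_class(class_path):
    """Import the module and return the class object.

    The package that provides the class must be installed.
    """
    module_name, class_name = split_class_path(class_path)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def class_paths_for(platform):
    """Return the (deployer, job submitter) class paths of a profile."""
    profile = PLATFORM_CONFIGS[platform]
    return profile["deployer"], profile["job_submitter"]
```

In this sketch, pipeline code never names a vendor class directly, so moving from the `local` profile to the `oci` profile changes only which entry of the configuration is selected.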

Create Data Pipeline with UI

Non-programmers can create a pipeline through the UI. A pipeline can have multiple “tasks”; each task either executes a series of Spark-SQL statements or launches a Data Application.
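
As a rough mental model, a pipeline with its two kinds of task could be represented like this. The class and field names are hypothetical and do not reflect Data Manager's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class SparkSQLTask:
    """A task that executes a series of Spark-SQL statements."""
    name: str
    statements: List[str]


@dataclass
class ApplicationTask:
    """A task that launches a Data Application with arguments."""
    name: str
    application: str
    arguments: Dict[str, str] = field(default_factory=dict)


Task = Union[SparkSQLTask, ApplicationTask]


@dataclass
class Pipeline:
    name: str
    tasks: List[Task] = field(default_factory=list)


def describe(pipeline):
    """Return one human-readable line per task, in list order."""
    lines = []
    for task in pipeline.tasks:
        if isinstance(task, SparkSQLTask):
            count = len(task.statements)
            lines.append(f"{task.name}: run {count} Spark-SQL statement(s)")
        else:
            lines.append(f"{task.name}: launch application {task.application}")
    return lines


# Hypothetical example pipeline with one task of each kind.
example = Pipeline(
    name="daily-tradings",
    tasks=[
        SparkSQLTask("summarize", ["SELECT 1", "SELECT 2"]),
        ApplicationTask("export", "exporter", {"format": "parquet"}),
    ],
)
```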

Schedule Pipelines with UI

Users can schedule pipeline execution through the UI. A scheduler executes all the pipelines that belong to its category; when these pipelines are invoked, they use the shared context defined in that scheduler.
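
The category matching can be pictured with a small sketch. The record layout, the pipeline names, and the `dt` value are assumptions for illustration only.

```python
# Hypothetical records; the field names are assumptions for illustration.
PIPELINES = [
    {"name": "import-tradings", "category": "daily"},
    {"name": "summarize-tradings", "category": "daily"},
    {"name": "monthly-report", "category": "monthly"},
]

SCHEDULER = {"category": "daily", "context": {"dt": "2020-10-20"}}


def invocations(pipelines, scheduler):
    """Pair every pipeline in the scheduler's category with its shared context."""
    return [
        (pipeline["name"], scheduler["context"])
        for pipeline in pipelines
        if pipeline["category"] == scheduler["category"]
    ]
```

Every selected pipeline receives the same context object, which is what lets one scheduler drive several pipelines with a common set of values.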

Data Catalog

Data Manager provides a data catalog that tracks all datasets and assets. The data catalog has a REST API that clients use to query and register datasets and assets.

Data Lineage

Data Manager keeps track of data lineage information. When you view an asset, you will see:

  • The actual SQL statement that produces this asset if the asset is produced by Spark-SQL task
  • The application name and arguments if the asset is produced by a data application
  • Upstream assets: the list of assets used when producing this asset
  • Downstream assets: the list of assets that required this asset when they were produced

With this information, you understand where data comes from, where it goes, and how it is produced.
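
The upstream and downstream views are two readings of the same records. The sketch below stores only the upstream list per asset and derives the downstream list from it; the asset names and the record layout are hypothetical.

```python
# Hypothetical lineage records: asset -> how it was produced and what it read.
PRODUCED_BY = {
    "daily_summary": {"kind": "spark-sql", "upstream": ["tradings"]},
    "risk_report": {
        "kind": "application",
        "upstream": ["daily_summary", "tradings"],
    },
}


def upstream_of(asset):
    """Assets read directly when producing `asset`."""
    return list(PRODUCED_BY.get(asset, {}).get("upstream", []))


def downstream_of(asset):
    """Assets whose production read `asset` directly."""
    return sorted(
        name for name, record in PRODUCED_BY.items()
        if asset in record["upstream"]
    )


def all_upstream(asset):
    """Every direct or indirect ancestor of `asset`."""
    seen = []
    stack = upstream_of(asset)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(upstream_of(current))
    return sorted(seen)
```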

Pipeline / Asset dependency

You can specify a list of assets as “require” for a pipeline. The scheduler invokes the pipeline only when all the required assets have shown up.

For example, a required asset can be given as tradings:1.0:1:/{{dt}}. This is a Jinja template; “dt” is part of the rendering context that belongs to the scheduler.
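
The dependency check can be sketched as follows. To stay self-contained, the sketch uses a small regular-expression substitute for Jinja that only handles plain `{{name}}` placeholders; the function names, the `dt` value, and the set of available assets are assumptions for illustration.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, context):
    """Minimal stand-in for Jinja: replace {{name}} with context[name]."""
    return _PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), template)


def is_ready(required_templates, context, available_assets):
    """True when every rendered required asset is already available."""
    rendered = [render(template, context) for template in required_templates]
    return all(asset in available_assets for asset in rendered)


# Hypothetical scheduler context and catalog contents.
context = {"dt": "2020-10-20"}
required = ["tradings:1.0:1:/{{dt}}"]
available = {"tradings:1.0:1:/2020-10-19"}
```

With these values the pipeline is not invoked, because only the asset for the previous day exists. Once the asset for the rendered date is registered, the same check passes.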

View Schema for Dataset

In the Dataset Browser UI, you can see the schema for any dataset.

Assets are immutable

Assets are immutable. If you republish an asset (for example, because the earlier version had a data bug), Data Manager bumps its revision. If you cache the asset, or data derived from it, you can check the asset revision to decide whether you need to update your cache.

For each asset, only the latest revision can be active; the metadata tracks all revisions.

The asset page also shows the revision history, so you can see whether the asset has been corrected, and when.
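
A cache check based on revisions might look like the sketch below. The `AssetHistory` class is hypothetical, and numbering revisions from 0 is an assumption; the source does not state where numbering starts.

```python
class AssetHistory:
    """Hypothetical revision log for one asset."""

    def __init__(self):
        # One entry per published revision; existing entries are never
        # modified, republishing only appends.
        self.revisions = []

    def publish(self, location):
        """Record a new revision and return its number (assumed to start at 0)."""
        self.revisions.append(location)
        return len(self.revisions) - 1

    @property
    def active_revision(self):
        """Only the latest revision is treated as active in this sketch."""
        return len(self.revisions) - 1 if self.revisions else None


def cache_is_stale(cached_revision, history):
    """A cache built from an older revision needs to be refreshed."""
    return cached_revision != history.active_revision
```

A consumer that remembers the revision it read can call `cache_is_stale` before reusing derived data, and rebuild only after a republish.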

Decouple asset’s name from storage location

Consider a task inside a pipeline that imports the asset tradings:1.0:1/{{dt}}, calls it “tradings”, and then runs a SQL statement. Users do not need to know the storage location of the asset; Data Manager automatically resolves the asset name to its storage location and loads the asset from there.

This also makes it easy to migrate a data lake between cloud providers, since your pipelines are not tied to the storage locations of assets.

The asset page shows the storage location of the asset.
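
The decoupling of names from locations can be illustrated with a toy lookup table. The catalog contents, the storage URLs, and both helper functions are hypothetical; in Data Manager the resolution is done by the data catalog.

```python
# Hypothetical catalog: asset name -> storage location.
CATALOG = {
    "tradings:1.0:1:/2020-10-20": "hdfs://cluster-a/data/tradings/2020-10-20",
}


def resolve_location(asset_name, catalog):
    """Look up where an asset is stored; callers only know the name."""
    try:
        return catalog[asset_name]
    except KeyError:
        raise LookupError(f"unknown asset: {asset_name}") from None


def migrate(catalog, old_prefix, new_prefix):
    """Return a new catalog with storage locations moved to another prefix."""
    migrated = {}
    for name, location in catalog.items():
        if location.startswith(old_prefix):
            location = new_prefix + location[len(old_prefix):]
        migrated[name] = location
    return migrated
```

A pipeline that refers to the asset by name keeps working after `migrate`, because only the catalog entry changes.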
