Concepts

Dataset

A dataset is a collection of assets of the same schema.

A dataset is uniquely identified by name, major version, and minor version.

  • Bump the minor version if the schema change is compatible with the schema before the change
  • Bump the major version if the schema change is not compatible with the schema before the change

Dataset attributes:

| Attribute | Description |
| --------- | ----------- |
| name | Name of the dataset. |
| major version | Major version of the dataset. |
| minor version | Minor version of the dataset, starting from 1. |
| author | Person who owns the dataset; users should contact this person with any questions. |
| description | Documentation of the dataset that helps users understand it, usually written by the author. |
| team | Team that owns the dataset. |
| Publish Time | Time when the dataset was published. |
| Expiration Time | Time when the dataset expires. If missing, the dataset does not expire. |

Dataset Path

It is a string that refers to a dataset. Its format is name:major_version:minor_version. For example, "tradings:1.0:1" refers to the dataset whose name is tradings, major version is 1.0, and minor version is 1.
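
For illustration only, here is a minimal sketch of splitting a dataset path into its parts; the `parse_dataset_path` helper is hypothetical, not part of DataManager's API:

```python
from typing import NamedTuple

class DatasetPath(NamedTuple):
    name: str
    major_version: str
    minor_version: int

def parse_dataset_path(s: str) -> DatasetPath:
    # "tradings:1.0:1" -> name="tradings", major_version="1.0", minor_version=1
    name, major_version, minor_version = s.split(":")
    return DatasetPath(name, major_version, int(minor_version))

print(parse_dataset_path("tradings:1.0:1"))
# DatasetPath(name='tradings', major_version='1.0', minor_version=1)
```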

Asset

An asset carries data and can be loaded as a DataFrame in Apache Spark.

An asset belongs to a dataset and is uniquely identified by its asset path within the dataset.

An asset can have many revisions. For a given asset, only the highest revision is active, while the others are deleted. Revisions start at 0.

For example, an asset can be a file in HDFS or an object in an AWS S3 bucket, and it could be a parquet, csv, or jsonl file, etc.

Asset attributes:

| Attribute | Description |
| --------- | ----------- |
| path | Uniquely identifies an asset within a dataset. |
| Publish Time | The time when the asset was published. |
| Data Time | The time the asset's data is for. If you re-compute the asset after fixing a data bug, the data time should not change. |
| Row Count | The row count of the DataFrame loaded from this asset. |
| Locations | List of storage locations for this asset's data. An asset can have multiple locations, for example one primary location and one backup location; the DataFrame loaded from any of these locations should be exactly the same. |
| Producer type | Spark-SQL: the asset was produced by a Spark-SQL task, and you can see all SQL steps that produced it. Application: the asset was produced by an application task, and you can see the arguments passed to the app that produced it. |
| Upstream | List of asset paths that were involved in producing this asset. |
| Downstream | List of asset paths that this asset was involved in producing. |
| Application | The name of the application that produced this asset. |
| Arguments | The arguments the application used when producing this asset. |

Asset Path

It is a string that refers to an asset.

Format 1: dataset_name:major_version:minor_version:path. For example, "tradings:1.0:1:/2020-10-20" refers to the asset that belongs to the dataset whose name is tradings, major version is 1.0, and minor version is 1, and whose path is "/2020-10-20".

Format 2: dataset_name:major_version:minor_version:path:revision. For example, "tradings:1.0:1:/2020-10-20:0" refers to the asset that belongs to the dataset whose name is tradings, major version is 1.0, and minor version is 1, whose path is "/2020-10-20", and whose revision is 0.
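
For illustration only, here is a minimal sketch that handles both formats, assuming the asset path itself contains no colon; the `parse_asset_path` helper is hypothetical, not part of DataManager's API:

```python
def parse_asset_path(s: str) -> dict:
    # Format 1: "tradings:1.0:1:/2020-10-20"   -> no revision
    # Format 2: "tradings:1.0:1:/2020-10-20:0" -> revision 0
    parts = s.split(":")
    if len(parts) == 4:
        name, major, minor, path = parts
        revision = None
    elif len(parts) == 5:
        name, major, minor, path, rev = parts
        revision = int(rev)
    else:
        raise ValueError(f"not a valid asset path: {s}")
    return {"dataset": f"{name}:{major}:{minor}", "path": path, "revision": revision}

print(parse_asset_path("tradings:1.0:1:/2020-10-20"))
print(parse_asset_path("tradings:1.0:1:/2020-10-20:0"))
```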

Data Application

A cross-platform Spark application built on top of spark-etl.

  • A Data Application takes a JSON object as input (see the sketch after this list)
  • A Data Application returns a JSON object as output
  • A Data Application produces zero or more assets
  • A Data Application follows the spark-etl standard, so you can use etl.py to build and deploy data applications.
    • Therefore, a Data Application can run on various Apache Spark platforms (such as AWS EMR, on-premise Apache Spark, Databricks, or even PySpark) without modification.
  • A Data Application can be used for data ingestion. Some other systems call such an application a "Connector".
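
Below is a minimal sketch of what a Data Application's entry point might look like. It assumes the spark-etl convention of a `main(spark, input_args)` function in `main.py` that receives the input JSON as a dict and returns a dict as the output JSON; check the spark-etl documentation for the exact interface. The paths and field names here are made up.

```python
# main.py -- a sketch only, assuming a spark-etl style entry point

def main(spark, input_args):
    # input_args is the input JSON object, e.g.
    # {"source": "hdfs:///raw/tradings/2020-10-20.jsonl",
    #  "target": "hdfs:///data/tradings/2020-10-20.parquet"}
    df = spark.read.json(input_args["source"])
    df.write.mode("overwrite").parquet(input_args["target"])

    # the returned JSON object can report what was produced
    return {"row_count": df.count(), "location": input_args["target"]}
```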

Pipeline

A Pipeline is a collection of tasks, and the dependencies between tasks form a DAG.

  • Spark-SQL task: Executes a group of SQL statements.
  • Application task: Executes a Data Application.
  • Dummy task: Does nothing; usually acts as a placeholder for flow control.

Execution

Represents the execution of a group of pipelines that share the same category, together with a shared context.

When you create tasks for a pipeline, whether Spark-SQL tasks or Application tasks, their arguments can be Jinja templates; a task's arguments are only materialized when the task is bound to an execution, using that execution's context (see the sketch below).
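
To illustrate just the templating step (this is plain Jinja2, not DataManager's internal code, and the argument and context fields are made up):

```python
import json
from jinja2 import Template

# A task argument defined as a Jinja template (field names are hypothetical)
task_args_template = '{"input": "tradings:1.0:1:/{{ dt }}", "mode": "full"}'

# The execution's shared context supplies the values at materialization time
execution_context = {"dt": "2020-10-20"}

materialized = json.loads(Template(task_args_template).render(**execution_context))
print(materialized)  # {'input': 'tradings:1.0:1:/2020-10-20', 'mode': 'full'}
```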

On the Execution page, you can see:

  • the Airflow DAG for each pipeline
  • the Airflow DAG run for each pipeline execution

Executions can be created in two ways:

  • A scheduler may create executions automatically
  • You can create an execution manually, for testing purposes or for one-off jobs.

Scheduler

Represents an execution generator.

It generates executions at their due time.

Attributes:

| Attribute | Description |
| --------- | ----------- |
| Name | A name that uniquely identifies the scheduler. |
| Description | Documentation that helps users understand this scheduler. |
| Paused | If paused, the system won't create executions until the scheduler is unpaused. |
| Catalog | Specifies the catalog for the generated executions. |
| Context | The context for the generated executions. It is a JSON string treated as a Jinja template; due is a datetime variable you can use in the template (see the sketch after this table). |
| Author | The owner of the scheduler. |
| Interval | Specifies the frequency of the scheduler. You need to specify an interval and a unit. |
| Start | The first due time for the scheduler. |
| End | Each due time must be before the End time; if End is missing, this restriction is removed. |
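
As an illustration of the Context attribute (plain Jinja2, not DataManager's internal code; fields other than due are made up):

```python
from datetime import datetime
from jinja2 import Template

# A scheduler Context: a JSON string treated as a Jinja template, where `due`
# is the datetime for which the execution is generated
context_template = """{"dt": "{{ due.strftime('%Y-%m-%d') }}", "mode": "daily"}"""

due = datetime(2020, 10, 20)
print(Template(context_template).render(due=due))
# {"dt": "2020-10-20", "mode": "daily"}
```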