Concepts

Dataset

A dataset is a collection of assets of the same schema.

A dataset is uniquely identified by name, major version, and minor version.

  • Bump the minor version if the schema change is compatible with the schema before the change
  • Bump the major version if the schema change is not compatible with the schema before the change

Dataset attributes:

| Attribute | Description |
| --------- | ----------- |
| name | Name of the dataset. |
| major version | Major version of the dataset. |
| minor version | Minor version of the dataset, starting from 1. |
| author | Person who owns the dataset; users should contact this person with any questions. |
| description | Documentation of the dataset that helps users understand it, usually written by the author. |
| team | Team that owns the dataset. |
| Publish Time | Time when the dataset was published. |
| Expiration Time | Time when the dataset expires. If missing, the dataset does not expire. |

Dataset Path

It is a string that refers to a dataset. Its format is name:major_version:minor_version. For example, "tradings:1.0:1" refers to the dataset whose name is tradings, major version is 1.0, and minor version is 1.
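
For illustration only, here is a minimal sketch of splitting a dataset path into its parts; the `parse_dataset_path` helper is hypothetical, not part of DataManager's API:

```python
from typing import NamedTuple

class DatasetPath(NamedTuple):
    name: str
    major_version: str
    minor_version: int

def parse_dataset_path(s: str) -> DatasetPath:
    # "tradings:1.0:1" -> name="tradings", major_version="1.0", minor_version=1
    name, major_version, minor_version = s.split(":")
    return DatasetPath(name, major_version, int(minor_version))

print(parse_dataset_path("tradings:1.0:1"))
# DatasetPath(name='tradings', major_version='1.0', minor_version=1)
```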

Asset

An asset carries data and can be loaded as a DataFrame in Apache Spark.

An asset belongs to a dataset and is uniquely identified by its asset path within the dataset.

An asset can have many revisions. For a given asset, only the highest revision is active, while the others are deleted. Revisions start at 0.

For example, an asset can be a file in HDFS or an object in an AWS S3 bucket, and it could be a parquet, csv, or jsonl file, etc.

Asset attributes:

| Attribute | Description |
| --------- | ----------- |
| path | Uniquely identifies an asset within a dataset. |
| Publish Time | The time when the asset was published. |
| Data Time | The time the asset's data is for. If you re-compute the asset after fixing a data bug, the data time should not change. |
| Row Count | The row count of the DataFrame loaded from this asset. |
| Locations | List of storage locations for this asset's data. An asset can have multiple locations, for example one primary location and one backup location; the DataFrame loaded from any of these locations should be exactly the same. |
| Producer type | Spark-SQL: the asset was produced by a Spark-SQL task, and you can see all SQL steps that produced it. Application: the asset was produced by an application task, and you can see the arguments passed to the app that produced it. |
| Upstream | List of asset paths that were involved in producing this asset. |
| Downstream | List of asset paths that this asset was involved in producing. |
| Application | The name of the application that produced this asset. |
| Arguments | The arguments the application used when producing this asset. |

Asset Path

It is a string that refers to an asset.

Format 1: dataset_name:major_version:minor_version:path. For example, "tradings:1.0:1:/2020-10-20" refers to the asset that belongs to the dataset whose name is tradings, major version is 1.0, and minor version is 1, and whose path is "/2020-10-20".

Format 2: dataset_name:major_version:minor_version:path:revision. For example, "tradings:1.0:1:/2020-10-20:0" refers to the asset that belongs to the dataset whose name is tradings, major version is 1.0, and minor version is 1, whose path is "/2020-10-20", and whose revision is 0.
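
For illustration only, here is a minimal sketch that handles both formats, assuming the asset path itself contains no colon; the `parse_asset_path` helper is hypothetical, not part of DataManager's API:

```python
def parse_asset_path(s: str) -> dict:
    # Format 1: "tradings:1.0:1:/2020-10-20"   -> no revision
    # Format 2: "tradings:1.0:1:/2020-10-20:0" -> revision 0
    parts = s.split(":")
    if len(parts) == 4:
        name, major, minor, path = parts
        revision = None
    elif len(parts) == 5:
        name, major, minor, path, rev = parts
        revision = int(rev)
    else:
        raise ValueError(f"not a valid asset path: {s}")
    return {"dataset": f"{name}:{major}:{minor}", "path": path, "revision": revision}

print(parse_asset_path("tradings:1.0:1:/2020-10-20"))
print(parse_asset_path("tradings:1.0:1:/2020-10-20:0"))
```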

Data Application

A cross-platform Spark application built on top of spark-etl.

  • A Data Application takes a JSON object as input (see the sketch after this list)
  • A Data Application returns a JSON object as output
  • A Data Application produces zero or more assets
  • A Data Application follows the spark-etl standard, so you can use etl.py to build and deploy data applications.
    • Therefore, a Data Application can run on various Apache Spark platforms (such as AWS EMR, on-premise Apache Spark, Databricks, or even PySpark) without modification.
  • A Data Application can be used for data ingestion. Some other systems call such an application a "Connector".
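
Below is a minimal sketch of what a Data Application's entry point might look like. It assumes the spark-etl convention of a `main(spark, input_args)` function in `main.py` that receives the input JSON as a dict and returns a dict as the output JSON; check the spark-etl documentation for the exact interface. The paths and field names here are made up.

```python
# main.py -- a sketch only, assuming a spark-etl style entry point

def main(spark, input_args):
    # input_args is the input JSON object, e.g.
    # {"source": "hdfs:///raw/tradings/2020-10-20.jsonl",
    #  "target": "hdfs:///data/tradings/2020-10-20.parquet"}
    df = spark.read.json(input_args["source"])
    df.write.mode("overwrite").parquet(input_args["target"])

    # the returned JSON object can report what was produced
    return {"row_count": df.count(), "location": input_args["target"]}
```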

Pipeline

A Pipeline is a collection of tasks, and the dependencies between tasks form a DAG.

  • Spark-SQL task: Executes a group of SQL statements.
  • Application task: Executes a Data Application.
  • Dummy task: Does nothing; usually acts as a placeholder for flow control.

Execution

Represents the execution of a group of pipelines that share the same category, together with a shared context.

When you create tasks for a pipeline, whether Spark-SQL tasks or Application tasks, their arguments can be Jinja templates; a task's arguments are only materialized when the task is bound to an execution, using that execution's context (see the sketch below).
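
To illustrate just the templating step (this is plain Jinja2, not DataManager's internal code, and the argument and context fields are made up):

```python
import json
from jinja2 import Template

# A task argument defined as a Jinja template (field names are hypothetical)
task_args_template = '{"input": "tradings:1.0:1:/{{ dt }}", "mode": "full"}'

# The execution's shared context supplies the values at materialization time
execution_context = {"dt": "2020-10-20"}

materialized = json.loads(Template(task_args_template).render(**execution_context))
print(materialized)  # {'input': 'tradings:1.0:1:/2020-10-20', 'mode': 'full'}
```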

On the Execution page, you can see:

  • the Airflow DAG for each pipeline
  • the Airflow DAG run for each pipeline execution

Executions can be created in two ways:

  • A scheduler may create executions automatically
  • You can create an execution manually, for testing purposes or for one-off jobs.

Scheduler

Represents an execution generator.

It generates executions at their due time.

Attributes:

| Attribute | Description |
| --------- | ----------- |
| Name | A name that uniquely identifies the scheduler. |
| Description | Documentation that helps users understand this scheduler. |
| Paused | If paused, the system won't create executions until the scheduler is unpaused. |
| Catalog | Specifies the catalog for the generated executions. |
| Context | The context for the generated executions. It is a JSON string treated as a Jinja template; due is a datetime variable you can use in the template (see the sketch after this table). |
| Author | The owner of the scheduler. |
| Interval | Specifies the frequency of the scheduler. You need to specify an interval and a unit. |
| Start | The first due time for the scheduler. |
| End | Each due time must be before the End time; if End is missing, this restriction is removed. |
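
As an illustration of the Context attribute (plain Jinja2, not DataManager's internal code; fields other than due are made up):

```python
from datetime import datetime
from jinja2 import Template

# A scheduler Context: a JSON string treated as a Jinja template, where `due`
# is the datetime for which the execution is generated
context_template = """{"dt": "{{ due.strftime('%Y-%m-%d') }}", "mode": "daily"}"""

due = datetime(2020, 10, 20)
print(Template(context_template).render(due=due))
# {"dt": "2020-10-20", "mode": "daily"}
```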