A dataset is a collection of assets of the same schema.
A dataset is uniquely identified by name, major version, and minor version.
- You can bump the minor version if the schema change is compatible with the schema before the change
- You need to bump the major version if the schema change is not compatible with the schema before the change
Dataset attributes:
Attribute | Description |
---|---|
name | Name of the dataset. |
major version | Major version of the dataset. |
minor version | Minor version of the dataset, starting from 1. |
author | Person who owns the dataset; users should contact this person if they have any questions. |
description | Documentation for the dataset that helps users understand it, usually written by the author. |
team | Team who owns the dataset. |
Publish Time | Time when the dataset was published. |
Expiration Time | Time when the dataset expires. If missing, the dataset does not expire. |
A dataset can be referred to by a string whose format is name:major_version:minor_version. For example, "tradings:1.0:1" refers to the dataset whose name is tradings, major version is 1.0, and minor version is 1.
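For illustration, here is a minimal Python sketch of how such a string could be parsed; the helper name parse_dataset_spec is hypothetical and not part of DataManager's API.

```python
def parse_dataset_spec(spec: str) -> dict:
    """Parse a dataset specifier of the form name:major_version:minor_version.

    Hypothetical helper for illustration only.
    """
    name, major_version, minor_version = spec.split(":")
    return {
        "name": name,
        "major_version": major_version,       # e.g. "1.0"
        "minor_version": int(minor_version),  # minor version starts from 1
    }

# Prints: {'name': 'tradings', 'major_version': '1.0', 'minor_version': 1}
print(parse_dataset_spec("tradings:1.0:1"))
```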
An asset is a data object that carries data and can be loaded as a DataFrame in Apache Spark. For example, it can be a file in HDFS or an object in an AWS S3 bucket, stored as a Parquet, CSV, JSONL, or other file format.
An asset belongs to a dataset and is uniquely identified by its asset path within that dataset.
An asset can have many revisions, starting at revision 0. Given an asset, only the highest revision is active, while all other revisions are deleted.
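As a minimal PySpark sketch, assuming a made-up Parquet location in HDFS, loading an asset's data into a DataFrame could look like this:

```python
from pyspark.sql import SparkSession

# Illustrative only: the path below is a hypothetical storage location for
# the asset "tradings:1.0:1:/2020-10-20".
spark = SparkSession.builder.appName("load-asset-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/tradings/2020-10-20.parquet")
print(df.count())  # should match the asset's "Row Count" attribute
```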
Asset attributes:
Attribute | Description |
---|---|
path | Uniquely identifies an asset within a dataset. |
Publish Time | The time when the asset was published. |
Data Time | The time the asset's data is for. If you re-compute the asset after fixing a data bug, the data time should not change. |
Row Count | The row count of the DataFrame loaded from this asset. |
Locations | List of storage locations for this asset's data. An asset may have multiple locations, for example one primary location and one backup location; a DataFrame loaded from any of these locations should be exactly the same. |
Producer type | |
Upstream | List of asset paths that were involved in producing this asset. |
Downstream | List of asset paths that this asset was involved in producing. |
Application | Shows the name of the application that produced this asset. |
Arguments | Shows the arguments the application used when producing this asset. |
An asset can be referred to by a string in one of two formats.
Format 1: dataset_name:major_version:minor_version:path. For example, "tradings:1.0:1:/2020-10-20" refers to the asset whose path is "/2020-10-20" in the dataset whose name is tradings, major version is 1.0, and minor version is 1.
Format 2: dataset_name:major_version:minor_version:path:revision. For example, "tradings:1.0:1:/2020-10-20:0" refers to revision 0 of the asset whose path is "/2020-10-20" in the dataset whose name is tradings, major version is 1.0, and minor version is 1.
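A minimal sketch of parsing either format, assuming the path itself contains no colon; the helper parse_asset_spec is hypothetical and not part of DataManager's API.

```python
def parse_asset_spec(spec: str) -> dict:
    """Parse an asset reference in either format:

        dataset_name:major_version:minor_version:path
        dataset_name:major_version:minor_version:path:revision

    Hypothetical helper for illustration; assumes the path contains no colon.
    """
    parts = spec.split(":")
    result = {
        "dataset_name": parts[0],
        "major_version": parts[1],
        "minor_version": int(parts[2]),
        "path": parts[3],
    }
    if len(parts) == 5:            # format 2 carries an explicit revision
        result["revision"] = int(parts[4])
    return result

print(parse_asset_spec("tradings:1.0:1:/2020-10-20"))
print(parse_asset_spec("tradings:1.0:1:/2020-10-20:0"))
```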
A Data Application is a cross-platform Spark application built on top of spark-etl.
- A Data Application takes a JSON object as input (see the sketch after this list)
- A Data Application returns a JSON object as output
- A Data Application produces zero or more assets
- A Data Application follows the spark-etl standard; you can use etl.py to build and deploy data applications.
- Therefore, a Data Application can run on various Apache Spark platforms (such as AWS EMR, on-premise Apache Spark, Databricks, or even PySpark) without modification.
- A Data Application can be used for data ingestion; some other systems call such an application a "Connector".
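Below is a rough sketch of what a Data Application's entry point might look like. The function signature, paths, and argument keys are assumptions for illustration only; the actual entry-point convention is defined by spark-etl, so check its documentation.

```python
from pyspark.sql import SparkSession

# Illustrative sketch of a Data Application: it receives a JSON object (as a
# dict), produces an asset, and returns a JSON object describing the result.
def main(spark: SparkSession, input_args: dict) -> dict:
    # Hypothetical input, e.g.
    # {"dt": "2020-10-20", "output_location": "hdfs:///data/tradings/2020-10-20.parquet"}
    source = "hdfs:///raw/tradings/{}.jsonl".format(input_args["dt"])  # made-up source path
    df = spark.read.json(source)

    # Produce one asset at the requested location.
    df.write.mode("overwrite").parquet(input_args["output_location"])

    # The returned JSON object can describe the asset(s) produced.
    return {
        "status": "ok",
        "row_count": df.count(),
        "location": input_args["output_location"],
    }
```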
A Pipeline has a collection of tasks, and the dependencies between tasks form a DAG (see the sketch after this list). A task can be one of the following types:
- Spark-SQL task: executes a set of SQL statements.
- Application task: executes a Data Application.
- Dummy task: does nothing; usually acts as a placeholder for flow control.
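Below is a minimal sketch of how tasks and their dependencies form a DAG. The task names, types, and dictionary layout are made up for illustration and are not DataManager's actual pipeline format.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: one dummy task, one Application task, one Spark-SQL task.
tasks = {
    "start":        {"type": "dummy"},
    "load_trades":  {"type": "application", "app": "ingest_trades"},
    "daily_report": {"type": "spark-sql",
                     "sql": "SELECT symbol, SUM(qty) FROM trades GROUP BY symbol"},
}

# Dependencies form a DAG: each task maps to the set of tasks it depends on.
dependencies = {
    "load_trades":  {"start"},
    "daily_report": {"load_trades"},
}

# A valid execution order respects the DAG:
# ['start', 'load_trades', 'daily_report']
print(list(TopologicalSorter(dependencies).static_order()))
```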
An Execution represents the execution of a group of pipelines that share the same category, along with their shared context.
When you create tasks for a pipeline, whether Spark-SQL tasks or Application tasks, their arguments can be Jinja templates; a task's arguments are only materialized when the pipeline is bound to an execution, using that execution's context.
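As a minimal sketch, assuming the jinja2 library and made-up context keys, materializing a templated task argument from an execution's context could look like this:

```python
from jinja2 import Template

# A task argument written as a Jinja template; the keys are hypothetical.
argument_template = '{"input_asset": "tradings:1.0:1:/{{ dt }}", "mode": "{{ mode }}"}'

# The execution's context supplies the values at bind time.
execution_context = {"dt": "2020-10-20", "mode": "full"}

materialized = Template(argument_template).render(**execution_context)
print(materialized)  # {"input_asset": "tradings:1.0:1:/2020-10-20", "mode": "full"}
```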
In the Execution page, you can see:
- the Airflow DAG for each pipeline
- the Airflow DAG run for each pipeline execution
An Execution can be created in two ways:
- A scheduler may create an execution automatically
- You can create an execution manually, for testing purposes or for a one-off job
A Scheduler represents an execution generator: it generates an execution at each due time.
Attributes:
Attribute | Description |
---|---|
Name | The name that uniquely identifies a scheduler. |
Description | Documentation that helps users understand this scheduler. |
Paused | If paused, the system won't create executions until the scheduler is unpaused. |
Catalog | Specifies the catalog for the generated executions. |
Context | The context for the generated executions. It is a JSON string treated as a Jinja template; due is a datetime variable you can use in the template (see the sketch after this table). |
Author | The owner of the scheduler. |
Interval | Specifies the frequency of the scheduler; you need to specify an interval and a unit. |
Start | The first due time for the scheduler. |
End | Due times must be before the End time; if End is missing, this restriction is removed. |
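Below is an illustrative sketch, not DataManager's actual scheduling code, of how Start, Interval, and End could yield due times, and how the Context template could be rendered with the due variable:

```python
from datetime import datetime, timedelta
from jinja2 import Template

# Hypothetical scheduler settings.
start = datetime(2020, 10, 20)     # Start: the first due time
interval = timedelta(days=1)       # Interval: 1 day
end = datetime(2020, 10, 23)       # End: due times must be before this

# Context: a JSON string written as a Jinja template; `due` is available in it.
context_template = '{"dt": "{{ due.strftime(\'%Y-%m-%d\') }}"}'

due = start
while due < end:
    context = Template(context_template).render(due=due)
    print(due, context)            # each due time yields one execution context
    due += interval
```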