Data Pipeline Best Practices - stonezhong/DataManager GitHub Wiki

Leverage the data catalog to load assets

You should not load an asset from its physical path, as in the code below. Doing so couples your pipeline tightly to the physical storage location and layout.

# Do not do this!
df = spark.read.parquet("hdfs://mycluster/foo/bar.parquet")

Instead, load the asset through the data catalog, as shown below. This gives you benefits such as:

  • If you decide to change the data location, your pipeline keeps working.
  • You can have a test environment where the catalog points the same asset name to mock data, so you can test your pipeline at a small scale.
df = catalog_client.load_asset("daily_trading:1.0:0:/2020-11-05")
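To make the idea concrete, here is a minimal sketch of what such a catalog client could look like. The CatalogClient class, its constructor, and the asset-to-path mapping are illustrative assumptions, not the DataManager API.

from pyspark.sql import SparkSession

class CatalogClient:
    """Hypothetical client: resolves an asset path to a physical location, then loads it."""
    def __init__(self, spark, asset_locations):
        self.spark = spark
        self.asset_locations = asset_locations  # in practice this mapping comes from the catalog service

    def load_asset(self, asset_path):
        # asset_path looks like "daily_trading:1.0:0:/2020-11-05"
        physical_path = self.asset_locations[asset_path]
        return self.spark.read.parquet(physical_path)

spark = SparkSession.builder.getOrCreate()

# Production points to the real data; a test environment can point the same
# asset name to a small mock dataset without touching the pipeline code.
catalog_client = CatalogClient(spark, {
    "daily_trading:1.0:0:/2020-11-05": "hdfs://mycluster/warehouse/daily_trading/2020-11-05",
})
df = catalog_client.load_asset("daily_trading:1.0:0:/2020-11-05")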

Pipelines should declare the assets they depend on

Unless it is a data ingestion pipeline, your ETL pipeline generally depends on some input datasets. It is always a good idea to declare the assets your pipeline needs, so the pipeline can be scheduled at the appropriate time.
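As an illustration, a pipeline definition might list its required inputs and declared outputs. The field names below are hypothetical, not the DataManager schema; the point is that a scheduler can wait until every required asset exists before launching the run.

pipeline_definition = {
    "name": "daily_trading_summary",
    "required_assets": [
        "daily_trading:1.0:0:/{dt}",   # raw trades for the day
        "symbol_dim:1.0:0:/latest",    # hypothetical reference data
    ],
    "output_assets": [
        "daily_trading_summary:1.0:0:/{dt}",
    ],
}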

Revision your assets instead of overwriting them

Everyone makes mistakes. If you find a data bug in an asset, you should not delete it and replace it with the corrected data in place. Instead, your catalog service should allow you to submit a new revision of the asset, so that:

  • Pipelines that consumed the old revision of the asset can re-run and consume the new revision.
    • If a re-run changes a pipeline's output, the pipeline should also bump the revision of its output asset, so the system can trigger the downstream pipelines in turn, until no asset computed from the original source remains out of date (see the sketch after this list).
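
A simplified sketch of this propagation is shown below. The catalog and scheduler objects and all of their methods are hypothetical, used only to illustrate the cascade of re-runs and revision bumps.

def on_new_revision(asset_name, new_revision, catalog, scheduler):
    # Hypothetical clients: find every pipeline that consumed an older revision.
    for pipeline in catalog.find_consumers(asset_name):
        result_df = scheduler.rerun(pipeline, {asset_name: new_revision})
        if pipeline.output_changed(result_df):
            # Publish under a new revision instead of overwriting, which in turn
            # triggers the consumers of that output, and so on down the graph.
            bumped = catalog.publish_new_revision(pipeline.output_asset, result_df)
            on_new_revision(pipeline.output_asset, bumped, catalog, scheduler)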

Try to build your pipeline in SQL as much as possible

Custom-built data applications always bring extra complexity. If your pipeline is implemented in SQL, you get the following benefits:

  • You have the flexibility to move your pipeline to another platform, since most platforms support SQL, such as Apache Spark, Snowflake, Apache Hive, Presto, and traditional RDBMSs (their SQL dialects are not 100% compatible, but the migration cost is still significantly lower).
  • It is easier to code review a pipeline written in SQL, such as the example below, than one written in custom Scala or PySpark code.
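
For example, a pipeline step can register its input as a temporary view and express the transform as Spark SQL. The snippet reuses the catalog_client call from above; the table and column names (symbol, volume, price) are made up for illustration.

daily_trading = catalog_client.load_asset("daily_trading:1.0:0:/2020-11-05")
daily_trading.createOrReplaceTempView("daily_trading")

# The transform itself is plain SQL, which is portable and easy to review.
summary = spark.sql("""
    SELECT symbol,
           SUM(volume) AS total_volume,
           AVG(price)  AS avg_price
    FROM daily_trading
    GROUP BY symbol
""")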

Your pipeline should be idempotent

Users should feel safe re-running your pipeline to recover from an incident, without any side effects.

The output of your pipeline should not change if its input has not changed, unless you fix a bug in the pipeline; in that case, it is effectively a new pipeline.
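
One simple way to keep a Spark pipeline idempotent is to write each run's output to a location keyed deterministically by its input, in overwrite mode, so a re-run replaces the previous result instead of appending to it. The snippet below continues the summary example; the output path is illustrative, and in practice you would publish the asset through the catalog.

# Re-running for the same input date replaces the same output, never appends.
output_path = "hdfs://mycluster/warehouse/daily_trading_summary/dt=2020-11-05"
summary.write.mode("overwrite").parquet(output_path)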

Write reusable, general-purpose Data Applications

Do not write too many one-time data applications. Instead, try to find common patterns and write reusable, general-purpose Data Applications that are well tested and well documented, so everyone can use them.

I suggest you use the spark-etl package to develop your data applications; the benefits are:

  • You are not locked in to a vendor; you can deploy your application to your own Spark cluster or to OCI Dataflow.
    • You can even test your data application on your dev machine with the pyspark package, using small-scale data.
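
To illustrate the reusable, parameterized application idea (this is a generic sketch, not the spark-etl API), a single application can serve many pipelines by taking its input, transform, and output as parameters. Plain paths are used here to keep the sketch self-contained; a real application would resolve assets through the catalog as recommended above.

import argparse
from pyspark.sql import SparkSession

def main(input_path, sql_text, output_path):
    spark = SparkSession.builder.getOrCreate()
    spark.read.parquet(input_path).createOrReplaceTempView("input")
    result = spark.sql(sql_text)                          # the transform is just SQL
    result.write.mode("overwrite").parquet(output_path)   # idempotent re-runs

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--sql", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()
    main(args.input_path, args.sql, args.output_path)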