Scheduling dbt pipelines with Apache Airflow

Apache Airflow

  • Airflow is a platform to programmatically author, schedule and monitor workflows.

  • Airflow helps automate scripts that perform tasks.

  • Principles:

    • Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
    • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
    • Extensible: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
    • Elegant: Airflow pipelines are lean and explicit. Parametrizing your scripts is built into its core using the powerful Jinja templating engine.
  • In Airflow all workflows are DAGs (directed acyclic graphs). A DAG consists of operators, where an operator defines an individual task that needs to be performed. There are different types of operators available, as given on the Airflow website (a minimal DAG sketch follows this list):

    • BashOperator - executes a bash command
    • PythonOperator - calls an arbitrary Python function
    • EmailOperator - sends an email
    • SimpleHttpOperator - sends an HTTP request
    • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
    • Sensor - waits for a certain time, file, database row, S3 key, etc…
    • Custom operators can also be implemented
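
As a rough illustration of the points above (and of how a dbt run might be scheduled, per the title of this page), here is a minimal DAG sketch. It uses Airflow 1.x-style imports to match the airflow initdb commands below; the dag_id, schedule and the ~/dbt_project path are illustrative assumptions, not taken from this wiki.

# a minimal DAG sketch: two tasks, run the dbt models and then test them
# (Airflow 1.x imports; dag_id, schedule and ~/dbt_project are illustrative)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# the DAG object ties the tasks together and holds the schedule
dag = DAG(
    dag_id="dbt_daily_run",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

# each operator instance is one task (one box in the graph view)
dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command="cd ~/dbt_project && dbt run",
    dag=dag,
)

dbt_test = BashOperator(
    task_id="dbt_test",
    bash_command="cd ~/dbt_project && dbt test",
    dag=dag,
)

# run the models first, then the tests
dbt_run >> dbt_test

The same pattern extends to the other operator types listed above; only the operator class and its arguments change.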

Installation and Setup

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page
  • Upon running these commands, Airflow will create the $AIRFLOW_HOME folder and lay an airflow.cfg file with defaults that get you going fast.

  • You can inspect the file either in $AIRFLOW_HOME/airflow.cfg, or through the UI in the Admin->Configuration menu. The PID file for the webserver will be stored in $AIRFLOW_HOME/airflow-webserver.pid or in /run/airflow/webserver.pid if started by systemd.

  • Out of the box, Airflow uses a sqlite database, which you should outgrow fairly quickly since no parallelization is possible using this database backend. It works in conjunction with the airflow.executors.sequential_executor.SequentialExecutor which will only run task instances sequentially. While this is very limiting, it allows you to get up and running quickly and take a tour of the UI and the command line utilities.
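
If you want to confirm those defaults on your own installation (assuming an Airflow 1.10-era setup like the one above), one way is to read the configuration from Python; the section and key names below are the standard ones, but treat the exact output as environment-dependent:

# print the executor and the metadata-DB connection Airflow is actually using
# (assumes AIRFLOW_HOME has been exported as in the setup steps above)
from airflow.configuration import conf

print(conf.get("core", "executor"))          # SequentialExecutor by default
print(conf.get("core", "sql_alchemy_conn"))  # a sqlite:// URL pointing at airflow.db in $AIRFLOW_HOME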

From the tutorial (linked in References below)

  • pip install apache-airflow
  • Create a folder and set it as AIRFLOW_HOME. In my case it is airflow_home. Once it is created, run the export command (shown above) to set it.
  • Make sure you are one folder above airflow_home before running the export command.
  • Within airflow_home, create another folder to keep DAGs. Call it dags (a minimal DAG file to drop in there is sketched at the end of this list).
  • Now call airflow initdb within the airflow_home folder. Once it is done, it creates airflow.cfg and unittests.cfg.
  • airflow.db is a SQLite file that stores all the configuration related to running workflows. airflow.cfg keeps the initial settings needed to get things running.
  • So far so good. Now, without wasting any time, let's start the web server: airflow webserver
  • Now when you visit 0.0.0.0:8080 it shows the Airflow home page (screenshot omitted).
    • You can see a bunch of entries here. These are the examples shipped with the Airflow installation. You can turn them off by setting load_examples to False in the airflow.cfg file.
  • DAG Runs tells you how many times a certain DAG has been executed.
  • Recent Tasks tells you which tasks within a DAG are currently running and what their status is.
  • Schedule tells you at what time a certain DAG will be triggered.
  • Example of a DAG (screenshot omitted):
    • You can see rectangular boxes, each representing a task.
    • You can also see different color boxes on the top right of the greyed box, named success, running, failed, etc. These are legends. In the picture all boxes have a green border; if you are still unsure, hover your mouse over the success legend to check (that screenshot is omitted too).
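
To see a box of your own in that graph view, a minimal DAG file you could save as airflow_home/dags/hello_world.py might look like the sketch below. The names, schedule and returned message are illustrative assumptions, not copied from the tutorial.

# a minimal single-task DAG to drop into the dags folder created earlier
# (illustrative names; Airflow 1.x-style import of PythonOperator)
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_hello():
    # the task body: any Python callable works here
    return "Hello from Airflow!"


dag = DAG(
    dag_id="hello_world",
    description="Simple tutorial-style DAG",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 12 * * *",  # every day at 12:00
    catchup=False,
)

hello_task = PythonOperator(
    task_id="hello_task",
    python_callable=print_hello,
    dag=dag,
)

After saving the file, the DAG should appear on the home page (it can take a moment, or a scheduler/webserver restart, for new files to be picked up), and a successful run gives you a green-bordered box like the ones described above.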

References:

https://towardsdatascience.com/getting-started-with-apache-airflow-df1aa77d7b1b