Airflow

  • On each heartbeat interval, the Airflow scheduler iterates through all DAGs, calculates each DAG's next scheduled time, and compares it with the wall-clock time to determine whether that DAG should be triggered.
  • start_date is simply the date from which the Airflow scheduler starts considering a DAG for scheduling.

Src: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463

  • Backfilling lets you run a newly added DAG for schedule intervals that have already passed.
  • Airflow schedules the first execution of a DAG at the end of its first schedule interval, i.e. at start_date + interval; see the sketch below.
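
A minimal sketch makes these rules concrete, assuming Airflow 2.4+ (for the schedule argument); the dag_id, dates, and task are illustrative, not from the source. With start_date 2024-01-01 and a daily schedule, the first run is created at 2024-01-02 00:00 covering the 2024-01-01 interval, and catchup=True makes the scheduler backfill every past interval up to the present.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="schedule_demo",           # illustrative name
    start_date=datetime(2024, 1, 1),  # date from which the scheduler considers this DAG
    schedule="@daily",                # each run fires once its schedule interval has ended
    catchup=True,                     # create runs for all past, un-run intervals (backfill)
) as dag:
    BashOperator(
        task_id="print_date",
        bash_command="date",          # trivial command so the DAG has something to run
    )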

Development

Installation

# Install OS prerequisites (Debian/Ubuntu)
apt update
apt install sqlite3
apt install python3-pip

# Quick, unpinned install (the constraints-based install below is the recommended route)
pip install apache-airflow
# Airflow needs a home. `~/airflow` is the default, but you can put it
# somewhere else if you prefer (optional)
export AIRFLOW_HOME=~/airflow

# Install Airflow using the constraints file
AIRFLOW_VERSION=2.4.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
# For example: 3.7
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example: https://raw.githubusercontent.com/apache/airflow/constraints-2.4.0/constraints-3.7.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

# The Standalone command will initialise the database, make a user,
# and start all components for you.
airflow standalone

# Visit localhost:8080 in the browser and use the admin account details
# shown on the terminal to login.
# Enable the example_bash_operator dag in the home page

Src: https://airflow.apache.org/docs/apache-airflow/stable/start.html

Common Commands

Get the Airflow environment config: airflow info

Get the DAG folder from the environment config: airflow info | grep dags_folder

List all DAGs: airflow dags list

List the tasks in a DAG: airflow tasks list {DAG_ID}

Test a single DAG task: airflow tasks test {DAG_ID} {TASK_ID} now or airflow tasks test {DAG_ID} {TASK_ID} {YYYY-MM-DD}
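
As a concrete target for these commands, a hypothetical file such as hello_dag.py saved in the dags_folder reported by airflow info would supply the {DAG_ID} and {TASK_ID} values the CLI expects (names below are illustrative, not from the source):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",        # used as {DAG_ID} in the commands above
    start_date=datetime(2024, 1, 1),
    schedule=None,             # no schedule; run manually or via `airflow tasks test`
) as dag:
    BashOperator(
        task_id="say_hello",   # used as {TASK_ID} in the commands above
        bash_command="echo hello",
    )

For example, airflow tasks test hello_dag say_hello now runs the task once without recording state in the metadata database.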

Development Environment

Open an SSH tunnel to reach the remote Airflow web UI on local port 8080: ssh {username}@{host} -L 8080:{host}:8080 -N

Airflow Configuration

  • scheduler_heartbeat_sec – how many seconds the scheduler waits between heartbeats, i.e. how often it performs the DAG-triggering check described above (set in the [scheduler] section of airflow.cfg)
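
A quick way to check the effective value is Airflow's configuration API; this is a sketch assuming it is run on the Airflow host with AIRFLOW_HOME already set up:

from airflow.configuration import conf

# scheduler_heartbeat_sec lives in the [scheduler] section of airflow.cfg
heartbeat = conf.getint("scheduler", "scheduler_heartbeat_sec")
print(f"scheduler heartbeat: {heartbeat}s")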

Bootstrap Issues

Can't connect to ('::', 8793)

  1. Check that the port is not already in use by another service: netstat -tulpn
  2. If it is, change the worker log server port in airflow.cfg: worker_log_server_port = 8794