Introduction to Apache Airflow for DLME - sul-dlss/dlme-airflow GitHub Wiki

Apache Airflow for DLME

An introduction to what Apache Airflow is and how it is used for DLME.

Resources

Terms

DAG Dashboard

DLME Airflow dashboard

Important features

  • Enable/Disable DAG: A DAG will not run (even manually) unless enabled
  • DAG name & tags: Clicking on the label will display the DAG
  • Runs: Displayed in Successful/Running/Failed order. Clicking each will display a list of DAG runs
  • Schedule: How this DAG is scheduled, using either cron syntax or a preset (e.g. @yearly, @once)
  • Last Run: Links to DAG view from the most recent DAG run
  • Play: Manually trigger the DAG
  • Reload: Refresh the DAG definition
  • Delete: Delete the DAG
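To help read the Schedule column, the sketch below maps the common Airflow presets to the cron expressions they stand for. The `describe_schedule` helper is a hypothetical name used only for illustration; it is not part of Airflow or of the DLME code.

```python
# Cron equivalents of common Airflow schedule presets.
# "@once" has no cron equivalent: the DAG is scheduled a single time.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def describe_schedule(schedule):
    """Explain a value shown in the Schedule column (hypothetical helper)."""
    if schedule is None:
        return "manual trigger only"
    if schedule == "@once":
        return "runs a single time"
    # Presets are translated; anything else is assumed to be a cron expression already.
    return PRESET_TO_CRON.get(schedule, schedule)


print(describe_schedule("@yearly"))     # 0 0 1 1 *
print(describe_schedule("30 2 * * 1"))  # 30 2 * * 1
```

A custom cron expression such as `30 2 * * 1` (02:30 every Monday) is passed through unchanged.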

Default DAG Display - Tree View

NOTE: This is the default display when navigating into a DAG. As a DAG grows in complexity, the task display can become hard to understand in the tree view, though the grid view of DAG runs can be helpful when debugging.

DAG Display - Default, Tree View

DAG Display - Graph View

This DAG view is generally more appealing and easier to understand. It updates as a DAG runs, so the visual representation of where a particular DAG run is in the task list can be very helpful.

Here we see a simple DAG with five (5) tasks:

  • configure_git
  • validate_metadata_folder
  • clone_metadata
  • pull_metadata
  • finished_pulling

The graph display makes it clear that validate_metadata_folder runs after configure_git and results in a branch between clone_metadata and pull_metadata. The final task, finished_pulling, is a DummyOperator: a placeholder task used for control flow.
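The dependencies that the graph view draws can be written down as a small plain-Python model. This is only a sketch of the structure described above, not the DLME DAG definition: in an Airflow DAG file the same edges would normally be declared between operators with the `>>` operator. The `upstream` mapping and `run_order` names are illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it can start.
upstream = {
    "configure_git": set(),
    "validate_metadata_folder": {"configure_git"},
    "clone_metadata": {"validate_metadata_folder"},
    "pull_metadata": {"validate_metadata_folder"},
    "finished_pulling": {"clone_metadata", "pull_metadata"},
}

# A valid execution order: every task appears after all of its upstream tasks.
run_order = list(TopologicalSorter(upstream).static_order())
print(run_order)
```

The two branch tasks have no dependency on each other, so their relative position in the order is not significant; only one of them actually runs in a given DAG run.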

DAG Display - Graph View

The border color of the tasks in this display is important, and a key is provided at the top of the display. Here we see that configure_git, validate_metadata_folder, clone_metadata, and finished_pulling each have a dark green border indicating SUCCESS. The pull_metadata task has a pink border, indicating SKIPPED.

This indicates that:

  1. configure_git ran and completed with a SUCCESS state.
  2. validate_metadata_folder then ran and completed with a SUCCESS state. It also returned a value that forced triggering of clone_metadata and skipping of pull_metadata.
  3. clone_metadata then ran and completed with a SUCCESS state, while pull_metadata was SKIPPED.
  4. finished_pulling joined the clone_metadata and pull_metadata branches and ended in a SUCCESS state.
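The run described above can be modeled in plain Python. In Airflow, a branching task (for example one built with `BranchPythonOperator`) returns the task_id of the branch to follow, and the other branch is skipped. The check inside `choose_branch` (whether a `.git` directory already exists) and the join rule in `run_states` are assumptions made for illustration; the actual DLME logic may differ.

```python
import os
import tempfile


def choose_branch(metadata_dir):
    """Hypothetical branch check: pull if a checkout already exists, else clone."""
    if os.path.isdir(os.path.join(metadata_dir, ".git")):
        return "pull_metadata"
    return "clone_metadata"


def run_states(chosen):
    """Model the final state of each task for a run that follows `chosen`."""
    states = {"configure_git": "success", "validate_metadata_folder": "success"}
    for task_id in ("clone_metadata", "pull_metadata"):
        states[task_id] = "success" if task_id == chosen else "skipped"
    # Assumed join rule: run when no branch failed and at least one succeeded.
    branch_states = [states["clone_metadata"], states["pull_metadata"]]
    joined = "failed" not in branch_states and "success" in branch_states
    states["finished_pulling"] = "success" if joined else "skipped"
    return states


# An empty directory has no checkout yet, so the clone branch is chosen.
with tempfile.TemporaryDirectory() as empty_dir:
    chosen = choose_branch(empty_dir)
states = run_states(chosen)
print(chosen, states)
```

Note that with Airflow's default trigger rule (all upstream tasks must succeed), a task placed after a skipped branch would itself be skipped, so a join task like finished_pulling usually needs a more permissive trigger rule to end in SUCCESS.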

Viewing Task Information

Hovering over a task

When displaying a DAG run, hovering over a task will display information about the task run:

Hover over a task in graph view

Clicking on a task

Clicking on a task in graph view opens a modal to dive deeper into a task run instance. The most helpful features are:

  • Log: Any log output from the individual task instance. This is very helpful because individual task logs are not lost in a full DAG or application log.
  • Run: Run an individual task.
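As a rough illustration of where the Log output comes from: messages written with Python's standard `logging` module inside a task's callable are captured by Airflow and shown in that task instance's log. The `summarize_harvest` function and its argument are hypothetical and stand in for any task callable.

```python
import logging

logger = logging.getLogger(__name__)


def summarize_harvest(record_ids):
    """Hypothetical task callable that reports its progress through logging."""
    logger.info("Received %d records", len(record_ids))
    for record_id in record_ids:
        # One line per record makes it easy to find a problem record in the task log.
        logger.info("Processing record %s", record_id)
    return len(record_ids)


processed = summarize_harvest(["record-1", "record-2"])
```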

Click on a task in graph view

Browse DAG Runs

Browse dag runs

View a DAG Run

View a task run with collapsed task groups

Here we see a DAG run display for a DAG that includes TaskGroups. TaskGroups represent a reusable task structure that can be included in many DAGs without code duplication.

Expanded TaskGroup

Here we see the validate_metadata DAG from above included as a task within this DAG as a TaskGroup. This allows us to reuse this set of tasks in any DAG without code duplication.
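The reuse idea can be sketched in plain Python as a builder function that returns the same five-task structure under a given group id. By default Airflow prefixes the ids of tasks inside a TaskGroup with the group id and a dot, which the sketch mimics. The function name `validate_metadata_edges` and the two DAG dictionaries are illustrative assumptions, not the DLME implementation.

```python
def validate_metadata_edges(group_id):
    """Build the five-task sub-graph with task ids prefixed by the group id."""

    def tid(name):
        return f"{group_id}.{name}"

    return {
        tid("configure_git"): set(),
        tid("validate_metadata_folder"): {tid("configure_git")},
        tid("clone_metadata"): {tid("validate_metadata_folder")},
        tid("pull_metadata"): {tid("validate_metadata_folder")},
        tid("finished_pulling"): {tid("clone_metadata"), tid("pull_metadata")},
    }


# Two hypothetical DAGs reuse the same builder instead of copying the tasks.
harvest_dag = validate_metadata_edges("validate_metadata")
harvest_dag["harvest"] = {"validate_metadata.finished_pulling"}

report_dag = validate_metadata_edges("validate_metadata")
report_dag["build_report"] = {"validate_metadata.finished_pulling"}

print(sorted(harvest_dag))
```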

Task view with expanded task group

Complex TaskGroup

TaskGroups are a very useful tool for organizing complex sets of tasks, and for generating tasks so that DAGs are easier to write. Here we see a complex set of 20 harvest tasks that were generated from the provider configuration and run in parallel.
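Generating tasks from configuration usually comes down to a loop. The plain-Python sketch below produces one task id per provider collection, all sharing the same upstream task so that nothing prevents them from running in parallel. The provider names, the `generate_harvest_tasks` function, and the `start` upstream task are made up for illustration; the real provider configuration is not shown here.

```python
# Hypothetical provider configuration: provider name -> list of collections.
PROVIDERS = {
    "provider_a": ["collection_1", "collection_2"],
    "provider_b": ["collection_1"],
}


def generate_harvest_tasks(providers, group_id="harvest", upstream_task="start"):
    """Return {task_id: upstream task ids} with one harvest task per collection."""
    tasks = {}
    for provider, collections in sorted(providers.items()):
        for collection in collections:
            task_id = f"{group_id}.{provider}_{collection}"
            # Every generated task depends only on the shared upstream task,
            # so the harvest tasks are independent of one another.
            tasks[task_id] = {upstream_task}
    return tasks


harvest_tasks = generate_harvest_tasks(PROVIDERS)
print(sorted(harvest_tasks))
```

Adding a provider or collection to the configuration adds a task without any change to the loop.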

Complex task group display