# Airflow Design
## Airflow Intro
Apache Airflow is used to incrementally update data in the 211Dashboard project's database. Airflow uses DAGs (or Directed Acyclic Graphs); each DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. For example, after scraping data from the web one would want to load that data into the database, so a simple DAG structure may look like this:
Scrape_Data → Load_Data
In the example above, Scrape_Data could be a Python operator that fetches the data from some URL, and Load_Data could be a PostgreSQL operator that uses a `COPY` command to load the resulting CSV file into a database table. The important thing to note is that Load_Data won't run until Scrape_Data completes, which is what defines the relationship and dependencies between these tasks within the broader pipeline. A more complete example of a DAG is depicted below; this image shows the 211Dashboard's Weekly update DAG.
![211Dashboard Weekly update DAG in the Airflow UI](images/airflow_screenshot_DAG.PNG)
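To make the idea concrete, here is a minimal sketch of the two-task Scrape_Data → Load_Data pipeline. The DAG id, URL, connection id, table name, and file path are placeholders for illustration, not the project's actual code, and the imports follow the Airflow 1.x module layout.

```python
from datetime import datetime

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator


def scrape_data():
    """Fetch a CSV from some URL and stage it where the load step can read it."""
    response = requests.get("https://example.com/data.csv")  # placeholder URL
    response.raise_for_status()
    with open("/tmp/scraped_data.csv", "wb") as f:
        f.write(response.content)


with DAG(
    dag_id="scrape_and_load_example",      # illustrative name only
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    scrape_task = PythonOperator(
        task_id="Scrape_Data",
        python_callable=scrape_data,
    )

    load_task = PostgresOperator(
        task_id="Load_Data",
        postgres_conn_id="dashboard_db",   # hypothetical Airflow connection id
        sql="COPY staging_table FROM '/tmp/scraped_data.csv' WITH (FORMAT csv, HEADER true);",
    )

    # Load_Data will not start until Scrape_Data finishes successfully.
    scrape_task >> load_task
```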
## 211Dashboard DAGs
There are six main DAGs within this project. The description and components of each are mapped out below:
- **Startup DAG**
  - This DAG creates the database environment at "startup" time.
  - Creates and defines all database tables
  - Backfills unemployment data (2019)
  - Populates "static" data tables (e.g. Census data, funding data, crosswalk mappings, areas of interest)
- **Daily DAG**
  - This DAG runs daily to keep the COVID-19 and United Way 211 data up to date.
- **Weekly DAG**
  - This DAG runs weekly and updates the Missouri Unemployment Claims counts data.
- **Monthly DAG**
  - This DAG runs monthly and updates the Unemployment Statistics data from the Bureau of Labor Statistics.
- **Manual Update DAG**
  - This DAG is planned as a way to manually trigger an update of the "static" data in the database.
  - This includes Funding data, Census data, HUD Crosswalk mapping data, and the project's areas of interest file.
  - Triggering this DAG updates all of these datasets at once. If some of the underlying files haven't actually changed, that's okay: the DAG simply re-loads the existing data files, so as long as all the correct files are still in the S3 bucket, no information is lost.
  - More info can be found on the Manual Update & Refresh DAGs page.
- **Refresh DAG**
  - This DAG must be manually triggered. It "refreshes" both the COVID and Unemployment Claims data (by "refresh" we mean that all the data in the core table is wiped out and filled back in again, which picks up any changes to "past" data).
  - When triggered, this DAG resets the "last success" timestamp for both the Daily and Weekly DAGs, so that the next time those DAGs run, the data is completely refreshed (see the sketch after this list).
  - If you don't want to wait for the Daily or Weekly DAG to run on its own schedule, you can manually trigger the Daily/Weekly DAG after this Refresh DAG to perform an immediate refresh of the data.
  - This DAG is not expected to be triggered often. More info can be found on the Manual Update & Refresh DAGs page.
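As an illustration of the "last success" handshake described under the Refresh DAG, one way to wire it up is with Airflow Variables: the Refresh DAG pushes the watermark back to an early epoch, and the incremental DAGs read it to decide how far back to fetch. The variable names, epoch value, and helper functions below are assumptions for the sake of the sketch, not the project's actual implementation.

```python
from datetime import datetime

from airflow.models import Variable

# Hypothetical Variable keys; the real DAGs may store their watermarks differently.
LAST_SUCCESS_KEYS = ["daily_dag_last_success", "weekly_dag_last_success"]

# A date far enough in the past to force a complete reload of the core tables.
REFRESH_EPOCH = "2019-01-01T00:00:00"


def reset_last_success():
    """Refresh DAG task: push the 'last success' watermarks back to the epoch
    so the next Daily/Weekly run re-pulls and reloads all historical data."""
    for key in LAST_SUCCESS_KEYS:
        Variable.set(key, REFRESH_EPOCH)


def get_update_window_start(key):
    """Daily/Weekly DAG helper: only fetch records newer than the stored watermark."""
    last_success = Variable.get(key, default_var=REFRESH_EPOCH)
    return datetime.fromisoformat(last_success)
```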