Manual Update & Refresh DAGs - stlrda/211Dashboard-Workflows GitHub Wiki

Manual Update

Four data types in this project are integrated with the Manual Update DAG. Each is listed below, along with the S3 files/folders and repository resource files that must exist for the data to 'update' successfully.

  1. Census
    • census_county (S3: folder)
    • census_tract (S3: folder)
    • resources/census.json (Repo: resource file)
  2. Funding
    • funding_data_final_csv_public.csv (S3: file)
  3. Crosswalk mapping
    • crosswalk (S3: folder)
  4. Areas of interest
    • census_county (S3: folder)
    • resources/areas_of_interest.json (Repo: resource file)
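
Before triggering the DAG, it can help to confirm that the S3 items listed above are present. The following is a minimal sketch, not part of the project: it assumes the folders and the funding file sit at the root of the bucket (the source does not say), treats a trailing slash as "folder", and works on a plain list of object keys so it runs without AWS credentials. The function name `missing_items` and the sample key names are hypothetical. The two repository resource files are not checked because they live in the repo, not in S3.

```python
# Required S3 items per data type, copied from the list above.
# ASSUMPTION: items live at the bucket root; a trailing "/" marks a folder.
REQUIRED_S3_ITEMS = {
    "census": ["census_county/", "census_tract/"],
    "funding": ["funding_data_final_csv_public.csv"],
    "crosswalk": ["crosswalk/"],
    "areas_of_interest": ["census_county/"],
}


def missing_items(existing_keys):
    """Return {data_type: [missing items]} for a listing of S3 object keys."""
    missing = {}
    for data_type, items in REQUIRED_S3_ITEMS.items():
        absent = []
        for item in items:
            if item.endswith("/"):
                # A folder counts as present only if at least one object
                # lies under its prefix (the bare placeholder key is ignored).
                found = any(
                    key.startswith(item) and key != item for key in existing_keys
                )
            else:
                found = item in existing_keys
            if not found:
                absent.append(item)
        if absent:
            missing[data_type] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical listing; in practice the keys would come from an S3 listing.
    keys = ["census_county/example.csv", "funding_data_final_csv_public.csv"]
    print(missing_items(keys))
```

An empty dictionary means every listed S3 prerequisite was found in the supplied key listing.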

Because all of these data sources come from static .csv or .xlsx files in the project's S3 bucket (uw211dashboard-workbucket), the Manual Update DAG uses the s3_transform_to_s3 method of the Scraper class (see scripts/scraper.py and Python Classes).

For all of these updates, the DAG structure resembles this:

truncate_table(s) → transform_static_file(s) → load_transformed_file

Note that although this DAG is meant to update all 4 data types, a database table only changes if the file(s) it is derived from have changed. For example, if you update the public funding data (funding_data_final_csv_public.csv) and then trigger the Manual Update DAG, the funding table in the database will update; the tables associated with the other data (census, crosswalk, areas of interest) will contain the same data as before, because their files in S3 have not changed.
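
The truncate, transform, load sequence and its behavior on unchanged files can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it uses an in-memory SQLite database and an inline CSV string in place of the project database and S3, and the table name `funding`, the columns `county` and `amount`, and the transform itself are all hypothetical. The three step functions are named after the DAG steps only for readability.

```python
import csv
import io
import sqlite3


def truncate_table(conn, table):
    # SQLite has no TRUNCATE; DELETE without WHERE empties the table.
    # The table name is interpolated, so it must come from trusted code.
    conn.execute(f"DELETE FROM {table}")


def transform_static_file(raw_csv):
    """Hypothetical transform: trim the county name, cast amount to float."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [(row["county"].strip(), float(row["amount"])) for row in reader]


def load_transformed_file(conn, table, rows):
    conn.executemany(f"INSERT INTO {table} (county, amount) VALUES (?, ?)", rows)


def manual_update(conn, table, raw_csv):
    """Run the three steps in the order the DAG uses."""
    truncate_table(conn, table)
    rows = transform_static_file(raw_csv)
    load_transformed_file(conn, table, rows)
    conn.commit()


def snapshot(conn, table):
    """Return the table contents in a stable order for comparison."""
    query = f"SELECT county, amount FROM {table} ORDER BY county"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funding (county TEXT, amount REAL)")
    source_file = "county,amount\n St. Louis ,100.5\nBoone,20\n"

    manual_update(conn, "funding", source_file)
    before = snapshot(conn, "funding")

    # Re-running against the same file rebuilds the table with identical rows.
    manual_update(conn, "funding", source_file)
    print(snapshot(conn, "funding") == before)
```

Because the table is emptied and then rebuilt from the file on every run, the result depends only on the file contents, which is why re-triggering the DAG without changing a file leaves the corresponding table's data as it was.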

Manual Refresh

This DAG is a special case and is not expected to be used often. We included it to "refresh" past data that has already been brought into the database but has since changed at the source. We could only envision needing this functionality for two of our data sources:

  1. COVID-19 data
    • Because this data and its reporting are very new, states and counties may adjust their reporting methodologies in the future. If this happens, they will have to revise past data to reflect the new, more standardized reporting methods.
    • In this case, all the data will be refreshed starting from 2019-01-01.
  2. Unemployment Claims data
    • State governments will sometimes adjust past unemployment claims data. If you believe this is occurring with the MO unemployment claims data, it may be wise to do an occasional refresh of the claims data.
    • Again, all data will be refreshed starting from 2019-01-01 when triggering this DAG.

** NOTE: Data is refreshed from 2019-01-01 onward because this project is only concerned with data after 2019-01-01.

After triggering this DAG, you (as a user/maintainer) do not need to do anything else. The DAG simply updates the "last success run" date in the cre_last_success_run_dt table for two records: the one where run_cd = 'WKLY_ALL' (unemployment claims) and the one where run_cd = 'DLY_ALL' (COVID). These updated dates then affect the migration of data from staging to core the next time the Daily DAG and Weekly DAG run. If you would like to see the refresh immediately, trigger the Refresh DAG and then trigger the Daily/Weekly DAG.
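
The date reset described above might look roughly like the sketch below. Only the table name cre_last_success_run_dt, the run_cd column, and the two run codes come from this page. The date column name `last_success_dt` is hypothetical, and the sketch assumes the DAG sets the date back to 2019-01-01, which is inferred from the refresh start date rather than stated outright. SQLite stands in for the project database so the example runs on its own.

```python
import sqlite3

REFRESH_START = "2019-01-01"          # refresh start date named on this page
RUN_CODES = ("WKLY_ALL", "DLY_ALL")   # unemployment claims, COVID


def reset_last_success_dates(conn, start=REFRESH_START):
    """Set the last-success date back for both run codes; return rows changed.

    ASSUMPTION: `last_success_dt` is a made-up column name for illustration.
    """
    placeholders = ", ".join("?" for _ in RUN_CODES)
    cursor = conn.execute(
        "UPDATE cre_last_success_run_dt "
        "SET last_success_dt = ? "
        f"WHERE run_cd IN ({placeholders})",
        (start, *RUN_CODES),
    )
    conn.commit()
    return cursor.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cre_last_success_run_dt "
        "(run_cd TEXT PRIMARY KEY, last_success_dt TEXT)"
    )
    conn.executemany(
        "INSERT INTO cre_last_success_run_dt VALUES (?, ?)",
        [("WKLY_ALL", "2020-06-01"), ("DLY_ALL", "2020-06-02")],
    )
    print(reset_last_success_dates(conn))
```

With the dates moved back, the next Daily and Weekly runs would treat everything from the refresh start date onward as not yet migrated, which is what produces the refresh.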