# Refreshing DLME Metadata in Airflow

Airflow is used to manage Extract, Transform, and Load (ETL) automation for DLME. It manages the execution and reporting of sequences of tasks organized as Directed Acyclic Graphs (DAGs): a DAG is a set of executable tasks with dependencies between them and no loops. A diagram of the DLME ETL DAGs for one provider will make this clear:

The diagram above shows the DAGs for Harvard University, which has two collections. Each box represents a task that is executed if and when the prior task succeeds. The tasks and their descriptions are listed below, followed by a sketch of how such a DAG might be expressed in Airflow:

- provider_collection_harvest: triggers a harvest that pulls metadata from the data provider and converts it to a CSV file.
- sync_collection_metadata: syncs the harvested metadata to Amazon S3, where the data is stored.
- transform_provider_collection: runs a Traject configuration that transforms the CSV file into an NDJSON file conforming to the DLME application profile.
- index_provider_collection: indexes the NDJSON file in the DLME dev environment.
- provider_collection_harvest_report: generates a coverage and data quality report that DLME data providers can use to understand how we transformed their data and to surface data coverage and quality issues.
- provider_collection_harvest_send_report: sends the above report to a number of preconfigured email addresses.
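
To make the DAG structure concrete, here is a minimal, hypothetical sketch of how a provider/collection pipeline like this could be wired together in Airflow. It is not the actual dlme-airflow code: the dag_id, the placeholder callables (harvest, sync_metadata, transform, and so on), and the scheduling arguments are illustrative assumptions, and the example assumes a recent Airflow 2.x release (2.4 or later for the schedule argument).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Hypothetical placeholder callables; the real harvest/transform/index logic
# in dlme-airflow lives elsewhere.
def harvest(): ...          # harvest provider metadata and write it out as a CSV file
def sync_metadata(): ...    # push the harvested CSV to Amazon S3
def transform(): ...        # run the Traject configuration to produce NDJSON
def index_records(): ...    # index the NDJSON into the DLME dev environment
def build_report(): ...     # build the coverage / data quality report
def send_report(): ...      # email the report to the preconfigured addresses


with DAG(
    dag_id="provider_collection_etl",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,                      # run on demand
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id="provider_collection_harvest", python_callable=harvest),
        PythonOperator(task_id="sync_collection_metadata", python_callable=sync_metadata),
        PythonOperator(task_id="transform_provider_collection", python_callable=transform),
        PythonOperator(task_id="index_provider_collection", python_callable=index_records),
        PythonOperator(task_id="provider_collection_harvest_report", python_callable=build_report),
        PythonOperator(task_id="provider_collection_harvest_send_report", python_callable=send_report),
    ]

    # Chain the tasks so each one runs only if and when the previous task succeeds,
    # matching the left-to-right order of boxes in the diagram above.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```

The key point is the chaining at the end: the >> operator declares that a downstream task runs only after its upstream task succeeds, which is exactly the left-to-right dependency shown in the diagram.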