Configuration Apache Airflow for DLME - sul-dlss/dlme-airflow GitHub Wiki

Apache Airflow for DLME Configuration

Terminology

Provider

Collection

Driver

Currently supported Drivers:

  • csv
  • iiif_json
  • oai_xml
  • xml*

Intake Catalogs

Each new data provider must be added to the catalog.yaml file under sources, like so:

aub:
    args:
      path: /opt/airflow/catalogs/aub.yaml
    description: "American University of Beirut"
    driver: intake.catalog.local.YAMLFileCatalog
    metadata: {}

catalog.yaml is read to determine where to fetch the configuration catalog for each provider. args.path and description differ for each provider and must be specified by the user; driver and metadata can be copied as shown above.

For each provider, create a separate configuration file and update args.path with its location. Here are the contents of an example configuration file:

metadata:
  version: 1
sources:
  aco:
    driver: oai_xml
    args:
      collection_url: https://libraries.aub.edu.lb/xtf/oai
      metadata_prefix: oai_dc
      set: "aco"
    metadata:
      fields:
        id:
          path: "//header:identifier"
          namespace:
            header: "http://www.openarchives.org/OAI/2.0/"
          optional: True

Each collection is nested under sources, and the configuration specific to that collection is nested within it. driver specifies the intake driver used to map the source data to a pandas DataFrame. The args vary slightly across drivers; these variations are listed below, under each driver heading. For all driver types, metadata.fields.id must be filled out with the path to the record identifier.
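
To illustrate how a path and namespace pair such as the id field in the example configuration selects a value, here is a minimal sketch using only Python's standard library. It is not the oai_xml driver's code; the sample response and the example.org identifiers are made up, and the path is written as ".//header:identifier" because ElementTree only accepts relative paths when searching from an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down OAI-PMH response used only for illustration.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

# Same prefix-to-URI mapping as the namespace block in the configuration.
NAMESPACE = {"header": "http://www.openarchives.org/OAI/2.0/"}

root = ET.fromstring(RESPONSE)

# The "header:" prefix is resolved through NAMESPACE before matching.
ids = [el.text for el in root.findall(".//header:identifier", NAMESPACE)]

# Without the prefix, the search looks for an un-namespaced element and finds nothing.
unprefixed = root.findall(".//identifier")

print(ids)
print(len(unprefixed))
```

The prefix name itself (header) is arbitrary; what matters is that it maps to the namespace URI the source document declares.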

To do: What does optional mean? Is it always set to optional?

csv

For csv files, specify the pandas dtype explicitly. If it is not specified, intake will attempt to guess the dtype and may guess incorrectly.
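
The following sketch shows one way a guessed type can go wrong: an identifier column with leading zeros. It uses only the standard library and does not call intake or pandas; guess is a hypothetical stand-in for type inference, and the column names and values are invented.

```python
import csv
import io

# Hypothetical CSV content; the ids carry meaningful leading zeros.
CSV_TEXT = "id,title\n00123,First\n00456,Second\n"


def guess(value):
    """Hypothetical stand-in for type inference: all-digit strings become integers."""
    return int(value) if value.isdigit() else value


def read_rows(text, dtypes=None):
    """Read CSV rows, converting a column with dtypes[column] when given, else guessing."""
    dtypes = dtypes or {}
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            col: dtypes[col](val) if col in dtypes else guess(val)
            for col, val in row.items()
        })
    return rows


guessed = read_rows(CSV_TEXT)                        # id is guessed to be an integer
explicit = read_rows(CSV_TEXT, dtypes={"id": str})   # id is kept as text

print(guessed[0]["id"], explicit[0]["id"])
```

With the guess, the leading zeros are lost and the identifier no longer matches the source; declaring the column as a string keeps it intact.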

iiif_json

The iiif_json driver will fetch all contents nested under the metadata field of a IIIF manifest. Objects that are not nested under metadata must be explicitly listed in the configuration file. Here is an example:

alexandria_bombardment:
    description: "Alexandria Bombardment of 1882 Photograph Album"
    driver: iiif_json
    metadata:
      data_path: auc/iiif/alexandria_bombardment
      config: auc_iiif_config_csv
      fields:
        context:
          path: '@context'
          optional: true
        description_top:
          path: 'description'
          optional: true
        id:
          path: '@id'
          optional: true
        iiif_format:
          path: 'sequences..format'
        profile:
          path: 'sequences..profile'
        resource:
          path: 'sequences..resource.@id'
        thumbnail:
          path: 'thumbnail..@id'
          optional: true 
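
The paths in this example look like JSONPath expressions, where ".." means "at any depth below". That reading is an assumption, and the sketch below is a simplified standard-library evaluator written for illustration, not the driver's implementation. The manifest is a made-up, heavily trimmed IIIF manifest.

```python
def descend(node, key):
    """Yield every value stored under `key` at any depth below `node`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from descend(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


def children(nodes, key):
    """Follow `key` one level down from each dict in `nodes`."""
    return [n[key] for n in nodes if isinstance(n, dict) and key in n]


def select(doc, path):
    """Evaluate a path such as 'sequences..resource.@id' against a parsed manifest."""
    head, *rest = path.split("..")
    nodes = [doc]
    for key in head.split("."):
        nodes = children(nodes, key)
    for part in rest:
        first, *tail = part.split(".")
        nodes = [v for n in nodes for v in descend(n, first)]
        for key in tail:
            nodes = children(nodes, key)
    return nodes


# Hypothetical manifest, trimmed to the keys the example paths touch.
manifest = {
    "@id": "https://example.org/iiif/manifest",
    "description": "An album",
    "thumbnail": {"@id": "https://example.org/thumb.jpg"},
    "sequences": [{"canvases": [{"images": [{"resource": {
        "@id": "https://example.org/img1.jpg", "format": "image/jpeg"}}]}]}],
}

print(select(manifest, "sequences..format"))
print(select(manifest, "sequences..resource.@id"))
print(select(manifest, "thumbnail..@id"))
```

Under this reading, a single configured path can collect values from every canvas in the manifest without the configuration spelling out each intermediate level.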

oai_xml

xml

The xml driver is intended to be a generic driver for parsing any xml file. As such, it cannot safely make assumptions about the data, such as its shape or naming conventions; these must be specified in the configuration file. The record_selector must be specified to identify the xml element that denotes a new record. All fields must be listed in the configuration file (with the correct path and namespace) in order to be mapped. Here is an example:

metadata:
  version: 1
sources:
  aims:
    description: "American Institute for Maghrib Studies"
    driver: xml
    metadata:
      data_path: aims
      config: aims_config
      record_selector:
        path: "//item"
        namespace:
      fields:
        id:
          path: ".//guid"
          namespace:
          optional: false
        title:
          path: ".//title"
          namespace:
          optional: true
    args:
      collection_url: https://feed.podbean.com/themaghribpodcast/feed.xml
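
As a rough sketch of how a record_selector and a set of field paths could turn a feed into records, the following uses only Python's standard library. It is not the xml driver's code: parse_records and the sample feed are hypothetical, "//item" is written as ".//item" because ElementTree only accepts relative paths on an element, and the optional flag is deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with the same element names as the configuration above.
FEED = """<rss><channel>
  <item><guid>ep-1</guid><title>First episode</title></item>
  <item><guid>ep-2</guid></item>
</channel></rss>"""

RECORD_SELECTOR = ".//item"
FIELDS = {"id": ".//guid", "title": ".//title"}


def parse_records(xml_text, record_selector, fields):
    """Return one dict per selected record, keyed by the configured field names."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(record_selector):
        # Field paths are evaluated relative to the record element;
        # findtext returns None when the element is absent.
        records.append({name: record.findtext(path) for name, path in fields.items()})
    return records


records = parse_records(FEED, RECORD_SELECTOR, FIELDS)
print(records)
```

Each dict corresponds to one row; a list like this can be passed to pandas.DataFrame to get a table with one column per configured field.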