1.2 Processing triggers and date controls

In Pachyderm, any change to an input datum triggers processing. This is both powerful and annoying. Powerful in that so long as changes to inputs make it into Pachyderm, processing occurs automatically. Annoying in that, without careful design, you could end up reprocessing everything with the slightest change. This section describes how the general DAG has been designed to load metadata into Pachyderm, limit reprocessing to impacted data, and control the overall date ranges of the data that are processed. The model DAG will be referenced throughout this Wiki page, so it's a good idea to view it side-by-side with this Wiki. Note that glob patterns (and datum delineation) are another critical part of controlling what data are (re)processed - see the Wiki page devoted to glob patterns for more info.

Sources of data required for data processing

There are many sources of data and metadata needed to process NEON data. The main ones that most pipelines share are:

  • L0 data
  • Calibrations
  • Asset install information
  • Location metadata, including active periods and groups
  • QC thresholds
  • File schemas and empty file templates

Much of NEON's metadata management is handled outside of Pachyderm. For example, calibrations are generated by CVAL and stored in a cloud bucket, and location metadata is stored in a database and managed via a UI. There are ways that data can stream directly from their sources of truth into Pachyderm pipeline repos, and we may eventually use these methods, especially for L0 data ingest. In the short to medium term, however, we need automated methods to transfer or convert data and metadata into Pachyderm.

Daily cron and data loaders

Data and metadata stored in NEON databases are automatically loaded into Pachyderm on a daily basis using a cron trigger. This includes asset install information, location metadata, and QC thresholds. The cron trigger looks like a puppet master hovering over the model DAG because it is connected to the metadata loaders for these sources. The same cron trigger is used for the whole DAG so that a single job loads all updates at once and they are processed together. But why are the metadata loaders branches off the main trunk of the DAG instead of integrated directly? Read on.

Since metadata updates (such as location metadata) can apply to any point in the past, the metadata loaders refresh all relevant metadata for the chosen source type or product for all time. Unfortunately, there is a mismatch in the granularity of datums between L0 data and the associated metadata. An L0 datum is naturally one day of data for one source ID (L0 data stream into NEON continuously, so one day is a natural unit to process). This unit of data is easy to see in the L0 repo structure:

/SOURCE_TYPE
   /YEAR
      /MONTH
         /DAY
            /SOURCE_ID
               /DATA_TYPE
                  /DATA_FILE

In contrast, take a look at the repo structure of asset install information (which is very similar to other metadata), as loaded by the location_asset loader module:

/SOURCE_TYPE
  /SOURCE_ID
     /DATA_FILE

Note how the YEAR/MONTH/DAY structure is missing from this repo. This is because asset metadata spans periods of time and is stored in the database as date ranges rather than discrete dates. These date ranges are included in the DATA_FILE for the SOURCE_ID.
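
To make the mismatch concrete, here is a minimal Python sketch; the paths, source IDs, and file contents are invented purely for illustration:

# Illustration of the granularity mismatch (paths and contents are hypothetical).
from datetime import date

# An L0 datum is naturally one day of data for one source ID:
l0_datum = "prt/2020/07/14/12345/data/prt_12345_2020-07-14.parquet"
source_type, year, month, day, source_id = l0_datum.split("/")[:5]
l0_day = date(int(year), int(month), int(day))  # one discrete day

# Asset install metadata has no YEAR/MONTH/DAY levels; the single file for a
# source ID holds date ranges instead of discrete dates:
asset_installs = {
    "source_id": "12345",
    "installs": [
        {"named_location": "CFGLOC101", "install_date": "2019-03-01", "remove_date": "2020-09-30"},
        {"named_location": "CFGLOC202", "install_date": "2020-10-01", "remove_date": None},
    ],
}
# A change to this one metadata file touches every L0 day for source ID 12345
# unless it is first redistributed to only the days it actually applies to.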

The challenge resulting from this datum mismatch is that, without intervention, an update to the metadata for a source ID would apply to ALL days of L0 data from that source ID, and would thus trigger a full reprocessing of every downstream module whose input repos include the metadata file (the file has changed, and Pachyderm automatically reprocesses datums that have changed).

We solve this issue by inserting an assignment module after the metadata loader that filters and distributes the metadata only to the date range to which it applies, thus matching the repo structure and datum granularity of the data within the main trunk of the DAG. In fact, because asset and location metadata are updated infrequently, the assignment modules pre-populate the filtered information for the latest year, thus executing only once per year and/or when the metadata for a source ID or location changes.
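
The sketch below shows the general idea of an assignment step (this is not the actual module code; the file layout and argument names are assumptions):

# Copy the metadata file for a source ID into a YEAR/MONTH/DAY output structure,
# but only for the dates its date ranges cover, limited to the years of interest.
import shutil
from datetime import date, timedelta
from pathlib import Path

def assign_metadata(meta_file: Path, source_type: str, source_id: str,
                    ranges: list[tuple[date, date]], years: set[int],
                    out_root: Path) -> None:
    for start, end in ranges:
        day = start
        while day <= end:
            if day.year in years:  # constrain to the years present in the L0 data
                out_dir = out_root / source_type / f"{day:%Y/%m/%d}" / source_id
                out_dir.mkdir(parents=True, exist_ok=True)
                shutil.copy(meta_file, out_dir / meta_file.name)
            day += timedelta(days=1)

After this step the metadata repo has the same SOURCE_TYPE/YEAR/MONTH/DAY/SOURCE_ID granularity as the L0 repo, so a metadata change only invalidates the days it actually covers.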

Caveats

You may notice in the model DAG that the threshold loader is not followed by an assignment module before entering the main trunk of the DAG. This is because thresholds typically are not associated with a date range and a change to thresholds would indeed result in a full reprocessing. Additionally, thresholds are most often specified for an entire site rather than individual named locations, making assignment to individual named locations difficult outside the main trunk. The selection of the appropriate threshold is therefore done within the main trunk of the DAG, but an assignment module may be developed in the future.

You may also notice that the L0 data loader in the model DAG is not connected to the daily cron. This is purely a product of the current status of development. When the system moves to production, a new date of data will be loaded into the [SOURCE_TYPE]_import_trigger repo via the daily cron.

Data years and date control

The metadata loaded into Pachyderm covers the entire history stored in the database. There will be scenarios in which we choose to load and process only a subset of L0 data into the pipeline. (Remember from Wiki section 1.1 that date constraints on L0 data import are implemented by the [SOURCE_TYPE]_import_trigger.) Two additional methods of date constraint limit the date range handled by the assignment modules and the date range processed into data products.

data_years

The data_years pipeline is connected to the L0 data repository. It looks at the range of data years that are present in the L0 data and writes a single text file listing these years. This text file is an input to both the calibration_assignment module and the location_asset_assignment module, which respectively distribute the calibrations and location installs for each asset to their applicable data dates. These modules limit their assignments to the year range indicated in the data_years text file, thus covering the actual date range in the L0 data.
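
A minimal sketch of this idea, assuming the L0 repo layout shown earlier and an output file named data_years.txt (the real module's file name and details may differ):

# List the YEAR directories present under each source type in the L0 repo and
# write them to a single text file for downstream assignment modules to read.
from pathlib import Path

def write_data_years(l0_root: Path, out_file: Path) -> None:
    years = sorted({year_dir.name
                    for source_type in l0_root.iterdir() if source_type.is_dir()
                    for year_dir in source_type.iterdir()
                    if year_dir.is_dir() and year_dir.name.isdigit()})
    out_file.write_text("\n".join(years) + "\n")

# e.g. write_data_years(Path("/pfs/prt_data_source"), Path("/pfs/out/data_years.txt"))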

date_control

Although typically the same, the actual date range of the L0 data can differ from the intended date range of the data products. This could happen if all sensors of a source type were malfunctioning and not sending any data (improbable, yes, but possible). Whereas calibration and asset install information are needed only when L0 data from an asset are present, NEON publishes data from locations whether or not an asset was actually sending data from the location (in this case the Null flag would be raised). Thus, at the point of the date_gap_filler module, where an empty file is inserted for active locations that are missing sensor data, the module needs to know the intended date range of the output. This date range is specified in and provided by the date_control module (included in the same pipeline as the cron trigger), and can be constrained to a particular start and end date, or begin at a start date and continue to the present day. When combined with the active periods for each named location, only data from active locations over the intended date range make it past the date_gap_filler and into published data products.

The date_control module performs this function with two components. The first is very similar to the data_years module: a single file listing the year range of data to process is included in the output repo. This file is an input to the assignment module for the active periods of named locations (location_active_dates_assignment) so that this metadata is populated for the years of interest. Second, the date_control module creates a YEAR/MONTH/DAY directory structure in its output repo for the full date range indicated in its input parameters. When this repo is combined, via a join input, with the data in the main trunk of the DAG in the date_gap_filler module, only active locations over the intended date range pass through for further processing.
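
A rough sketch of what the date_control output could look like, with the output file name and repo paths assumed for illustration:

# Write a file listing the years to process plus an empty YEAR/MONTH/DAY
# directory skeleton covering the intended date range; date_gap_filler can then
# join this skeleton against the main trunk of the DAG.
from datetime import date, timedelta
from pathlib import Path

def write_date_control(start: date, end: date | None, out_root: Path) -> None:
    end = end or date.today()  # an open-ended range runs to the present day
    years = range(start.year, end.year + 1)
    (out_root / "data_years.txt").write_text("\n".join(str(y) for y in years) + "\n")
    day = start
    while day <= end:
        (out_root / f"{day:%Y/%m/%d}").mkdir(parents=True, exist_ok=True)
        day += timedelta(days=1)

# e.g. write_date_control(date(2020, 1, 1), None, Path("/pfs/out"))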

Loaders operating outside of Pachyderm

There are two additional methods of loading inputs into Pachyderm.

Calibrations

The source of truth for calibration files is a cloud bucket where the CVAL lab places all calibration files. A calibration file is never edited once produced. Thus, it is a simple process of loading all files from this bucket initially into Pachyderm and then monitoring for any new file that is placed in the bucket. A cloud function in GCP performs this task.
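
A very rough sketch of this idea follows; the bucket layout, Pachyderm repo name, and the use of pachctl are assumptions rather than a description of the deployed function:

# GCS "object finalize" background function: fires once for each new calibration
# file placed in the bucket and copies that one file into a Pachyderm repo.
import subprocess
import tempfile
from google.cloud import storage

def on_new_calibration(event, context):
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    with tempfile.NamedTemporaryFile() as tmp:
        blob.download_to_filename(tmp.name)
        # Calibration files are never edited, so a one-way copy of new objects is enough.
        # The repo name "calibration" is assumed for illustration.
        subprocess.run(
            ["pachctl", "put", "file", f"calibration@master:/{event['name']}", "-f", tmp.name],
            check=True,
        )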

Git integration

Some information used in Pachyderm, such as file schemas and the empty file templates for the date_gap_filler module, is version controlled in Git repositories. We are developing a runner that automatically syncs the relevant files in a Git repo to its corresponding Pachyderm repo any time a commit is made to the production branch.
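
One possible shape of such a sync, sketched purely for illustration (the actual runner is still in development, and the use of pachctl here is an assumption):

# Push a directory of schema/template files from a Git checkout into a Pachyderm
# repo; intended to run whenever a commit lands on the production branch.
import subprocess
from pathlib import Path

def sync_dir_to_pachyderm(local_dir: Path, pachyderm_repo: str, branch: str = "master") -> None:
    # pachctl can put a whole directory recursively in a single commit.
    subprocess.run(
        ["pachctl", "put", "file", f"{pachyderm_repo}@{branch}:/", "-r", "-f", str(local_dir)],
        check=True,
    )

# e.g. sync_dir_to_pachyderm(Path("schemas"), "schemas")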