
Welcome to the NEON-IS-data-processing wiki!

How to use this wiki

This wiki covers the concepts and implementation instructions for creating NEON instrumented systems (IS) data products in the new IS data processing pipeline in Pachyderm. Throughout the wiki, [SENSOR] and [SOURCE_TYPE] are synonymous and are used as placeholders for a sensor name in the bash/Python/R code. (Note: we use SOURCE_TYPE because data can be sent from non-sensors, like heaters, so SOURCE_TYPE is more generic.) Before running a command or script, replace [SENSOR] or [SOURCE_TYPE] with the name of your sensor/source type.

The numbers at the beginning of each Wiki section are organized as follows:

  • 0 - Workflow for product creation
  • 1 - General concepts for processing pipelines
  • 2 - Details and how-to info for particular modules or series of modules in the pipeline
  • 3 - Module development
  • 4 - Release and deployment
  • 9 - Technical tips

Terminology

  • source type: The type of device or sensor from which the data derives, typically matching the schema of the data it collects. The name of each source type is controlled by the Engineering team, and is typically a short version of the sensor model.
  • schema: A document describing the format of a data file, including column names, units, and other metadata about the data. These are in Avro format.
  • source ID: The unique identifier for a specific sensor or source type. This is typically the asset UID in NEON's asset management system (Maximo). The source ID is attached to the raw L0 data sent from the field site.
  • named location: The virtual location in NEON's database that corresponds to a physical location at a field site. The named location hierarchy is an abstracted representation of a NEON field site. When a sensor is installed physically in the field, it must also be installed virtually at a named location in order to send, receive, and process its data. Many properties are attached to each named location that aid data processing, such as the sensor's geolocation and orientation, expected data rate, and date range of operation. You can view and edit properties of named locations in the Named Location Manager on NEON's internal SOM portal.
  • module: A processing step in Pachyderm. A module is essentially a docker image with the code required to perform an action on the data, such as applying calibration coefficients to transform the data from L0 (raw units) to L0' (calibrated, engineering units) and produce associated uncertainty data and quality flags.
  • DAG: Directed acyclic graph. This is the diagram that shows how modules are connected together to process data for a data product.
  • pipeline specification: The set of instructions for deploying a module in Pachyderm, including the inputs, outputs, and processing parameters for the module.
  • pipeline: In Pachyderm terminology, a pipeline is a single deployed module. We tend to use this term interchangeably to mean a single deployed module as well as the full product DAG.
  • repo: Short for repository. In Pachyderm a repo is a standalone directory structure with files in it. Think of this as a mini hard drive. Repos are typically the inputs and outputs of a pipeline. A pipeline does not have to have an input repo, but every pipeline has an output repo with the same name as the pipeline.
  • term: The name of a column or variable in a data file. The terms database is currently the source of truth for all terms used in L1 data product outputs, but interim terms are uncontrolled.

Github repo organization

This Git repo is organized as follows:

  • / (root) Reserved for the repository readme, license, and .gitignore. Please do not place any other individual files in the root directory.

  • /flow Contains the contents for each science processing module. These are processing modules that perform science operations such as calibration, data regularization, quality control, statistics, etc. One folder per module. Each folder contains:

    • The workflow script and/or wrapper function(s) specific to the module (common functions are included in packages within the /pack folder). Place as much of the module code as possible within a wrapper function so that unit tests can be written for it.
    • Unit tests for the module.
    • Dockerfile for the module, used to build a Docker image with the code and all its dependencies in it.
    • Dependency documentation. For R modules, this is a renv lockfile (renv.lock) which lists all the packages and their versions required for the code. The Dockerfile reads the lockfile and automatically installs the packages in the Docker image.
  • /modules Contains the contents for each cyber-infrastructure processing module. These are processing modules that perform database and repository structuring operations common to all pipelines, such as data and metadata loading, repo grouping and filtering, etc. The folder structure is very similar to /flow (and perhaps will be combined with /flow in the future).

  • /modules_combined Contains the Dockerfiles for combined modules, in which two or more modules from the /flow and/or /modules directories have been merged into a single module in Pachyderm. The underlying code base of the modules is the same; they are simply placed into the same Docker image so they can be run sequentially in the same Pachyderm pipeline for performance reasons.

  • /pack Contains packages that provide common functionality across modules. One folder per package. Each folder contains:

    • Functions in the package.
    • Package metadata, including version, license information, and any other documentation specific to the package (e.g. NAMESPACE & DESCRIPTION files for R packages, internal data, manual pages for functions, dependency documentation etc.)
    • Unit tests for the package functions.
    • Dockerfile for the package, used to build a Docker image with the package and all its dependencies installed.
  • /pipe Contains specification files for deploying data product pipelines (i.e. the sequence of processing modules that creates each data product) in Pachyderm. One folder per continuous DAG section, named for either the source type or data product (see the Wiki section on 1.0 Pipeline & repo structure, pipeline naming, terms). Each folder contains all the pipeline specifications that make up the DAG section.

  • /utilities Handy utilities to support data product creation in Pachyderm, such as standing up whole DAG sections, copying data to/from Pachyderm, etc. Loosely organized by functionality.

  • /wiki Images embedded in wiki pages.

Best practices

Minimize data stored in the Git repo

As a general rule, the only data stored in this Git repo should be that necessary for unit testing or internal package data. Any data used for this purpose should be reduced to the minimum required, by e.g. removing all but the few rows needed to validate function behavior.
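
For example, a test fixture can be pared down before committing it. A minimal sketch (the file names here are hypothetical):

```r
# Trim a full sensor output file to just the rows a unit test needs
# (file names are hypothetical).
data <- read.csv("full_sensor_output.csv")
dataSub <- head(data, 10)  # keep only enough rows to validate function behavior
saveRDS(dataSub, file = "tests/testthat/data/sensor_output_subset.rds")
```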

Naming conventions

There are no strict naming conventions enforced in this Git repo. However, best practice is to use short 2-4 character abbreviations for words, combined hierarchically with a delimiter so that code applicable to specific areas sorts together alphabetically. Examples: def.cal.conv.R, def.cal.meta.R.

Naming convention for code

Use a prefix that indicates the type of code. Current prefixes include:

  • def: a definition function that provides base functionality that could be applied in different scenarios
  • wrap: a wrapper function that applies one or more definition functions for a particular use case.
  • flow: a workflow script. This is typically the code for a particular processing module deployed in Pachyderm, applying wrapper and/or definition functions to process all the data that is provided to it.

For code, the recommended delimiter between terms is a period (e.g. flow.cal.conv.R).

Naming convention for docker images

For docker images, the recommended delimiter is a dash, with a prefix neon-is- followed by the name of the main workflow script that it houses, followed by a suffix for the programming language (e.g. -r). For example, the main workflow script for the calibration conversion module is the R script flow.cal.conv.R, and its associated docker image is neon-is-cal-conv-r.

Naming convention for pipelines

For pipeline specifications, name the file the same as the pipeline name. The recommended delimiter is an underscore. For example, the pipeline for performing calibration conversion for prt sensors is called prt_calibration_conversion. Its associated pipeline specification file is called prt_calibration_conversion.json. Pipeline names tend to deviate from the 2-4 character abbreviations in favor of more human readable names, although the names are used consistently across pipelines. See the Wiki page on 1.0 Pipeline & repo structure, pipeline naming, terms for more guidelines on pipeline naming.

Documentation

All workflow scripts and functions should contain a header that lists the title, authors, description, inputs, outputs, relevant references and a changelog. The standard Roxygen header for R functions provides a great template. Be sure to identify the data type or class of inputs and outputs. Ideally, include an easy-to-reproduce example of how to run the code or function. Also comment within the code itself. Concise commenting throughout the code on what each portion is doing and any special considerations will aid error-free edits and updates.
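
As an illustration, here is a sketch of such a header for a hypothetical definition function (names, references, and coefficients are placeholders, not an actual module):

```r
##############################################################################################
#' @title Convert raw resistance to temperature (hypothetical example)
#' @author Jane Scientist \email{jscientist@@example.com}
#' @description Definition function. Illustrates the standard header: applies example
#' calibration coefficients to convert L0 resistance readings to L0' temperature.
#' @param data Data frame of L0 readings with numeric column resistance. [data frame]
#' @param coef Named numeric vector of calibration coefficients, e.g. c(a = 0.01, b = -2). [numeric]
#' @return Data frame with calibrated column temp added. [data frame]
#' @references Placeholder for the relevant ATBD or engineering document.
#' @examples
#' def.example.cal(data.frame(resistance = c(100.1, 100.3)), coef = c(a = 0.01, b = -2))
# changelog and author contributions / copyrights
#   Jane Scientist (YYYY-MM-DD): original creation
##############################################################################################
def.example.cal <- function(data, coef) {
  data$temp <- coef[["a"]] * data$resistance + coef[["b"]]
  data
}
```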

Versioning

Use semantic versioning following the standard Major.Minor.Patch format (see: https://semver.org/) for both code packages and docker images. Update for every deployed change. Before the code in this repo goes to production (i.e. used to produce data products on the NEON Data Portal), versions should begin with 0.#.#.
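
If the code is an R package, one hedged way to bump its version (assuming the usethis package, which this repo does not require) is:

```r
# Bump the Minor component of the Version field in DESCRIPTION, e.g. 0.3.2 -> 0.4.0.
# Editing DESCRIPTION by hand works just as well.
usethis::use_version("minor")
```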

Use functions as much as possible

Functions can be re-used across processing modules and can be validated with unit tests. As much as possible, create functions for tasks that are likely to be repeated, and house the majority of the code for processing modules in a wrapper function so that tests can be written for it.
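
A minimal, hypothetical sketch of this pattern (function and column names are invented for illustration):

```r
# Definition function: small, general-purpose, easy to unit test.
def.temp.conv.K <- function(tempC) {
  tempC + 273.15  # convert Celsius to Kelvin
}

# Wrapper function: applies one or more definition functions for a particular use case,
# so the bulk of the module's logic can be tested outside Pachyderm.
wrap.temp.conv <- function(data) {
  data$tempK <- def.temp.conv.K(data$tempC)
  data
}

# The flow script (e.g. flow.temp.conv.R) would then loop over the files in the module's
# input repo and call wrap.temp.conv() on each one.
```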

Use hierarchical logging

All R code within this repo uses hierarchical logging so that the right level of detail is displayed for the situation. In increasing order, logging levels include DEBUG, INFO, WARN, ERROR, and FATAL. Logging at a particular level will show logs at that level and higher. For example, in normal operation, only logs at the level of INFO may be needed to display what the code is doing. When an error in the code is discovered, logging at the DEBUG level will display more fine-grained detail to better pinpoint where the problem is. All R code in this Git repo uses the lgr package. See the handy NEONprocIS.base::def.log.init function to initialize logging in your code.
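
A minimal sketch using lgr directly (the logger name and messages are illustrative; see def.log.init for the project's standard initialization):

```r
library(lgr)

lg <- lgr::get_logger("flow.cal.conv")  # illustrative logger name
lg$set_threshold("info")                # show INFO and higher; set to "debug" for full detail

lg$debug("Reading calibration file %s", "calibration.xml")  # hidden at the INFO threshold
lg$info("Starting calibration conversion")
lg$warn("No valid calibration found; flagging output")
lg$error("Unable to write output file")
```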

Use dependency management

Use renv, packrat, etc. to explicitly document & track dependencies for your code. See the Wiki page on 3.0 Updating modules or creating new modules for more info.
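
A minimal sketch of the renv workflow (run from the module or package directory):

```r
# One-time setup: creates a project-local library and an renv.lock file.
renv::init()

# After installing or updating packages during development, record the exact
# versions in renv.lock.
renv::snapshot()

# Elsewhere (e.g. during the Docker image build), recreate the library from the lockfile.
renv::restore()
```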

Make your code efficient at scale

Computational and memory efficiency are extremely important when code is run at scale. Use a code profiler (such as the profvis package for R) to identify CPU and memory bottlenecks in your code. Packages that use C bindings or more efficient algorithms can often replace slow functions, sometimes outperforming 'vectorization' and other standard best practices for the language.
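
A minimal sketch of profiling a block of code with profvis (the workload is arbitrary, chosen only to show a bottleneck in the flame graph):

```r
library(profvis)

profvis({
  x <- data.frame(a = runif(1e6), b = runif(1e6))
  y <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {   # deliberately slow loop: shows up as the bottleneck
    y[i] <- x$a[i] * x$b[i]
  }
  z <- x$a * x$b                  # vectorized equivalent for comparison
})
```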