1.3 Groups and Context - NEONScience/NEON-IS-data-processing GitHub Wiki

Groups

A group brings together specific named locations that are required to be processed together in the creation of a data product. A good example of a group is the instance of the Single Aspirated Air Temperature product on measurement level 3 on the tower at CPER (see diagram below, labeled temp-air-single_CPER000030). The processing algorithm for this product instance requires data from 4 different sensors:

prt temperature sensor at ML3 at CPER
fan + tachometer (dualfan) sensor in the aspirated shield at ML3 at CPER
heater in the aspirated shield at ML3 at CPER
2D wind sensor (windobserverii) at ML3 at CPER

All of these sensors are installed at different named locations. Placing them in a group allows the data collected at these locations to be brought together for processing.

A group name is unique and consists of a prefix and a location descriptor in the general format:

Group name = PREFIX_LOCATION

where...

PREFIX = Data product short name (preferred), or good descriptor
LOCATION = SITEHORVER

The prefix helps relate a group to similar groups, such as the three temp-air-single groups in the diagram above, which are all instances of the Single Aspirated Air Temperature data product. In fact, if the group feeds into a single data product, the prefix should match the short name for the data product, which can be found in the data product manager in the SOM portal. If the group feeds into several data products, choose a more general descriptor. Use kebab case (dash-separated) to separate words in the PREFIX.

The LOCATION descriptor indicates the specific product instance. Generally, the location descriptor will match the format SITEHORVER, which combines the 4-letter NEON site code and the horizontal and vertical location indices. Do not use any special characters to separate terms in the LOCATION descriptor, especially underscores (_) or dashes (-).

An underscore separates the PREFIX descriptor from the LOCATION descriptor. This aids searching for all groups matching a particular prefix. For example, searching for "par_" (note the underscore) will unambiguously return all groups with 'par' as the full prefix, whereas searching for "par" would return groups matching multiple prefixes, such as "par" and "par-surfacewater" groups.
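The naming convention above can be checked mechanically. The following is a minimal Python sketch, not part of the NEON codebase: the function names are hypothetical, the lowercase kebab-case prefix is an assumption, and the 3-digit widths for HOR and VER are inferred from examples such as par_CPER000040. Names that deviate from the general SITEHORVER form would be rejected by this pattern.

```python
import re
from typing import NamedTuple

# Assumed pattern: kebab-case prefix, one underscore, 4-letter site code,
# then 3-digit HOR and 3-digit VER (widths inferred from par_CPER000040).
GROUP_NAME = re.compile(
    r"^(?P<prefix>[a-z0-9]+(?:-[a-z0-9]+)*)_"
    r"(?P<site>[A-Z]{4})(?P<hor>\d{3})(?P<ver>\d{3})$"
)


class GroupName(NamedTuple):
    prefix: str
    site: str
    hor: str
    ver: str


def parse_group_name(name: str) -> GroupName:
    """Split a group name into prefix, site, HOR and VER (hypothetical helper)."""
    match = GROUP_NAME.match(name)
    if match is None:
        raise ValueError(f"not in PREFIX_SITEHORVER form: {name!r}")
    return GroupName(**match.groupdict())


def groups_with_prefix(names, prefix):
    """Keep only names whose full prefix equals `prefix` (the "par_" search)."""
    return [name for name in names if name.startswith(prefix + "_")]
```

For instance, parse_group_name("temp-air-single_CPER000030") yields the prefix temp-air-single, site CPER, HOR 000 and VER 030. Matching on the prefix plus the underscore is what keeps a search for 'par' from also returning 'par-surfacewater' groups.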

Note that:

A location can be included in more than one group
Groups may be members of other groups

An example of the latter is the Barometric Pressure data product, for which an example group is shown in the diagram above (pressure-air_CPER000035). Producing this product requires data from the barometric pressure sensor (ptb330a) on the tower along with the L1 output for the Relative Humidity data product on the tower at the same site. Thus, the group for the instance of Barometric Pressure at CPER will include the named location for the ptb330a at CPER and the group for the Relative Humidity product on the tower at CPER (rel-humidity_CPER000040).
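The two membership rules can be pictured as a simple lookup table. The sketch below is illustrative only and does not reflect how the PDR database or Pachyderm store memberships: the table, the helper names, the placeholder location names (LOC_PTB330A, LOC_RH_SENSOR) and the group example-group_CPER000040 are all hypothetical.

```python
from collections import defaultdict

# Hypothetical membership table: group name -> direct members.
# A member is either a named location or the name of another group.
membership = {
    "par_CPER000040": ["CFGLOC101563", "CFGLOC101564"],
    "example-group_CPER000040": ["CFGLOC101563"],  # hypothetical second group
    "rel-humidity_CPER000040": ["LOC_RH_SENSOR"],  # placeholder location name
    "pressure-air_CPER000035": ["LOC_PTB330A", "rel-humidity_CPER000040"],
}


def is_group(member, membership):
    """A member is a group if it has members of its own in the table."""
    return member in membership


def groups_by_member(membership):
    """Invert the table: member name -> set of groups it belongs to directly."""
    index = defaultdict(set)
    for group, members in membership.items():
        for member in members:
            index[member].add(group)
    return dict(index)
```

In this toy table, CFGLOC101563 belongs to two groups, and rel-humidity_CPER000040 is both a group in its own right and a member of pressure-air_CPER000035.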

A group is always required in order to publish data from Pachyderm, even if the group contains a single named location as a member. This is to create consistency in the product pipelines and because groups have properties attached to them that are used in the publication process.

Context

Context is a free-form string that can be attached as a property to named locations and/or QC thresholds.

Context on named locations

When used for named locations, context typically describes the environment or application in which the measurement is used and that is not otherwise described by its source type, group name, or group properties. Most often, context will enable differentiating among multiple locations of the same source type within the same group. For example, the Photosynthetically Active Radiation (PAR) product instance at the top of the tower at CPER (and every other tower) includes both incoming and reflected radiation measured by two sensors of the same source type (pqs1), one that faces up and one that faces down. The group for this product instance (e.g. par_CPER000040 in the diagram above) will include both named locations. Why do we need context here? Take a look at the repository structure for data in this group collected on January 1, 2020.

/2020                                                        <-- year
   /01                                                       <-- month
      /01                                                    <-- day
         /par_CPER000040                                     <-- group for the PAR data product instance at CPER ML4
            /group                                           <-- group metadata directory
               /CFGLOC101563.json                            <-- group metadata file for each member of the group
               /CFGLOC101564.json                            <-- group metadata file for each member of the group
            /pqs1                                            <-- source type of the PAR sensor
               /CFGLOC101563                                 <-- named location of the upward-facing PAR sensor
                  /data                                      <-- subdirectory for sensor data
                     pqs1_CFGLOC101563_2020_01_01.parquet    <-- sensor data file
                  /location                                  <-- subdirectory for location metadata
                     CFGLOC101563.json                       <-- location metadata file
               /CFGLOC101564                                 <-- named location of the downward-facing PAR sensor
                  /data                                      <-- subdirectory for sensor data
                     pqs1_CFGLOC101564_2020_01_01.parquet    <-- sensor data file
                  /location                                  <-- subdirectory for location metadata
                     CFGLOC101564.json                       <-- location metadata file

Both named locations are of the same source type, so they cannot be differentiated that way. If one were to open the location file nested under each named location, or any of the group metadata files for each named location, one would see that they also have the same HOR and VER location indices because they are both on the tower (HOR=000) at measurement level 4 (VER=040), and those are also the HOR and VER indices for the group. In this case, we need context to tell these named locations apart, which is also stored in each location file (see next section). The context 'upward-facing' is assigned to the upward facing location and the context 'downward-facing' is assigned to the downward facing location.

If necessary, multiple contexts can be assigned to the same named location. Avoid assigning a context that overlaps things already described by the group (i.e. data product) or other properties on the named location, such as location indices (HOR & VER).

Assigning or removing contexts is done in the Named Location Manager for each specific named location, or can be done in bulk by spreadsheet upload. If added/removed in the UI, be sure to scroll to the bottom of the page and "Submit All Changes". You can also create or delete contexts in the Named Location Manager, in the same UI where you assign or remove a context for a named location. In order to delete a context, it must first be removed from all named locations and not be used in any thresholds. Use the Search feature in the Named Location Manager to find the locations matching a particular context. It is okay if a context is only used for thresholds and not named locations (read on).

Context on thresholds

When used in QC thresholds, context is used to differentiate QC thresholds for data products that share the same term name - more on this in Thresholds.

Note that the contexts used for named locations do not need to be the same as the contexts used for thresholds.

Where are group and context stored?

The source of truth for groups and contexts is the PDR database, and they may be viewed and edited in the online SOM portal. These are loaded or updated in Pachyderm on a daily basis.

Group information is stored in a JSON file for each group member, loaded via the pipeline [GROUP_PREFIX]_groups_loader and applied to existing pipelines using the [SHORT-NAME]_group_path module. After execution of these modules, group information will accompany data in downstream pipelines in the group folder. A file in this folder lists the groups that the member is a part of, along with the associated metadata, with an entry that looks something like:

     ...
     "name": "CFGLOC101563",
     "group": "par_CPER000040",
     "active_periods": [
          {
               "start_date": "2013-09-12T00:00:00Z",
               "end_date": "2016-01-01T00:00:00Z"
          }
     ],
     "HOR": "000",
     "VER": "040"
     ...

In the example above, name is the name of the member (either a named location or a group) and group is the name of the group that it is a member of. The metadata below these fields are specific to the group (and can differ from similar properties on named locations).
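As an illustration of how such an entry might be consumed, the following Python sketch checks whether a membership is active at a given time. It is not taken from the pipeline code: the helper names are hypothetical, the entry is assumed to be already extracted from the file (the surrounding structure is elided above), a null date is assumed to mean an open-ended period, and the end date is assumed to be exclusive.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp like 2013-09-12T00:00:00Z; None stays None."""
    if stamp is None:
        return None
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_active(entry, when):
    """True if `when` falls inside any active period of the entry.

    Assumptions: a missing/null bound is open-ended, end dates are exclusive.
    """
    for period in entry.get("active_periods", []):
        start = parse_utc(period.get("start_date"))
        end = parse_utc(period.get("end_date"))
        if (start is None or start <= when) and (end is None or when < end):
            return True
    return False


# The entry from the example above, as a Python dict.
entry = {
    "name": "CFGLOC101563",
    "group": "par_CPER000040",
    "active_periods": [
        {"start_date": "2013-09-12T00:00:00Z", "end_date": "2016-01-01T00:00:00Z"}
    ],
    "HOR": "000",
    "VER": "040",
}
```

Under these assumptions, is_active(entry, datetime(2015, 6, 1, tzinfo=timezone.utc)) is true, while any time before 2013-09-12 or from 2016-01-01 onward is not.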

The [SHORT-NAME]_group_path module inserts the group name into the path structure, which makes it easy to apply the filter_joiner module to bring together the data from all the members of each group for further processing. Read how groups change the repository structure in Section 1.0 Pipeline & repo structure, pipeline naming, terms. Also see the Wiki section on the filter-joiner module to see how data from the group members are brought together after the group name is inserted into the path.

Context is a property of named locations, and can be found in two different spots:

The location JSON file for a sensor in the [SOURCE_TYPE]_location_asset repo and accompanying data in downstream pipelines in the location folder.
The location JSON file for a particular named location, as found in the [SOURCE_TYPE]_location_loader repo and accompanying data in downstream pipelines in the location folder.

These files list the context(s) for the associated named location with an entry that looks something like:

   ...
   "name": "CFGLOC101563",
   "site": "CPER",
   "context": [
        "upward-facing"
   ],
   "active_periods": [
        {
             "start_date": null,
             "end_date": null
        }
   ],
   "HOR": "000",
   "VER": "040",
   ...

In the example above, name is the name of the named location and context lists the contexts associated with the named location. The metadata below these fields are specific to the named location.

Unlike groups, context information remains only in the location file and is not inserted into the path structure. If it is necessary to split a repository based on context, use the context_filter module. Otherwise, your code may access the location file directly and determine the context(s).
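A minimal Python sketch of reading contexts directly from location files follows. It assumes the directory layout shown earlier ([SOURCE_TYPE]/[NAMED_LOCATION]/location/[NAMED_LOCATION].json). Because the excerpt above elides the surrounding file structure, the sketch searches the parsed JSON for the first "context" list rather than assuming where it is nested. Both function names are hypothetical, and this is not the context_filter module.

```python
import json
from pathlib import Path


def find_contexts(node):
    """Depth-first search for the first "context" list in parsed JSON.

    Returns None when no such list exists anywhere in the structure.
    """
    if isinstance(node, dict):
        if isinstance(node.get("context"), list):
            return list(node["context"])
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_contexts(child)
        if found is not None:
            return found
    return None


def contexts_by_location(source_type_dir):
    """Map each named location under a source-type directory to its contexts."""
    result = {}
    for location_file in Path(source_type_dir).glob("*/location/*.json"):
        with location_file.open() as handle:
            contexts = find_contexts(json.load(handle))
        result[location_file.stem] = contexts or []
    return result
```

Given the pqs1 directory of the PAR group, the upward-facing sensor could then be selected by keeping the named locations whose context list contains 'upward-facing'.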