1.4 Populating properties of named locations in Pachyderm - NEONScience/NEON-IS-data-processing GitHub Wiki

New named location properties used in Pachyderm work (context, groups, data rate, active periods) must first be populated in the NEON database before they are accessible in Pachyderm. Much work has been done to develop UIs to view and edit these properties, and some of this work (such as active periods) is already in the PROD version of the database and SOM portal. This section may be out of date with recent progress, so be sure to inquire about the current status. We always populate data on INT before pushing to PROD, so all instructions below refer to the INT database and INT version of the SOM portal.

For all named location properties below, the easiest and most robust approach is to work out a formula for determining them for the product you are working on by mapping them to existing L0 data product IDs and, potentially, HOR.VER locations as well. This allows scripted population of the properties in the database. Once we move to the new system, these mappings won't be needed, and different methods of populating properties for new locations and/or products will be developed.

Create mappings

Follow these steps to create mappings and populate location properties:

  1. Groups: Check out the Wiki page on Groups-and-Context to understand their functions. A Groups UI is available on INT for viewing and manually editing groups.

Let's look at a relatively simple example. The PAR product (DP1.00024.001) is produced on every tower level at terrestrial sites and on the met station at aquatic sites. Every product instance of DP1.00024.001 uses data from an upward-facing PAR sensor (incoming PAR) with an L0 data product ID of DP0.00024.001. This L0 data product ID is not used for any sensor location that is not used in DP1.00024.001, so it is easy to map locations for the DP0 product to groups. However, it is not a one-to-one mapping between each DP0.00024.001 named location and each group, because the PAR product instance at the tower top also includes outgoing PAR from an additional downward-facing pqs1 sensor installed at a different named location. This downward-facing sensor location is at the same HOR.VER location index as the upward-facing sensor, so each combination of DP0 ID, site, and HOR.VER defines our mapping to PAR groups:

Groups for PAR data product (DP1.00024.001)
* Group prefix: par
* Location descriptor: SITEHORVER
* Full group name format: par_SITEHORVER
* L0 DP ID: DP0.00024.001
* Locations: Each HOR.VER at the same SITE gets a group (incoming and outgoing sensors are at the same HOR.VER, but different named locations)

See the end of this section for how to submit this info.
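The group mapping above can be sketched in code. This is a hypothetical illustration only: the field names (`site`, `hor`, `ver`, `l0_dp_id`, `named_location`) and the example CFGLOC names are assumptions, not the actual NEON database schema.

```python
# Hypothetical sketch: derive PAR group names for DP0.00024.001 locations.
# Record field names and CFGLOC identifiers are illustrative assumptions.

def par_group_name(site: str, hor: str, ver: str) -> str:
    """Build the full group name in the format par_SITEHORVER."""
    return f"par_{site}{hor}{ver}"

def map_locations_to_groups(locations):
    """Assign every DP0.00024.001 location with the same SITE and HOR.VER to one group."""
    groups = {}
    for loc in locations:
        if loc["l0_dp_id"] != "DP0.00024.001":
            continue  # only locations used in DP1.00024.001
        name = par_group_name(loc["site"], loc["hor"], loc["ver"])
        groups.setdefault(name, []).append(loc["named_location"])
    return groups

locations = [
    {"named_location": "CFGLOC101", "site": "CPER", "hor": "000", "ver": "040",
     "l0_dp_id": "DP0.00024.001"},  # upward-facing sensor at the tower top
    {"named_location": "CFGLOC102", "site": "CPER", "hor": "000", "ver": "040",
     "l0_dp_id": "DP0.00024.001"},  # downward-facing sensor, same HOR.VER
]
print(map_locations_to_groups(locations))
# both named locations land in the single group "par_CPER000040"
```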

  1. Contexts: Many products will not require context to be populated on named locations. Context is necessary only when two sensor locations of the same source type are included in the same group and need to be differentiated in processing, and this differentiation is not possible with other properties on the named location (such as site, HOR, VER, data rate). Continuing the example above, context is required for the PAR product locations because the group at the tower top includes two pqs1 sensors at the same site and HOR.VER (and data rate).

Similar to groups, create a mapping between the L0 data product ID and context. In this case, the full L0 data product ID for DP0.00024.001 has two separate term IDs for incoming PAR (01320 - inPAR) and outgoing PAR (01321 - outPAR). So the mapping is as follows:

Contexts for all DP0.00024.001 locations

context: upward-facing
* L0 term: 01320 - inPAR

context: downward-facing
* L0 term: 01321 - outPAR
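The term-to-context mapping above could be expressed as a simple lookup. A minimal sketch, assuming the mapping is keyed by L0 term ID (the function name and input structure are illustrative, not an existing API):

```python
# Sketch of the L0-term-to-context mapping for DP0.00024.001.
TERM_TO_CONTEXT = {
    "01320": "upward-facing",    # inPAR
    "01321": "downward-facing",  # outPAR
}

def contexts_for_location(l0_terms):
    """Return the contexts to attach to a named location, given its L0 term IDs."""
    return [TERM_TO_CONTEXT[t] for t in l0_terms if t in TERM_TO_CONTEXT]

print(contexts_for_location(["01320"]))  # ['upward-facing']
```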

It is smart to consider whether the context(s) you create can apply to other products, especially if they use the same source type. In our case, the pqs1 sensor is also used in DP1.20042.001 - PAR at water surface, which also has an upward-facing PAR sensor and a downward-facing PAR sensor. The L0 data product ID for this product is DP0.20042.001 and uses the same inPAR and outPAR terms to distinguish between the two sensor locations. Thus, you could also populate contexts for those named locations and save yourself some time later. If that product was created first, you would want to check that the contexts you are considering are consistent (and in fact might already be populated). The combined mapping is:

Contexts for all DP0.00024.001 and DP0.20042.001 locations

context: upward-facing
* L0 term: 01320 - inPAR

context: downward-facing
* L0 term: 01321 - outPAR

See the end of this section for how to submit this info.

  1. Data rate: Populating data rates of named locations allows certain modules, such as regularization, to automatically apply the data rate specific to each location, rather than specifying a single data rate to use for all locations of the product or source type. All data rates are specified in Hz. Data rate information should exist in each product's ATBD. Continuing the example above, all sensor locations for the PAR (and PAR at water surface) products produce L0 data at 1 Hz, so the formula for populating the data rate is easy:
Data rates for all DP0.00024.001 and DP0.20042.001 locations

Data rate: 1

A more complicated example is water quality (exo source types: exo2, exoconductivity, exofdom, etc.; L0 DP ID DP0.20005.001). The buoy locations (HOR=103) produce L0 data every 5 minutes, whereas the stream locations (any other HOR) produce L0 data every minute. So the formula for determining the expected data rate for each water quality location is:

Data rates for all DP0.20005.001 locations

Data rate: 0.01666666667 Hz 
* Locations: Any HOR other than HOR=103

Data rate: 0.00333333333333 Hz 
* Locations: HOR=103
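This HOR-to-rate formula is straightforward to script. A minimal sketch (the function name is illustrative; the rate strings follow the values above, i.e. one sample per 300 s for buoys and one per 60 s otherwise):

```python
# Sketch of the data-rate formula for DP0.20005.001 (water quality) locations.
# Buoy locations (HOR=103) report every 5 minutes; all others every minute.
def wq_data_rate(hor: str) -> str:
    # Rates are stored as strings to preserve precision for non-integer values.
    return "0.0033333333333" if hor == "103" else "0.0166666666667"

print(wq_data_rate("103"))  # buoy: ~1/300 Hz
print(wq_data_rate("101"))  # stream: ~1/60 Hz
```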

The data rate in the database is actually stored as a string, to account for non-integer data rates. For these cases, be sure to include at least 10 digits after the decimal point (unless they are zeros) so that the regularization module creates the timestamps correctly. If the timestamps do not come out as expected, increase the number of digits after the decimal.
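To see why the number of digits matters, consider a simplified stand-in for timestamp generation (this is an illustration, not the actual regularization module): the sample period is derived as 1/rate, so a truncated rate string shifts every timestamp by a small amount that accumulates over a day of data.

```python
# Illustration: a low-precision rate string skews regularized timestamps.
def timestamps(rate_str: str, n: int):
    """Generate n sample times (in seconds) at the period implied by rate_str."""
    period = 1.0 / float(rate_str)  # seconds between samples
    return [i * period for i in range(n)]

coarse = timestamps("0.0167", 1000)           # too few digits: period ~59.88 s
precise = timestamps("0.0166666666667", 1000) # period ~60.0 s
print(precise[999] - coarse[999])  # drift of roughly two minutes after 1000 samples
```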

See the end of this section for how to submit this info.

  1. Active periods: Active periods indicate the time period(s) that a particular named location should have data (regardless of whether a sensor was installed there or not). Active periods control the dates for each location during which data will be processed and published to the data portal. Active periods have already been applied in PROD and have been integrated into the existing transition system, so it is very likely you do not need to populate them. If you do, talk with a developer to figure out the best path.

Submit mappings to populate location properties on INT

Once you have generated the information in steps 1-3, create a single story in the CI Jira, under the epic for the data product, containing the mappings you generated. Tag it with the Components IS, Pachyderm, and DSP. Give a developer a heads-up that you created the story; they might be able to fit it into the current sprint. If not, it will be prioritized at the next Data Services Prioritization meeting.