4.1 Validating L1 output - NEONScience/NEON-IS-data-processing GitHub Wiki

After you've generated L1 (or higher) output from your pipeline, you'll want to compare the Pachyderm output to that of the existing transition system. Tools for doing so are fairly limited right now but will continue to be developed. There is already a nice tool that works for most situations: it compares all data found in the final Pachyderm output repository to the existing transition output.

Setup

If you haven't already, clone the omnibus-tools Git repo and start a new RStudio project for it on the som server. Once you've cloned the omnibus-tools repo, source the functions in the /pipelines/r_utility_tools directory, specifically def.download.data.R and def.data.comp.R. You'll need to run these from the som server.

Running the comparison

The function you'll run is def.download.data (in def.download.data.R). Read the instructions in the header and adjust the input parameters to match your product. You'll need to provide a directory (on the som server) for the code to place the results in, so make sure to create one before running the code. You'll also need to set the environment variable holding the path to the service account credentials file in order to access current transition output in GCP. If you're on the som server, here's how to do this (in R):

Sys.setenv(GCP_PROD_SA_PACHYDERM_VERIFICATION_JSONKEY="/scratch/pachyderm/serviceAccounts/pachyderm-verification.json")
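As noted above, the output directory must exist before you run the comparison. A minimal shell sketch of that setup follows; the OUTDIR path is an example (substitute your own location on the som server, e.g. /scratch/pfs/compare). If you prefer, the credentials variable can also be exported in the shell before launching R, since R sessions inherit the environment of the shell that starts them:

```shell
# Create the results directory before running def.download.data
# (example path; use your own location on the som server)
OUTDIR="$HOME/pfs_compare"
mkdir -p "$OUTDIR"

# Alternative to Sys.setenv(): export the variable before starting R,
# and the R session will inherit it
export GCP_PROD_SA_PACHYDERM_VERIFICATION_JSONKEY="/scratch/pachyderm/serviceAccounts/pachyderm-verification.json"
```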

If the parQuantumLine pipeline is up on pachyderm-nonprod, the following is a working example, provided you followed the instructions above and adjust the text in angle brackets to match your directory setup on the som server:

def.download.data(
    pachydermrepo = "parQuantumLine_level1_group_and_consolidate",
    repostructure = "consolidated-group-focus",
    dpid = "DP1.00066.001",
    startdate = "2020-01-02",
    enddate = "2020-01-03",
    dirList = c("stats", "quality_metrics"),
    outputfilepath = "/scratch/pfs/compare",
    avrolibpath = "<path to your NEON-IS-data-processing Git repo>/pack/NEONprocIS.base/ravro.so",
    RmvDataDnld = TRUE
)

If the function executes successfully, check the output folder you specified. It will contain a folder named for the pachydermrepo input above, and within that folder a file for each Pachyderm file that was evaluated, with _Output.txt appended to the file name. Each file contains a summary of the differences between the transition output and the Pachyderm output. There will also be an overall summary table for all evaluated files in finaloutput.csv.
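Given the layout described above, you can spot-check the results from the shell. OUTDIR and the repo name below are examples matching the earlier def.download.data call; adjust both to your own run:

```shell
# Comparison results land in <outputfilepath>/<pachydermrepo>
OUTDIR="$HOME/pfs_compare"                           # stand-in for /scratch/pfs/compare
REPO="parQuantumLine_level1_group_and_consolidate"   # the pachydermrepo input
mkdir -p "$OUTDIR/$REPO"                             # already present after a successful run

# Per-file comparison summaries end in _Output.txt;
# the overall summary table is finaloutput.csv
find "$OUTDIR/$REPO" -name '*_Output.txt'
find "$OUTDIR" -name 'finaloutput.csv'
```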

The finaloutput.csv is a good place to start to see which Pachyderm files, if any, differ from the transition output. Be sure you understand any differences, and check at least a few of the individual _Output.txt comparisons to ensure that all terms expected in the output were present and compared. A major caveat: the code can only compare the terms that overlap between the transition output and the Pachyderm output, though in most cases we aren't changing L1 term names. Since there are often multiple Pachyderm output files for a single product instance (e.g. stats and quality metrics), you'll often see a set of variables from the avro data that were not compared. This is fine so long as they are compared in another Pachyderm file, or there is a specific reason those variables are no longer expected in the output.

If you see differences or issues that you want to investigate further, you can explore the actual data files (Pachyderm files and transition output files) by setting the input parameter RmvDataDnld = FALSE and running the comparison code again. This will download and retain the data files in the outputfilepath, which you can then open in RStudio (avro files can be read with NEONprocIS.base::def.read.avro.deve - use absolute paths for all inputs to this function; parquet files can be read with NEONprocIS.base::def.read.parq). Note that subsequent runs of the code append the comparison output to existing files, so delete the previous output before rerunning.
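Because reruns append to existing comparison files, it helps to clear the previous results first. A small sketch, again using example paths (match OUTDIR and REPO to your outputfilepath and pachydermrepo inputs):

```shell
# Remove prior comparison output before rerunning;
# otherwise new results are appended to the old files
OUTDIR="$HOME/pfs_compare"                           # your outputfilepath
REPO="parQuantumLine_level1_group_and_consolidate"   # your pachydermrepo
rm -rf "${OUTDIR:?}/${REPO:?}"    # :? guards against an empty variable expanding to /
mkdir -p "$OUTDIR/$REPO"          # recreate the empty results directory
```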

Troubleshooting

  • No output. You'll get errors similar to this:
trying URL 'https://s3.data.neonscience.org/prod-is-transition-output/provisional/dpid=DP1.00041.001/ms=2019-01/site=CPER/CPER_L0_to_L1_Soil_Temperature_DP1.00041.001__2019-01-02.avro'
Error in download.file(avrourl, file.path(outputdir, avrofilename), mode = "wb",  : 
  cannot open URL 'https://s3.data.neonscience.org/prod-is-transition-output/provisional/dpid=DP1.00041.001/ms=2019-01/site=CPER/CPER_L0_to_L1_Soil_Temperature_DP1.00041.001__2019-01-02.avro'
In addition: Warning message:
In download.file(avrourl, file.path(outputdir, avrofilename), mode = "wb",  :
  cannot open URL 'https://s3.data.neonscience.org/prod-is-transition-output/provisional/dpid=DP1.00041.001/ms=2019-01/site=CPER/CPER_L0_to_L1_Soil_Temperature_DP1.00041.001__2019-01-02.avro': HTTP status was '404 Not Found'

Solution: The problem is likely that the naming convention for the path to the transition output bucket is different for this product. To determine the right naming convention, go to the GCS bucket for transition output and try browsing to your product. Once you reach the data you think you should be comparing, compare that address to the one in the error; it's probably different. Sometimes there are multiple transition names for the same L1 data product (e.g. "L0_to_L1_Barometric_Pressure_REVB_Tower" and "L0_to_L1_Barometric_Pressure_REVB_Met_Station"). There is a list of transition names for each product at the beginning of def.download.data, which the code cycles through when looking for existing transition output. Add all known transition names for your product.

If you end up editing the comparison code to get it working for your product, commit your changes so that everyone has access to them. If you get stuck and irritated that it's not working, seek help.