3.0 Updating modules or creating new modules - NEONScience/NEON-IS-data-processing GitHub Wiki

Updating a NEONprocIS package

As part of creating new products, you may need to add to or modify the functions in the NEONprocIS family of packages (NEONprocIS.base, NEONprocIS.cal, etc.), or create your own package! If so, to test your work you'll need to (re)package and (re)install the affected package in RStudio. There's a handy script to do just this in /utilities/R_coding/flow.pack.is.proc.R. Edit the working directory (DirWrk00) to point to the root of your local NEON-IS-data-processing repo, select the package to update in the namePack variable, choose whether to run the unit tests for the package in the RunTest variable (strongly recommended if you update any existing functions), and run the code. This will repackage and reload the package in your local library. If you chose to run the unit tests, you will also see the outcome of the testing. If there are any failures, you may need to adjust your code or update the unit tests (see the Unit Testing section below). Once you have verified the output and are ready to push your changes into production (i.e. create a pull request to master), be sure to increment the package version in the DESCRIPTION file in the root directory of the package.
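The version bump in DESCRIPTION can be done by hand or scripted. Here is a minimal sketch using sed, assuming the standard "Version: x.y.z" field format of an R DESCRIPTION file; the file path and version numbers are hypothetical:

```shell
# Sketch: bump the Version field of a package DESCRIPTION file.
# The path and version strings below are hypothetical; substitute your own.
bump_description_version() {
  # $1 = path to DESCRIPTION, $2 = new version string
  sed -i "s/^Version:.*/Version: $2/" "$1"
}

# Toy example: create a DESCRIPTION and bump its version
printf 'Package: NEONprocIS.base\nVersion: 0.0.14\n' > /tmp/DESCRIPTION
bump_description_version /tmp/DESCRIPTION 0.0.15
grep '^Version:' /tmp/DESCRIPTION   # Version: 0.0.15
```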

See the section on Unit Testing below for creating or updating unit tests for the package.

Add dependency management

To ensure that your code is accompanied by the same package versions you created it with, we use the renv package for R to record the packages and their versions used in the code and to build them into a docker image. When you finish creating/editing the code for a module or NEONprocIS package, run the /utilities/R_coding/renv.init.rstr.R script. All you need to do before running it is to edit dirWork to point to the parent folder of the package or module (flow script) you created/edited. Running the code will create or edit the renv.lock file in the parent directory, which is the catalog of all the dependencies.
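For reference, renv.lock is a JSON catalog of dependencies. A trimmed, hypothetical excerpt might look like the following; the R version, package names, and versions will differ for your module:

```json
{
  "R": {
    "Version": "4.0.3",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cran.rstudio.com" }
    ]
  },
  "Packages": {
    "jsonlite": {
      "Package": "jsonlite",
      "Version": "1.7.2",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```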

Creating or updating a docker image

In order to deploy your code in Pachyderm it needs to be packaged into a docker image. The recipe for creating the docker image is specified in the dockerfile, which should be located in the main folder of your module or package. The basic system dependencies for all R modules in this Git repo are included in the neon-is-base-r image, which includes the NEONprocIS.base package. This means that if you build your image on top of the neon-is-base-r image, the NEONprocIS.base package and all its dependencies are already installed. If you're curious, this dockerfile is located at /pack/NEONprocIS.base/. Every other NEONprocIS.___ package builds off of the neon-is-base-r image, which allows their (and your) dockerfiles to be much simpler.

If you are creating a dockerfile for the first time, work off of the dockerfile for an existing package or module already in Git (except NEONprocIS.base). Chances are you'll just have to replace a few obvious components to make it work for your package or module.

To build the docker image for your package/module with dependencies in the renv.lock file, add the following to your dockerfile (which should be in the same directory) before the lines that copy in your module workflow script or install your R package (see the dockerfile for any R-based package or module in this Git repo for a full example).

COPY ./renv.lock /renv.lock
RUN R -e 'renv::restore(lockfile="/renv.lock")'

The obvious components to replace include:

  • the appropriate version tag for neon-is-base-r (e.g. FROM quay.io/battelleecology/neon-is-base-r:v0.0.64)
  • COPY statements for all new functions from your work, e.g.
COPY ./flow.tsdl.comb.splt.R /flow.tsdl.comb.splt.R
COPY ./wrap.file.comb.tsdl.splt.R /wrap.file.comb.tsdl.splt.R
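Putting the pieces together, a minimal dockerfile for a hypothetical module might look like the following. The base-image tag and script names are placeholders; copy the real values from an existing module's dockerfile:

```dockerfile
# Hypothetical example; adapt names and versions to your module
FROM quay.io/battelleecology/neon-is-base-r:v0.0.64

# Restore the exact package versions recorded by renv
COPY ./renv.lock /renv.lock
RUN R -e 'renv::restore(lockfile="/renv.lock")'

# Copy in the workflow script and its wrapper function
COPY ./flow.tsdl.comb.splt.R /flow.tsdl.comb.splt.R
COPY ./wrap.file.comb.tsdl.splt.R /wrap.file.comb.tsdl.splt.R
```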

Once you create (modify) a package or module, you need to build (rebuild) the docker image, tag it with a new version, and push it up to quay.io. On the development server, first log in to quay.io with your credentials (you typically only need to do this once):

$ docker login quay.io

Then build the image:

docker build --no-cache -t <image name> </path/to/folder/with/dockerfile/in/it>

Be sure to follow naming conventions for images, e.g. neon-is-name-of-modl-r.

Now tag the image with a new version. If this is the first time the image is built, tag it with v0.0.1. If it's a new version of an existing image, increment the current tag found on quay.io by 1 (e.g. increment v0.0.1 to v0.0.2):

docker tag <image name> quay.io/battelleecology/<image name>:<tag>

Note - you MUST increment the tag if the current tag has been deployed in Pachyderm. If you overwrite the current tag with a different/updated image, Pachyderm will always use the first image it downloaded with that tag (to retain provenance).
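Tag arithmetic is easy to get wrong by hand; the sketch below bumps the patch number of a tag (the image name in the final comment is hypothetical):

```shell
# Sketch: bump the patch number of an image tag, e.g. v0.0.42 -> v0.0.43.
bump_tag() {
  # Split on dots and increment the final field
  echo "$1" | awk -F. '{ $NF = $NF + 1; print }' OFS=.
}

NEW_TAG=$(bump_tag v0.0.42)
echo "$NEW_TAG"   # v0.0.43
# Then, with a hypothetical image name:
# docker tag neon-is-cal-conv-r quay.io/battelleecology/neon-is-cal-conv-r:$NEW_TAG
```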

Finally, push the original or new tag to quay.io:

docker push quay.io/battelleecology/<image name>:<tag>

If you receive the error "unauthorized: access to the requested resource is not authorized", it's likely that you need write access to quay.io and/or need a new quay.io repository created. If the latter, request a new quay.io/battelleecology repo in ServiceNow.

After you've pushed your new image tag to quay.io, update all the pipeline specs and other dockerfiles that use this image (even if you didn't create them). The Find In Files utility in Rstudio is very helpful to find and update them all manually. There is a handy script to batch update an image version across all pipeline specs, see section below.

A final note about updating images: if you've just updated the image tag for a NEONprocIS package (e.g. NEONprocIS.cal)... wait! You aren't done updating docker images. No module (e.g. the calibration conversion module) is deployed with the neon-is-base-r image or the image of any NEONprocIS package directly. Images are hierarchical, so that code changes at the end points (i.e. specific modules) are isolated from one another, while code that affects multiple modules is distributed accordingly. Every module has its very own docker image, so if you only update the code for the calibration conversion module you only need to update that image (neon-is-cal-conv-r). But if you update the image for the NEONprocIS.cal package, you also need to update the images for the calibration filter module and the calibration conversion module, since both depend on the NEONprocIS.cal package. You'll know which downstream images you need to update from the result of your Find In Files search mentioned above. Updating downstream images is as easy as repeating the docker commands above for all downstream images/modules, being sure to update their image tags in the pipeline specs and dockerfiles that reference them.
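One way to keep the downstream rebuilds organized is a simple loop. The sketch below only echoes the commands (a dry run); the module names, paths, and tag are hypothetical:

```shell
# Dry run: print the rebuild sequence for each downstream module image.
# Module names, paths, and the tag are hypothetical; drop the echoes to run for real.
rebuild_cmds() {
  # $1 = image tag; remaining args = module image names
  TAG=$1; shift
  for MODULE in "$@"; do
    echo "docker build --no-cache -t $MODULE ./flow/$MODULE"
    echo "docker tag $MODULE quay.io/battelleecology/$MODULE:$TAG"
    echo "docker push quay.io/battelleecology/$MODULE:$TAG"
  done
}

rebuild_cmds v0.0.2 neon-is-cal-filt-r neon-is-cal-conv-r
```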

Batch update an image version in all pipelines

If you've updated an image version that is used in a lot of pipelines, use this 2-stage process to a) update all pipeline specs in the Git repo with the new image, and b) update all the corresponding pipelines in pachyderm.

  1. Update all pipeline specs in the Git repo with the new image

In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:

cd ~/R/NEON-IS-data-processing/utilities

From there, run the following:

python3 -B -m pipeline.image_update  --spec_path=<path to the folder where the pipeline specs are (all child folders will be searched)> --old_image=<full name of old image> --new_image=<full name of new image> 

For example:

python3 -B -m pipeline.image_update --spec_path=/home/NEON/csturtevant/R/NEON-IS-data-processing/pipe --old_image=quay.io/battelleecology/neon-is-cal-conv-r:v0.0.42 --new_image=quay.io/battelleecology/neon-is-cal-conv-r:v0.0.43

  2. Update the corresponding pipelines in pachyderm with the new image:

python3 -B -m pipeline.pipeline_update --spec_path=<path to the folder where the pipeline specs are (all child folders will be searched)> --image=<full name of new image (must match what's in the pipeline specs)> --reprocess=<(optional) true to reprocess the pipelines>

For example:

python3 -B -m pipeline.pipeline_update --spec_path=/home/NEON/csturtevant/R/NEON-IS-data-processing/pipe/ --image=quay.io/battelleecology/neon-is-cal-conv-r:v0.0.43 --reprocess=true

Note that the paths you put into the arguments must be absolute paths (don't use e.g. ~/R/...). For updating the pipelines in pachyderm with the new image, you probably don't need to reprocess them (using --reprocess=true) unless you want to test that the new image works.
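To get the absolute form of a home-relative path, realpath can help; a small sketch (the repo location is an assumption about your checkout):

```shell
# Expand a home-relative path to the absolute path the scripts require.
# realpath -m resolves the path even if it does not exist yet (GNU coreutils).
SPEC_PATH=$(realpath -m ~/R/NEON-IS-data-processing/pipe)
echo "$SPEC_PATH"
```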

If you are working on the som development server, all the python packages needed to run the script are already installed. If not, you'll need python3 and the ruamel.yaml package installed.

Unit Testing

Any time you create or edit a function, unit tests for that function need to be created or run to ensure the output conforms to expectations. CI (Mija or Visala) will help you create unit tests, or make updates to unit tests if the new output is expected to differ from previous output.

A tutorial for creating and running unit tests in R is in the CI Wiki. Here's an example of how to run all the unit tests for the NEONprocIS.base package and view the test coverage from the RStudio command line:

library(covr)
setwd("~/R/NEON-IS-data-processing-homeDir/pack/NEONprocIS.base/tests/testthat")
devtools::test(pkg="~/R/NEON-IS-data-processing-homeDir/pack/NEONprocIS.base")
cov <- covr::package_coverage()
report(cov)

Running units tests and checking test coverage for any of the workflow wrappers is very similar. Here is an example for the wrap.loc.grp.asgn function which is the wrapper for the location or group assignment module:

library(covr)
setwd("~/R/NEON-IS-data-processing-homeDir/flow/tests/testthat")
cov <- covr::file_coverage(source_files="~/R/NEON-IS-data-processing-homeDir/flow/flow.loc.grp.asgn/wrap.loc.grp.asgn.R",
                           test_files="~/R/NEON-IS-data-processing-homeDir/flow/tests/testthat/test-wrap-loc-grp-asgn.R")
report(cov)

Note that this latter option runs the test for a single wrapper, whereas the former example runs the tests for an entire package.

A tutorial for creating unit tests in Python with the pytest package is here. To run all unit tests for python modules located in the modules directory of this git repo, simply execute the run_unit_tests.sh script from the command line from within that directory. For example:

cd ~/R/NEON-IS-data-processing/modules
./run_unit_tests.sh

To run the tests for a particular module, for example the group_test module:

cd ~/R/NEON-IS-data-processing/modules
python3
import pytest
pytest.main(["-x","group_test"])