4.0 Release and deployment process - NEONScience/NEON-IS-data-processing GitHub Wiki

Pachyderm environments

There are three Pachyderm environments:

pachyderm-dev - connected to the INT PDR and portal databases, and nonprod cloud buckets. Most development happens here.
pachyderm-cert - connected to the CERT PDR and portal databases, and nonprod cloud buckets. Full-scale testing of products heading to production.
pachyderm-prod - connected to the PROD PDR and portal databases and prod cloud buckets. Production processing and publication.

Preliminary checklist for moving to production

Validate the output
Validate the output for a cross-section of sites and dates in pachyderm-dev. If the data product is already being produced in the existing Airflow-based transition system, verify that the output from the new pipeline is consistent with the Airflow transition output, or that any differences are expected. Follow the instructions in Wiki section 4.1 Validating L1 output.
Publish to INT portal
If not done already, stand up the publication pipelines in pachyderm-dev, ensuring proper configurations for the product-site-months to be published. Once the monthly cron has run and the publication egress pipeline has completed with at least one month published to INT portal, request a refresh of the INT portal cache. When refreshed, download the LATEST data. Inspect the download packages for the expected output, and confirm that the publication timestamps in the download packages match those in Pachyderm (so you know they came from Pachyderm).
Note that the top level of the publication package at the product level will not include documents typically downloaded with the product (e.g. ATBD, C3, etc.). However, all files within each product-site-month folder should completely match what you would expect to download from the NEON Data Portal.
Operational mode
When the test publication to INT portal looks good, place the DAG in operational mode, which means that the nightly & monthly crons will automatically load, process, and publish new data for all sites. This involves:
a) Set the START_DATE in the [SOURCE_TYPE]_cron_daily_and_date_control pipeline to the 1st of the most recent full month minus the number of days required for a full pad for the QAQC pipeline.
b) Comment out the END_DATE in the [SOURCE_TYPE]_cron_daily_and_date_control pipeline.
c) Add all relevant sites to the site-list.json file in the [SOURCE_TYPE]_site_list repo.
d) Remove any location restrictions from the [SOURCE_TYPE]_fill_date_gaps pipeline.
e) Set the START_MONTH in the [PRODUCT]_cron_monthly_and_pub_control pipeline to the most recent full month.
f) Comment out the END_MONTH in the [PRODUCT]_cron_monthly_and_pub_control pipeline.
g) Set SITES: "all" in the [PRODUCT]_pub_egress_and_publish pipeline.
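
The date arithmetic in steps a) and e) can be sketched in Python. This is a minimal illustration, not part of the NEON tooling: the function names are hypothetical, pad_days stands in for whatever pad the QAQC pipeline for your source type requires, and the YYYY-MM string format for START_MONTH is an assumption that should be checked against the existing pipeline specs.

```python
from datetime import date, timedelta


def most_recent_full_month(today: date) -> date:
    """Return the first day of the most recent complete calendar month."""
    first_of_this_month = today.replace(day=1)
    last_of_previous_month = first_of_this_month - timedelta(days=1)
    return last_of_previous_month.replace(day=1)


def operational_start_date(today: date, pad_days: int) -> date:
    """Step a): 1st of the most recent full month, minus the QAQC pad.

    pad_days is an assumed input; look up the real pad for your pipeline.
    """
    if pad_days < 0:
        raise ValueError("pad_days must be non-negative")
    return most_recent_full_month(today) - timedelta(days=pad_days)


def operational_start_month(today: date) -> str:
    """Step e): the most recent full month (YYYY-MM format is an assumption)."""
    return most_recent_full_month(today).strftime("%Y-%m")
```

For example, with today set to 2024-03-15 and a hypothetical pad of 2 days, the most recent full month is February 2024, so START_DATE would be 2024-01-30 and START_MONTH would be 2024-02.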

It is best to make all of these changes in a single Pachyderm transaction. The only pipelines that should be updated with the --reprocess flag are the daily and monthly crons.
After these changes make it through the DAG, the most recent full month of all sites should be published to INT portal. Refresh the cache again and check the output. Request that Blizzard availability monitoring be performed for the relevant product and month for INT portal.
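
One way to script the single-transaction update is to assemble the pachctl commands up front and review them before running anything. The sketch below only builds the command list and executes nothing. It assumes the pachctl start transaction, pachctl update pipeline -f ... --reprocess, and pachctl finish transaction commands are available in your pachctl version. The function name and the spec file names in the usage example are hypothetical placeholders.

```python
import shlex
from typing import List, Sequence, Tuple


def transaction_commands(updates: Sequence[Tuple[str, bool]]) -> List[List[str]]:
    """Build pachctl commands that apply all pipeline updates in one transaction.

    Each item in `updates` is (spec_path, reprocess). Per the checklist, only
    the daily and monthly cron pipelines should have reprocess=True.
    """
    commands = [["pachctl", "start", "transaction"]]
    for spec_path, reprocess in updates:
        cmd = ["pachctl", "update", "pipeline", "-f", spec_path]
        if reprocess:
            cmd.append("--reprocess")
        commands.append(cmd)
    commands.append(["pachctl", "finish", "transaction"])
    return commands


if __name__ == "__main__":
    # Hypothetical spec file names; substitute the real paths in your repo.
    updates = [
        ("SOURCE_TYPE_cron_daily_and_date_control.yaml", True),
        ("SOURCE_TYPE_fill_date_gaps.yaml", False),
        ("PRODUCT_cron_monthly_and_pub_control.yaml", True),
        ("PRODUCT_pub_egress_and_publish.yaml", False),
    ]
    # Print the commands for review instead of running them.
    for cmd in transaction_commands(updates):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to confirm that --reprocess appears only on the two cron pipelines before anything is sent to the cluster.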
Make pull requests to main branches
After successful publication to INT portal, merge the master or main branch from origin into your development branch and carefully resolve any conflicts. Do this for all relevant git repos. Test as needed. Then issue a pull request from your development branch back to the main/master branch. The main branch of most repos is protected, meaning an admin must approve the PR before it can be merged.

Production release and deployment process

Ready to publish data from your data product to the NEON Data Portal? Is the preliminary checklist complete? Great! Deploying to production generally goes like this:

Populate PROD PDR
Populate any new or changed named location properties, groups, thresholds, etc. in the PROD PDR database. In most cases, this can be accomplished by using the PROD SOM interface. Consult as necessary to ensure that any changes will not break existing transitions. When in doubt, populate CERT PDR first and test Airflow transitions in CERT. If the changes break existing Airflow transitions, convene stakeholders to plan for the hand-off or (ideally) find an alternate solution in which both systems can operate in parallel.
Refresh INT & CERT databases
Assuming the metadata and properties above were successfully populated on PROD PDR, request a refresh of INT & CERT databases. This will update INT and CERT to match PROD exactly. In addition to picking up changes to asset installs, active-periods, etc. that were made on PROD during normal operations, the refresh serves as a test that the required metadata was correctly loaded to PROD. Check the pachyderm-dev pipelines after the overnight metadata loaders have run. If the pipelines suddenly break on pachyderm-dev or there are unexpected changes to the output, troubleshoot and correct. Another refresh of INT & CERT is not necessary so long as the missing/incorrect metadata is loaded to all three databases quasi-simultaneously.
Make pull requests to cert branches
Where cert branches exist in relevant Git repos, make PRs to merge the master/main branch into the cert branch. The cert branch of most repos is protected, meaning an admin must approve the PR before it can be merged.
Create GitHub Actions for pachyderm-cert
GitHub Actions allow us to automatically update repos and pipelines in Pachyderm when changes are committed to relevant files in the relevant repos and branches in GitHub. This is how pachyderm-prod is managed, so we test-load pachyderm-cert in the same way. Use the Actions workflows for a similar product as templates (CERT actions only). These are located in the .github/workflows directory of the relevant GitHub repos. Several git repos contain Pachyderm content, so ensure that all pipelines and repos are covered by GitHub Actions. For example, the [SOURCE_TYPE]_avro_schemas repo is managed by GitHub Actions in the NEONScience/NEON-IS-avro-shemas repo as well as the BattelleEcology/neon-avro-schemas repo.
Once the new actions are committed, they can be found in the Actions tab of the associated GitHub repo.
Stand up the DAG in pachyderm-cert
After the CERT Actions are committed and pushed to origin, trigger them manually in the Actions tab to load the DAG to pachyderm-cert. Make sure to first run the Actions for the base repos (e.g. [SOURCE_TYPE]_avro_schemas) before running the Action for the pipelines.
Cron pipelines do not run automatically until their next scheduled run (e.g. overnight for a daily cron). To start processing data immediately, log into pachyderm-cert and run the daily and monthly cron manually. When processing is complete, check the output.
Note that there should be no changes required to Pachyderm pipelines between dev, cert, and prod so long as the pipelines use the generic secrets for the databases and cloud buckets (make sure this is true). The names of the secrets are the same for all Pachyderm environments, but they point to different databases/buckets (i.e. int, cert, prod, respectively).
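
As a rough sketch of the manual cron trigger, the helper below builds the commands for the two cron pipelines from the naming templates used on this page. It assumes your pachctl version provides pachctl run cron and that your active pachctl context points at pachyderm-cert. The function names and the example source type and product values are hypothetical.

```python
from typing import List


def cron_pipeline_names(source_type: str, product: str) -> List[str]:
    """Fill in the naming templates for the daily and monthly cron pipelines."""
    return [
        f"{source_type}_cron_daily_and_date_control",
        f"{product}_cron_monthly_and_pub_control",
    ]


def run_cron_commands(source_type: str, product: str) -> List[List[str]]:
    """Build one `pachctl run cron <pipeline>` command per cron pipeline."""
    return [
        ["pachctl", "run", "cron", name]
        for name in cron_pipeline_names(source_type, product)
    ]


if __name__ == "__main__":
    # "mysensor" and "myproduct" are placeholders, not real NEON names.
    for cmd in run_cron_commands("mysensor", "myproduct"):
        print(" ".join(cmd))
```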
Load the full history in pachyderm-cert
Now it's time to add data in stages to pachyderm-cert. The entire record for all sites should be processed on cert before loading to prod. Update the [SOURCE_TYPE]_cron_daily_and_date_control and [PRODUCT]_cron_monthly_and_pub_control pipelines, moving back the START_DATE and START_MONTH, respectively, a few months at a time. Be sure to use the --reprocess flag when updating each pipeline, and update both in a single transaction so that the update results in a single job.
If you commit these changes to git, commit them only to the cert branch, as we will start over with prod/pachyderm-prod from the master/main branch.
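
To plan the staged backfill, it can help to list the successive start months before touching the pipelines. The sketch below is illustrative only: the function names, the default step of three months (one reading of "a few months at a time"), and the example dates are all assumptions. For each stage, START_DATE would additionally need the QAQC pad subtracted, as in operational mode.

```python
from datetime import date
from typing import List


def add_months(d: date, months: int) -> date:
    """Shift d by a number of months and return the first day of that month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def backfill_stages(current_start: date, record_start: date,
                    step_months: int = 3) -> List[date]:
    """List successive start months, moving back step_months at a time.

    Stops at the month containing record_start (the first month of data).
    """
    if step_months < 1:
        raise ValueError("step_months must be at least 1")
    start = current_start.replace(day=1)
    floor = record_start.replace(day=1)
    stages = []
    while start > floor:
        start = max(add_months(start, -step_months), floor)
        stages.append(start)
    return stages


if __name__ == "__main__":
    # Hypothetical example: currently starting at 2024-02, record begins 2023-06.
    for stage in backfill_stages(date(2024, 2, 1), date(2023, 6, 15)):
        print(stage.strftime("%Y-%m"))
```

Each stage corresponds to one transaction that updates both cron pipelines with --reprocess, followed by a check of the output before moving further back.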
Set product to active (if needed)
When publishing an entirely new product, set the product to active in the SOM Data Product Manager. In the same interface, expand the "Show or hide data product sites to edit statuses" button and set all sites expected to publish data to "Exists".
Refresh the portal cache
After data are processed through the final publication step in Pachyderm, the sync to the portal database will occur overnight automatically. Then the cache must be refreshed manually. Once the cache refresh is complete, the data should be downloadable from the CERT Data Portal.
Run availability monitoring
Spot check output downloaded from the CERT Data Portal, including publication metadata. Request Blizzard availability monitoring on CERT for the full data record.
Deploy to pachyderm-prod
After the output to CERT portal is fully verified, repeat steps 3-9 (from "Make pull requests to cert branches" through "Run availability monitoring") for the prod versions of the GitHub repos/branches, Pachyderm, and NEON databases. Your product is now available to NEON users.

Have a party!