4.5 NEON data release and history (commit) squashing

Every year NEON issues a data release, which is an immutable snapshot of the entire data record for every data product. The release is the only time each year that previously released data may change. Between releases, only the last several months of data (up to 1.5 years of instrument data) are deemed provisional and may be updated at any time. Thus, only provisional data need to remain and be tracked in the Pachyderm system until the next data release. However, Pachyderm is designed to track every change, so removing released data from the system must follow the process below.

Dial back to provisional data

The following steps will reduce the data at the head commit to the provisional period.

  1. Edit the date ranges in the pipeline specifications for the cron_daily_and_date_control and cron_monthly_and_pub_control pipelines to reflect the provisional data period.
  2. In a transaction, update the cron_daily_and_date_control and cron_monthly_and_pub_control pipelines with the --reprocess flag.
  3. Run each of the cron pipelines above (pachctl run cron <pipeline>) in order to place a head commit in their respective tick repos. Before running the monthly cron, ensure that all data for the previous month are processed, since this pipeline may trigger publication for the previous month.
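Steps 2 and 3 above could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling: the helper name dial_back_to_provisional and the spec file paths passed to it are hypothetical, and it assumes the date ranges in both spec files have already been edited by hand (step 1).

```shell
# Hypothetical helper (not part of pachctl).
# Usage: dial_back_to_provisional <daily spec file> <monthly spec file> \
#                                 <daily cron pipeline> <monthly cron pipeline>
dial_back_to_provisional() {
  daily_spec=$1
  monthly_spec=$2
  daily_cron=$3
  monthly_cron=$4

  # Step 2: update both pipelines inside one transaction.
  pachctl start transaction &&
    pachctl update pipeline -f "$daily_spec" --reprocess &&
    pachctl update pipeline -f "$monthly_spec" --reprocess &&
    pachctl finish transaction || {
      echo "pipeline update failed; inspect the open transaction before retrying" >&2
      return 1
    }

  # Step 3: place a head commit in each tick repo.
  # Only run the monthly cron once all data for the previous month are processed.
  pachctl run cron "$daily_cron" || return 1
  pachctl run cron "$monthly_cron"
}
```

For the example DAG used on this page, the third argument would be li191r_cron_daily_and_date_control; the remaining arguments depend on where the specifications are stored.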

Squash the commit history for released data

The following steps will remove the commit history related to the data that is no longer at the head commit.

  1. List the commits for the cron_daily_and_date_control pipeline and find the commit just prior to the pipeline update that dialed back to provisional data (it should be distinguishable by the step change in commit size). In the example below, commit 78b3b95c94dc441e8b7324e49863efa6 should be the tick commit and a130341903494a03bb3efe1d2b6ebbfe should be the pipeline update from the section above. We will squash the history beginning with commit b399eb2087ec4f43a6313174cf359df2 (note its larger size). Starting from this commit, work backward through the history (most recent commits first), following the process below.
$ pachctl list commit li191r_cron_daily_and_date_control

PROJECT REPO                               BRANCH COMMIT                           FINISHED     SIZE     ORIGIN DESCRIPTION
default li191r_cron_daily_and_date_control master 78b3b95c94dc441e8b7324e49863efa6 1 hour ago   9.28KiB AUTO
default li191r_cron_daily_and_date_control master a130341903494a03bb3efe1d2b6ebbfe 1 hour ago   8.61KiB AUTO
default li191r_cron_daily_and_date_control master b399eb2087ec4f43a6313174cf359df2 1 day ago    40.93KiB AUTO
default li191r_cron_daily_and_date_control master 11789940f9a546dbad597e4721979d1a 2 days ago   40.26KiB AUTO
default li191r_cron_daily_and_date_control master 031eeffe16534f9abe9bb577339a15fb 3 days ago   39.59KiB AUTO
default li191r_cron_daily_and_date_control master d3ed247ed800475c899446ec51d76ff9 4 days ago   38.92KiB AUTO
default li191r_cron_daily_and_date_control master 7bc7fd54dded488085a2f18ea2550a86 5 days ago   38.25KiB AUTO
default li191r_cron_daily_and_date_control master e75e36f80fca48bdbec5f52257ace701 6 days ago   37.58KiB AUTO
default li191r_cron_daily_and_date_control master 99f69cc12dcf48b3a9c8bf42f2f5e39f 7 days ago   36.91KiB AUTO
default li191r_cron_daily_and_date_control master fe9b9c041db546828e8be579d882174a 8 days ago   36.24KiB AUTO
default li191r_cron_daily_and_date_control master d954351f84ab47dc9d83fa9a161d266a 9 days ago   35.57KiB AUTO
...
  2. List the direct commits for this commit ID. Execute the following commands:
commit=<commit id>
pachctl list commit $commit |grep $commit

A list should be output. For example:

$ commit=b399eb2087ec4f43a6313174cf359df2
$ pachctl list commit $commit |grep $commit

default li191r_cron_daily_and_date_control_tick          master b399eb2087ec4f43a6313174cf359df2 2 days ago   0B       USER
default li191r_location_loader.meta                      master b399eb2087ec4f43a6313174cf359df2 2 days ago   147.6KiB AUTO
default parQuantumLine_group_loader.meta                 master b399eb2087ec4f43a6313174cf359df2 2 days ago   105.8KiB AUTO
default li191r_location_asset.meta                       master b399eb2087ec4f43a6313174cf359df2 2 days ago   9.462MiB AUTO
default parQuantumLine_group_loader                      master b399eb2087ec4f43a6313174cf359df2 2 days ago   99.69KiB AUTO
default parQuantumLine_threshold.meta                    master b399eb2087ec4f43a6313174cf359df2 2 days ago   573.5KiB AUTO
default li191r_calibration_list_files.meta               master b399eb2087ec4f43a6313174cf359df2 2 days ago   2.851MiB AUTO
default parQuantumLine_srf_loader2                       master b399eb2087ec4f43a6313174cf359df2 2 days ago   116.3KiB AUTO
default li191r_calibration_list_files                    master b399eb2087ec4f43a6313174cf359df2 2 days ago   2.845MiB AUTO
default parQuantumLine_srf_loader.meta                   master b399eb2087ec4f43a6313174cf359df2 2 days ago   122.4KiB AUTO
default li191r_location_asset                            master b399eb2087ec4f43a6313174cf359df2 2 days ago   9.456MiB AUTO
default test_location_asset_e55v2.meta                   master b399eb2087ec4f43a6313174cf359df2 2 days ago 30.2MiB  AUTO
default parQuantumLine_srf_loader2.meta                  master b399eb2087ec4f43a6313174cf359df2 2 days ago 122.4KiB AUTO
default testprod_group_loader_v4                         master b399eb2087ec4f43a6313174cf359df2 2 days ago 99KiB    AUTO
default li191r_location_loader                           master b399eb2087ec4f43a6313174cf359df2 2 days ago 141.5KiB AUTO
default parQuantumLine_srf_loader                        master b399eb2087ec4f43a6313174cf359df2 2 days ago 116.3KiB AUTO
...
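To decide which branch of the procedure applies, it can help to pull just the repo names out of that listing. The sketch below assumes the column layout shown in the example output (repo name in the second column) and that tick repos end in _tick, as li191r_cron_daily_and_date_control_tick does; the helper names repos_at_commit and top_is_tick are hypothetical.

```shell
# Print the repo name (second column) of every direct commit at the given commit ID.
repos_at_commit() {
  pachctl list commit "$1" | grep -F "$1" | awk '{print $2}'
}

# Succeed if the first repo in the listing looks like a tick repo (name ends in _tick).
top_is_tick() {
  case "$(repos_at_commit "$1" | head -n 1)" in
    *_tick) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `top_is_tick "$commit" && echo "tick commit"` would report a tick commit for the listing above.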
  3. If the commit at the top of the list is a tick commit (corresponding to a daily cron run), it is likely that only the tick commit needs to be squashed.
$ repo=li191r_cron_daily_and_date_control_tick # The tick repo
$ pachctl squash commitV2 --recursive $repo@$commit

If the above command returns no error, double-check by listing the direct commits again. The commit should no longer exist, and the listing should return the following error:

$ pachctl list commit $commit

error from InspectCommitSet: no commits found for commitset b399eb2087ec4f43a6313174cf359df2

If so, the commit is successfully squashed and you may return to step 2 for the next commit in the original list (11789940f9a546dbad597e4721979d1a). In fact, you may attempt to do this step automatically for every commit you want to squash. For example:

$ commits=(11789940f9a546dbad597e4721979d1a 031eeffe16534f9abe9bb577339a15fb d3ed247ed800475c899446ec51d76ff9 7bc7fd54dded488085a2f18ea2550a86 e75e36f80fca48bdbec5f52257ace701 99f69cc12dcf48b3a9c8bf42f2f5e39f fe9b9c041db546828e8be579d882174a d954351f84ab47dc9d83fa9a161d266a) # The commits to squash
$ repo=li191r_cron_daily_and_date_control_tick # The tick repo
$ for commit in "${commits[@]}"; do
> echo "pachctl squash commitV2 --recursive $repo@$commit"
> pachctl squash commitV2 --recursive $repo@$commit
> done
$ pachctl list commit li191r_cron_daily_and_date_control

Errors are very likely, and they are not a cause for concern. The list command at the end of the sequence shows the commits that remain and therefore need further steps in order to be squashed.
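A variant of the loop above can record which commits failed instead of relying on scrollback. This is only a sketch; squash_tick_commits is a hypothetical helper that wraps the same pachctl squash commitV2 call used above and prints the failing commit IDs.

```shell
# Try to squash each commit ID in the tick repo.
# Prints the IDs that could not be squashed, one per line.
# Usage: squash_tick_commits <tick repo> <commit id>...
squash_tick_commits() {
  repo=$1
  shift
  for commit in "$@"; do
    # Send pachctl's own output to stderr so stdout carries only the failed IDs.
    if ! pachctl squash commitV2 --recursive "$repo@$commit" >&2; then
      echo "$commit"
    fi
  done
}
```

Each ID it prints still needs the manual handling described in the following steps.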

  4. If the commit at the top of the list generated in step 2 is not a tick commit, it is a pipeline update or a base repo commit. For example, note the spec commit at the top, indicating a pipeline update:
$ commit=99f69cc12dcf48b3a9c8bf42f2f5e39f
$ pachctl list commit $commit |grep $commit

default li191r_cron_daily_and_date_control.spec          master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
default li191r_cron_daily_and_date_control               master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 25.5KiB  AUTO
default li191r_cron_daily_and_date_control.meta          master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 35.87KiB AUTO
default parQuantumLine_level1_group_consolidate_srf.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
default parQuantumLine_level1_group_consolidate_srf      master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 383.4MiB AUTO
default parQuantumLine_level1_group_consolidate_srf.meta master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 7.407GiB AUTO
default parQuantumLine_cron_monthly_and_pub_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
default parQuantumLine_cron_monthly_and_pub_control      master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       AUTO
...
Likely lots more below

Ignoring .spec and .meta entries and working your way from top to bottom, squash each commit and its associated .meta commit for each pipeline and/or repo in the list. For example:

$ repo=li191r_cron_daily_and_date_control
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive $repo.meta@$commit

$ repo=parQuantumLine_level1_group_consolidate_srf
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive $repo.meta@$commit

$ repo=parQuantumLine_cron_monthly_and_pub_control
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive $repo.meta@$commit

Note that base repo commits will not have an associated .meta commit, but it does not hurt to execute the commands anyway. Continue until there are only spec commits at the commit id. These cannot be squashed. For example:

$ pachctl list commit $commit |grep $commit

default li191r_cron_daily_and_date_control.spec          master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
default parQuantumLine_level1_group_consolidate_srf.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
default parQuantumLine_cron_monthly_and_pub_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B       USER
...

If there were no errors in the above process (other than the lack of a .meta commit for base repos), return to step 2 for the next commit.
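When the listing for a commit ID is long, the squash commands for it can be generated rather than typed. The sketch below only prints the commands for review and does not run them; squash_plan is a hypothetical helper, and it assumes the listing format shown above (repo name in the second column).

```shell
# Print, in listing order, the squash commands for every repo at the given
# commit ID, skipping .spec and .meta entries and pairing each repo with its .meta repo.
# Usage: squash_plan <commit id>
squash_plan() {
  commit=$1
  pachctl list commit "$commit" | grep -F "$commit" | awk '{print $2}' |
    grep -v -e '\.spec$' -e '\.meta$' |
    while read -r repo; do
      echo "pachctl squash commitV2 --recursive $repo@$commit"
      echo "pachctl squash commitV2 --recursive $repo.meta@$commit"
    done
}
```

Review the printed commands, then run them one at a time so that any error can be traced to a specific repo.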

  5. Errors in steps 3 or 4 indicate that a downstream pipeline update occurred and must be squashed before the present commit can be squashed. For example, the error should be similar to:
$ repo=li191r_cron_daily_and_date_control
$ commit=fe9b9c041db546828e8be579d882174a
$ pachctl squash commitV2 --recursive $repo@$commit

rpc error: code = NotFound desc = error checking child commit state: get commit by commit key: commit (int_id=0, commit_id=default/parQuantumLine_level1_group_consolidate_srf.user@d600e2d9581040e0b72f0a6be6a563d4) not found

Note that the commit ID in the error message (d600e2d9581040e0b72f0a6be6a563d4) differs from the commit you intended to squash. This is the blocking commit. Using the ID of the blocking commit, follow steps 2-4 to squash it, then return to (and squash) the original commit. If squashing the blocking commit produces more errors, continue deeper into the chain of blocking commits until one can be squashed, then work back up, squashing each previously blocked commit in turn until the original commit can be squashed. Remember, a commit is successfully squashed when only .spec commits (or no commits at all) are shown when listing the commits for that commit ID.
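The blocking commit ID can be pulled out of the error text rather than copied by hand. This sketch assumes the error has the same shape as the example above, with the blocking commit written as a 32-character hexadecimal ID after an @ sign; blocking_commit is a hypothetical helper.

```shell
# Read a squash error message on stdin and print the first 32-character
# hexadecimal commit ID that follows an "@".
blocking_commit() {
  grep -oE '@[0-9a-f]{32}' | head -n 1 | cut -c2-
}
```

For example, `pachctl squash commitV2 --recursive "$repo@$commit" 2>&1 | blocking_commit` prints the ID to work on next, or nothing if the output contains no such ID.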

  6. When you think you have successfully squashed the commit history for all but provisional data, list the commits for every pipeline in the DAG. Only commits more recent than the pipeline update that dialed back to provisional data should be shown.
$ pachctl list commit li191r_cron_daily_and_date_control

PROJECT REPO                               BRANCH COMMIT                           FINISHED     SIZE     ORIGIN DESCRIPTION
default li191r_cron_daily_and_date_control master 78b3b95c94dc441e8b7324e49863efa6 1 hour ago 9.28KiB AUTO
default li191r_cron_daily_and_date_control master a130341903494a03bb3efe1d2b6ebbfe 1 hour ago 8.61KiB AUTO
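As a complementary check, the commit IDs that were meant to be squashed can be re-tested against the rule that a squashed commit shows only .spec commits, or none at all. This is a sketch with hypothetical helper names (remaining_at_commit, unsquashed); it assumes the listing format shown on this page, with the repo name in the second column.

```shell
# Print any non-.spec repos that still hold the given commit ID.
# A listing error (commit set not found) is treated as "nothing remains".
remaining_at_commit() {
  pachctl list commit "$1" 2>/dev/null | grep -F "$1" | awk '{print $2}' | grep -v '\.spec$'
}

# Print each commit ID from the arguments that is not yet fully squashed.
# Usage: unsquashed <commit id>...
unsquashed() {
  for commit in "$@"; do
    if [ -n "$(remaining_at_commit "$commit")" ]; then
      echo "$commit"
    fi
  done
}
```

An empty result from `unsquashed` for the full list of commit IDs suggests nothing was missed; any ID it prints should be revisited from step 2.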
  7. In a few days, check the object store for pachd. It should have shrunk considerably. Note that it can take a few weeks for the space to be reclaimed.