4.5 NEON data release and history (commit) squashing - NEONScience/NEON-IS-data-processing GitHub Wiki
Every year NEON issues a data release, which is an unchangeable snapshot of the entire data record for every data product. This is the only time each year that the data record for previously released data may change. Otherwise, only the last several months of data are deemed provisional (up to 1.5 years of instrument data) and may be updated at any time. Thus, only provisional data need to remain and be tracked in the Pachyderm system until the next data release. However, the Pachyderm processing system is designed to track any and all changes, so removing the released data from the system must follow the process below.
The following steps will reduce the data at the head commit to the provisional period.
- Edit the date ranges in the pipeline specifications for the cron_daily_and_date_control and cron_monthly_and_pub_control pipelines to reflect the provisional data period.
- In a transaction, update the cron_daily_and_date_control and cron_monthly_and_pub_control pipelines with the --reprocess flag.
- Run each of the cron pipelines above (pachctl run cron <pipeline>) in order to place a head commit in their respective tick repos. Before running the monthly cron, ensure that all data for the previous month are processed, since this pipeline may trigger publication for the previous month.
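The three steps above could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling: the function name plan_dial_back and the spec file names daily.yaml and monthly.yaml are hypothetical, and the function only prints the commands so they can be reviewed before anything is run. It assumes your pachctl version provides start transaction, finish transaction, and update pipeline -f <file> --reprocess.

```shell
# Print (do not execute) the commands that dial a DAG back to the provisional period.
# Arguments: daily cron pipeline, monthly cron pipeline, daily spec file, monthly spec file.
# The spec files are assumed to already contain the edited (provisional) date ranges.
plan_dial_back() (
  daily_pipeline=$1
  monthly_pipeline=$2
  daily_spec=$3
  monthly_spec=$4
  # Update both cron pipelines inside a single transaction.
  echo "pachctl start transaction"
  echo "pachctl update pipeline -f $daily_spec --reprocess"
  echo "pachctl update pipeline -f $monthly_spec --reprocess"
  echo "pachctl finish transaction"
  # Place a head commit in each tick repo. Run the monthly cron only after
  # all data for the previous month have been processed.
  echo "pachctl run cron $daily_pipeline"
  echo "pachctl run cron $monthly_pipeline"
)

# Example with the pipeline names used on this page and hypothetical spec files.
plan_dial_back li191r_cron_daily_and_date_control \
  parQuantumLine_cron_monthly_and_pub_control daily.yaml monthly.yaml
```

Printing the plan first keeps the irreversible parts of the process under manual control.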
The following steps will remove the commit history related to the data that is no longer at the head commit.
- List the commits for the cron_daily_and_date_control pipeline. Start with the commit prior to the pipeline update that dialed back to provisional data (should be distinguished by the step change in commit size). In the example below, commit 78b3b95c94dc441e8b7324e49863efa6 should be the tick commit and a130341903494a03bb3efe1d2b6ebbfe should be the pipeline update from the section above. We will squash the history beginning with commit b399eb2087ec4f43a6313174cf359df2; note the larger size. Work backward from this commit (more recent commits first), following the process below.
$ pachctl list commit li191r_cron_daily_and_date_control
PROJECT REPO BRANCH COMMIT FINISHED SIZE ORIGIN DESCRIPTION
default li191r_cron_daily_and_date_control master 78b3b95c94dc441e8b7324e49863efa6 1 hour ago 9.28KiB AUTO
default li191r_cron_daily_and_date_control master a130341903494a03bb3efe1d2b6ebbfe 1 hour ago 8.61KiB AUTO
default li191r_cron_daily_and_date_control master b399eb2087ec4f43a6313174cf359df2 1 day ago 40.93KiB AUTO
default li191r_cron_daily_and_date_control master 11789940f9a546dbad597e4721979d1a 2 days ago 40.26KiB AUTO
default li191r_cron_daily_and_date_control master 031eeffe16534f9abe9bb577339a15fb 3 days ago 39.59KiB AUTO
default li191r_cron_daily_and_date_control master d3ed247ed800475c899446ec51d76ff9 4 days ago 38.92KiB AUTO
default li191r_cron_daily_and_date_control master 7bc7fd54dded488085a2f18ea2550a86 5 days ago 38.25KiB AUTO
default li191r_cron_daily_and_date_control master e75e36f80fca48bdbec5f52257ace701 6 days ago 37.58KiB AUTO
default li191r_cron_daily_and_date_control master 99f69cc12dcf48b3a9c8bf42f2f5e39f 7 days ago 36.91KiB AUTO
default li191r_cron_daily_and_date_control master fe9b9c041db546828e8be579d882174a 8 days ago 36.24KiB AUTO
default li191r_cron_daily_and_date_control master d954351f84ab47dc9d83fa9a161d266a 9 days ago 35.57KiB AUTO
...
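Once the pipeline-update commit has been identified by eye, the IDs of every older commit can be pulled out of the listing instead of being copied by hand. The helper below is a sketch under assumptions: the name older_commits is hypothetical, and it assumes the column layout shown above, with the commit ID in the fourth column and the newest commit first.

```shell
# Print the IDs of all commits listed after (older than) the given commit.
# Reads `pachctl list commit <repo>` output on stdin.
older_commits() {
  awk -v id="$1" '
    # After the reference commit is seen, print every 32-character hex ID.
    found && length($4) == 32 && $4 ~ /^[0-9a-f]+$/ { print $4 }
    # Mark the reference commit; it is not printed itself.
    $4 == id { found = 1 }
  '
}

# Usage (requires a live cluster):
#   pachctl list commit li191r_cron_daily_and_date_control \
#     | older_commits a130341903494a03bb3efe1d2b6ebbfe
```

The output is ordered newest first, which matches the order in which the commits should be squashed.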
- List the direct commits for this commit ID. Execute the following commands:
commit=<commit id>
pachctl list commit $commit |grep $commit
A list should be output. For example:
$ commit=b399eb2087ec4f43a6313174cf359df2
$ pachctl list commit $commit |grep $commit
default li191r_cron_daily_and_date_control_tick master b399eb2087ec4f43a6313174cf359df2 2 days ago 0B USER
default li191r_location_loader.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 147.6KiB AUTO
default parQuantumLine_group_loader.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 105.8KiB AUTO
default li191r_location_asset.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 9.462MiB AUTO
default parQuantumLine_group_loader master b399eb2087ec4f43a6313174cf359df2 2 days ago 99.69KiB AUTO
default parQuantumLine_threshold.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 573.5KiB AUTO
default li191r_calibration_list_files.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 2.851MiB AUTO
default parQuantumLine_srf_loader2 master b399eb2087ec4f43a6313174cf359df2 2 days ago 116.3KiB AUTO
default li191r_calibration_list_files master b399eb2087ec4f43a6313174cf359df2 2 days ago 2.845MiB AUTO
default parQuantumLine_srf_loader.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 122.4KiB AUTO
default li191r_location_asset master b399eb2087ec4f43a6313174cf359df2 2 days ago 9.456MiB AUTO
default test_location_asset_e55v2.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 30.2MiB AUTO
default parQuantumLine_srf_loader2.meta master b399eb2087ec4f43a6313174cf359df2 2 days ago 122.4KiB AUTO
default testprod_group_loader_v4 master b399eb2087ec4f43a6313174cf359df2 2 days ago 99KiB AUTO
default li191r_location_loader master b399eb2087ec4f43a6313174cf359df2 2 days ago 141.5KiB AUTO
default parQuantumLine_srf_loader master b399eb2087ec4f43a6313174cf359df2 2 days ago 116.3KiB AUTO
...
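How a commit is squashed depends on the first row of this listing. The check can be sketched as below. The function name classify_commit_set is hypothetical, and the naming patterns are assumptions drawn from the examples on this page: tick repos end in _tick and pipeline spec repos end in .spec.

```shell
# Classify a commit set by the repo name (column 2) of its first listed entry.
# Reads the output of `pachctl list commit $commit | grep $commit` on stdin.
classify_commit_set() {
  awk 'NR == 1 {
    if ($2 ~ /_tick$/)       print "tick"                # daily cron run
    else if ($2 ~ /\.spec$/) print "pipeline-update"     # spec commit on top
    else                     print "base-repo-or-other"  # e.g. a base repo commit
    exit
  }'
}

# Usage (requires a live cluster):
#   pachctl list commit "$commit" | grep "$commit" | classify_commit_set
```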
- If the commit at the top of the list is a tick commit (corresponding to daily cron run), likely all that is needed is to squash the tick commit.
$ repo=li191r_cron_daily_and_date_control_tick # The tick repo
$ pachctl squash commitV2 --recursive $repo@$commit
If no error results from the above command, double check by listing the direct commits again. The commit should not exist, showing the following error:
$ pachctl list commit $commit
error from InspectCommitSet: no commits found for commitset b399eb2087ec4f43a6313174cf359df2
If so, the commit is successfully squashed and you may return to step 2 for the next commit in the original list (11789940f9a546dbad597e4721979d1a). In fact, you may attempt to do this step automatically for every commit you want to squash. For example:
$ commits=(11789940f9a546dbad597e4721979d1a 031eeffe16534f9abe9bb577339a15fb d3ed247ed800475c899446ec51d76ff9 7bc7fd54dded488085a2f18ea2550a86 e75e36f80fca48bdbec5f52257ace701 99f69cc12dcf48b3a9c8bf42f2f5e39f fe9b9c041db546828e8be579d882174a d954351f84ab47dc9d83fa9a161d266a) # The commits to squash
$ repo=li191r_cron_daily_and_date_control_tick # The tick repo
$ for commit in "${commits[@]}"; do
> echo "pachctl squash commitV2 --recursive $repo@$commit"
> pachctl squash commitV2 --recursive $repo@$commit
> done
$ pachctl list commit li191r_cron_daily_and_date_control
Errors are very likely; this is expected. The list command at the end of the sequence shows the commits that remain, which are the ones that need further steps before they can be squashed.
- If the commit at the top of the list generated in step 2 was not a tick commit, it was a pipeline update or a base repo commit. For example, note the spec commit at the top, indicating a pipeline update:
$ commit=99f69cc12dcf48b3a9c8bf42f2f5e39f
$ pachctl list commit $commit |grep $commit
default li191r_cron_daily_and_date_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
default li191r_cron_daily_and_date_control master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 25.5KiB AUTO
default li191r_cron_daily_and_date_control.meta master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 35.87KiB AUTO
default parQuantumLine_level1_group_consolidate_srf.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
default parQuantumLine_level1_group_consolidate_srf master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 383.4MiB AUTO
default parQuantumLine_level1_group_consolidate_srf.meta master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 7.407GiB AUTO
default parQuantumLine_cron_monthly_and_pub_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
default parQuantumLine_cron_monthly_and_pub_control master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B AUTO
...
Likely lots more below
Ignoring .spec and .meta entries and working your way from top to bottom, squash each commit and its associated .meta commit for each pipeline and/or repo in the list. For example:
$ repo=li191r_cron_daily_and_date_control
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive ${repo}.meta@$commit
$ repo=parQuantumLine_level1_group_consolidate_srf
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive ${repo}.meta@$commit
$ repo=parQuantumLine_cron_monthly_and_pub_control
$ pachctl squash commitV2 --recursive $repo@$commit
$ pachctl squash commitV2 --recursive ${repo}.meta@$commit
Note that base repo commits will not have an associated .meta commit, but it does not hurt to execute the commands anyway. Continue until there are only spec commits at the commit id. These cannot be squashed. For example:
$ pachctl list commit $commit |grep $commit
default li191r_cron_daily_and_date_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
default parQuantumLine_level1_group_consolidate_srf.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
default parQuantumLine_cron_monthly_and_pub_control.spec master 99f69cc12dcf48b3a9c8bf42f2f5e39f 3 weeks ago 0B USER
...
If there were no errors in the above process (other than the lack of a .meta commit for base repos), return to step 2 for the next commit.
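For a commit set that spans many repos, typing each pair of squash commands is tedious. The sketch below generates them from the listing, top to bottom, skipping .spec and .meta rows. The function name squash_plan is hypothetical. It only prints commands, so the plan can be reviewed and then run line by line.

```shell
# Print squash commands for every repo in a commit set.
# Reads `pachctl list commit $commit | grep $commit` on stdin; $1 is the commit ID.
squash_plan() {
  awk -v c="$1" '
    # Skip .spec rows (cannot be squashed) and .meta rows (handled with their repo).
    $4 == c && $2 !~ /\.(spec|meta)$/ {
      print "pachctl squash commitV2 --recursive " $2 "@" c
      # Base repos have no .meta commit; the extra command is harmless for them.
      print "pachctl squash commitV2 --recursive " $2 ".meta@" c
    }
  '
}

# Usage (requires a live cluster):
#   pachctl list commit "$commit" | grep "$commit" | squash_plan "$commit"
```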
- If errors resulted in steps 3 or 4, this indicates that a downstream pipeline update occurred that must be squashed before the present commit can be squashed. For example, the error should be similar to:
$ repo=li191r_cron_daily_and_date_control
$ commit=fe9b9c041db546828e8be579d882174a
$ pachctl squash commitV2 --recursive $repo@$commit
rpc error: code = NotFound desc = error checking child commit state: get commit by commit key: commit (int_id=0, commit_id=default/parQuantumLine_level1_group_consolidate_srf.user@d600e2d9581040e0b72f0a6be6a563d4) not found
Note the different commit ID in the error message (d600e2d9581040e0b72f0a6be6a563d4) compared to the commit you intended to squash. This is the blocking commit. Using the ID of the blocking commit, follow steps 2-4 to squash the blocking commit and then return to (and squash) the original commit. If more errors result when trying to squash the blocking commit, continue deeper into the blocking commits until one can be squashed, then work backward, squashing each previously blocked commit in turn until the original commit can be squashed. Remember, a commit is successfully squashed when only .spec commits (or no commits at all) are shown when listing the commits for a particular commit ID.
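Two small helpers can support this loop: one extracts the blocking commit ID from the error text, and one tests the "only .spec commits remain" condition. Both are sketches with hypothetical names. The first assumes the error message has the form shown above, which may differ between Pachyderm versions.

```shell
# Extract the blocking commit ID from a squash error message read on stdin.
# Assumes the "commit_id=<project>/<repo>@<id>) not found" form shown above.
blocking_commit() {
  sed -n 's/.*commit_id=[^@]*@\([0-9a-f]*\)).*/\1/p'
}

# Succeed if every listed row (column 2) is a .spec repo, or if nothing is listed.
# Reads `pachctl list commit $commit | grep $commit` on stdin.
only_spec_left() {
  ! awk 'NF { print $2 }' | grep -qv '\.spec$'
}

# Usage (requires a live cluster):
#   pachctl squash commitV2 --recursive "$repo@$commit" 2>&1 | blocking_commit
#   pachctl list commit "$commit" | grep "$commit" | only_spec_left && echo squashed
```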
- When you think you have successfully squashed the commit history for all but provisional data, list the commits for every pipeline in the DAG. Only commits more recent than the pipeline update that dialed back to provisional data should be shown.
$ pachctl list commit li191r_cron_daily_and_date_control
PROJECT REPO BRANCH COMMIT FINISHED SIZE ORIGIN DESCRIPTION
default li191r_cron_daily_and_date_control master 78b3b95c94dc441e8b7324e49863efa6 1 hour ago 9.28KiB AUTO
default li191r_cron_daily_and_date_control master a130341903494a03bb3efe1d2b6ebbfe 1 hour ago 8.61KiB AUTO
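Checking every pipeline in the DAG by eye is error-prone, so the test can be expressed as a small function applied to each listing. This is a sketch under assumptions: the name history_trimmed is hypothetical, and it assumes the pipeline-update commit ID appears in each pipeline's listing with the newest commit first.

```shell
# Succeed only if the pipeline-update commit is present and is the oldest commit listed.
# Reads `pachctl list commit <repo>` output on stdin; $1 is the update commit ID.
history_trimmed() {
  awk -v id="$1" '
    # Count any commit rows that appear after (older than) the update commit.
    found && length($4) == 32 { extra++ }
    $4 == id { found = 1 }
    END { if (found && !extra) exit 0; exit 1 }
  '
}

# Usage (requires a live cluster):
#   pachctl list commit li191r_cron_daily_and_date_control \
#     | history_trimmed a130341903494a03bb3efe1d2b6ebbfe || echo "older commits remain"
```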
- In a few days, check the object store for pachd. It should have reduced in size considerably. Note that it can take a few weeks for the space to be reclaimed.