9.1 Useful Pachyderm Commands - NEONScience/NEON-IS-data-processing GitHub Wiki

Change between pachyderm environments

Option 1: Enter the following command in the terminal (change nonprod to whatever is appropriate)

  1. pachctl config set active-context nonprod

Option 2: Manually edit the config file:

  1. In the terminal: vi ~/.pachyderm/config.json
  2. Type i
  3. Change the active_context field to nonprod (or whatever is appropriate)
  4. Hit ESC
  5. Type :wq
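Either way, you can check the result without opening an editor. Below is a minimal sketch that reads the active context with jq from a made-up sample file; the field layout is an assumption modeled on a pachctl 2.x config, so verify it against your own `~/.pachyderm/config.json`:

```shell
# Hypothetical sample of ~/.pachyderm/config.json (structure is an
# assumption -- check your actual file):
cat > /tmp/sample_pach_config.json <<'EOF'
{
  "user_id": "someone",
  "v2": {
    "active_context": "nonprod",
    "contexts": {
      "nonprod": {},
      "prod": {}
    }
  }
}
EOF
# Read the currently active context without opening an editor:
jq -r '.v2.active_context' /tmp/sample_pach_config.json
```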

Look at what your pipeline is doing

Using pachctl inspect pipeline <your_pipeline_name> lets you see the status of your pipeline, how it was configured, and sometimes why it failed.
Failure reasons are only printed if the Docker container for your pipeline fails to initialize. This usually happens when the image specified in your pipeline spec doesn't exist, so check the spelling and image version. If everything looks fine, or the reason for failure isn't apparent, check the job logs for the pipeline.

The following command lists the recent jobs for all pipelines:

pachctl list job --expand

If you see a failure for your pipeline, first look to see if an upstream pipeline failed. This will automatically cause failure for all downstream pipelines. Let's assume that your pipeline is where the problem started, or maybe it was successful but you still want to see what happened in the code when the job ran. Take note of the job ID for your pipeline and see the section below to look at the logs for that job.

Show only your jobs/pipeline/repos

You can pipe the pachctl list job/pipeline/repo output through grep to show only the lines matching a given string, like so:

pachctl list pipeline | grep <SENSOR>
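As an illustration, here is the same grep filter run over a simulated listing. The pipeline names below are made up, and the real command requires a live cluster, so the listing is faked with printf:

```shell
# Simulated `pachctl list pipeline` output (names are made up):
printf '%s\n' \
  'NAME                INPUT CREATED     STATE' \
  'prt_calibration     ...   2 hours ago running' \
  'prt_merge_data      ...   2 hours ago running' \
  'exo2_location_group ...   1 day ago   running' \
  > /tmp/pipelines.txt
# grep keeps only the lines matching the sensor of interest:
grep prt /tmp/pipelines.txt
```

Note that grep also drops the header row; that's normal and the remaining columns are unchanged.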

Look at logs

You can view logs produced by your pipeline with the pachctl logs command. Logs are retained for 24 hours. To view all logs a pipeline produced in that window, use:

pachctl logs --pipeline=<YOUR PIPELINE NAME>

If your pipeline is still running, the -f option keeps the connection open so that any more logs are printed to screen as they come in:

pachctl logs -f --pipeline=<YOUR PIPELINE NAME>

Hit Ctrl+C to exit the connection.

To view logs for a particular job that has run for your pipeline, first find the job ID:

pachctl list job --expand --pipeline=<YOUR PIPELINE NAME>

This will produce something like:

ID                               PIPELINE                  STARTED       DURATION  RESTART PROGRESS  DL UL STATE   
2efb42c31b444f338991b14be0874ad4 pipeline_name             9 minutes ago 4 seconds 0       0 + 5 / 5 0B 0B failure 
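If you do this often, the job ID can be pulled out of that table automatically rather than copied by hand. A small sketch, using a simulated listing with the same job ID as the sample output above (the real command needs a live cluster):

```shell
# Simulated `pachctl list job --expand` output:
printf '%s\n' \
  'ID                               PIPELINE      STARTED       DURATION  RESTART PROGRESS  DL UL STATE' \
  '2efb42c31b444f338991b14be0874ad4 pipeline_name 9 minutes ago 4 seconds 0       0 + 5 / 5 0B 0B failure' \
  > /tmp/jobs.txt
# Grab the most recent job ID: first column of the first data row.
job_id=$(awk 'NR==2 {print $1}' /tmp/jobs.txt)
echo "$job_id"
```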

Copy the job ID, and use it in the following:

pachctl logs --job=<YOUR PIPELINE NAME>@<YOUR-JOB-ID-FROM-ABOVE>

Note that you can also use the -f option here to keep the connection open for future logs.

If nothing prints to screen then nothing has been logged. This could be because your job hasn't started yet, the pipeline failed before it made it to the user code, or you might need to set a more detailed logging level.

Reprocessing a pipeline (nominal)

pachctl update pipeline --reprocess -f [PATH TO FILE]

Example: pachctl update pipeline --reprocess -f ~/R/NEON-IS-data-processing/pipe/pqs1/pqs1_merge_data_by_location.json

Reprocessing a pipeline without having to reload the file for the pipeline spec

For pipeline specs in json format: pachctl inspect pipeline [pipeline_name] --raw -o json | pachctl update pipeline --reprocess

For pipeline specs in yaml format: pachctl inspect pipeline [pipeline_name] --raw -o yaml | pachctl update pipeline --reprocess

To save the current pipeline to a file (instead of reloading it): pachctl inspect pipeline [pipeline_name] --raw -o json > [/path/to/new/file.json]

Downloading a repo from Pachyderm to local

pachctl get file -r [REPO_NAME]@master:/ -o /path/to/place/output/locally

Putting a bunch of files in a pachyderm repo under a single commit

If you just have one file to upload into a pachyderm repo, you can use the standalone pachctl put file command:

pachctl put file <repo>@<branch>:</path/to/file> -f </path/to/local/file>

If you have a whole folder of files to put into a pachyderm repo, you can put the whole folder in using:

pachctl put file -r <repo>@<branch>:</path/to/folder> -f </path/to/local/folder>

Note the -r flag, which means recursive.

If you have a bunch of files or folders that cannot be put into the pachyderm repo with a single pachctl put file... command, it's important to start a commit, put the files, then finish the commit. Why? Every time you use pachctl put file... as a standalone command it creates a commit in the repo you are placing your files in. Each commit will result in a processing job for the pipelines downstream of the repo. This can create a lot of processing overhead, especially if the chain of downstream pipelines is long and you run the command multiple times. Putting all the files in under a single commit is as simple as this:

pachctl start commit <repo>@<branch>
pachctl put file ... as many times as you need to
pachctl finish commit <repo>@<branch>

If you want to be really savvy, use the commit ID that is generated from the pachctl start commit <repo>@<branch> command when you put the files in:

pachctl start commit <repo>@<branch>
3jsnv095mkd0mdjvghasklw305612
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file1>
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file2>
pachctl put file <repo>@3jsnv095mkd0mdjvghasklw305612:</path/to/file3>

Of course, replace that unique ID with the one that is output to the screen after you start the commit. What this allows you to do is view what you've done to make sure all is well before finishing the commit and kicking off a job. First, view what you've done:

pachctl list file <repo>@3jsnv095mkd0mdjvghasklw305612:/**

If you see a problem, you can always start over by deleting the commit before you finish it:

pachctl delete commit <repo>@3jsnv095mkd0mdjvghasklw305612

If everything looks good, then finish the commit:

pachctl finish commit <repo>@3jsnv095mkd0mdjvghasklw305612

Standing up a whole DAG

Rob Markel wrote a python script to stand up large sections of a DAG, assuming you've created all the pipeline specs and the data_source pipelines have been set up (data_source_<SENSOR>_site, data_source_<SENSOR>_linkmerge, data_source_<SENSOR>_list_years). In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:

cd ~/R/NEON-IS-data-processing/utilities

From there, run the following:

python3 -B -m dag.create_dag  --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG> 

For example:

python3 -B -m dag.create_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json

Note that the paths you put into the arguments must be absolute paths (don't use e.g. ~/R/...). If a DAG is complicated you may get some “Pipeline not found” messages in the output when the script runs. You can run the script repeatedly until these disappear. It does not cause any issue to run the script more than once. Note that the script above will only stand up the pipeline specs within a single folder, so if your DAG is spread across multiple folders, you'll need to run the script for each folder.
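One way to handle a DAG spread across multiple folders is a small wrapper loop. The sketch below is a dry run: it only echoes the commands so you can review them before executing. The SPEC_ROOT path, the folder names, and the <last_spec> placeholder are all hypothetical; substitute your own absolute paths and the correct end-node spec for each folder:

```shell
# Dry run: print one create_dag command per spec folder instead of executing.
# SPEC_ROOT, the folder names, and <last_spec> are placeholders.
SPEC_ROOT=/home/NEON/me/NEON-IS-data-processing/pipe
for dir in "$SPEC_ROOT"/exo "$SPEC_ROOT"/soil; do
  echo "python3 -B -m dag.create_dag --spec_dir=$dir --end_node_spec=$dir/<last_spec>.json"
done > /tmp/dag_cmds.txt
cat /tmp/dag_cmds.txt
```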

If you are working on the som development server, all the python packages needed in order to run the script are already installed. No need to read further. If not, you'll need python3 installed. Once you've done that you'll need to install dependency packages. To do so, navigate to the utilities/dag folder of your local NEON-IS-data-processing Git repo. Then:

sudo pip3 install -r requirements.txt
sudo python3 -m pip install graphviz
sudo yum install graphviz

Deleting a whole DAG

Similar to standing up a whole DAG (above), you can delete a whole DAG. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:

cd ~/R/NEON-IS-data-processing/utilities

From there, run the following:

python3 -B -m dag.delete_dag  --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG> 

For example:

python3 -B -m dag.delete_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json

You'll probably get a bunch of warnings for each pipeline you delete, like:

WARNING: If using the --split-txn flag, this command must run until complete. If a failure or incomplete run occurs, then Pachyderm will be left in an inconsistent state. To resolve an inconsistent state, rerun this command.

Accept the warning with a y each time and rerun the whole command until all the related pipelines are deleted. This may take a long time and several passes.

The same notes in the section above about using absolute paths and installing dependencies also apply here. See the "Standing up a whole DAG" section above.

You may encounter provenance errors while deleting the pipelines in the DAG. These typically result from someone having force-deleted one or more pipelines in the past that they shouldn't have. As a reminder, never force-delete a pipeline that is an input to another pipeline. These errors look something like this:

error fixing commit subvenance tempSoil_calibrated_data/1433908606244064999599c74485acb5: /pachyderm_pfs/commits/tempSoil_calibrated_data/1433908606244064999599c74485acb5 not found

These errors are a problem because they won't allow you to delete the pipeline or any pipelines upstream from it. If you get a provenance error when deleting a pipeline that has no other downstream pipelines attached, try this:

pachctl fsck --fix

This command may take a few hours, and hopefully after it completes you will be able to delete the full DAG.

Updating a whole DAG

Similar to standing up a whole DAG (above), you can update a whole DAG, with or without reprocessing. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:

cd ~/R/NEON-IS-data-processing/utilities

From there, run the following if you want to update all the pipelines in a DAG without reprocessing:

python3 -B -m dag.update_dag  --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG> 

For example:

python3 -B -m dag.update_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json

If you want to update with reprocessing, replace dag.update_dag in the code above with dag.update_reprocess_dag. For example:

python3 -B -m dag.update_reprocess_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json

The same notes in the "Standing up a whole DAG" section above about using absolute paths and installing dependencies also apply here.

NOTE: The update script above does all of its updating in a single Pachyderm transaction. If you cancel the script before it completes, or it errors partway through, you MUST finish or delete the transaction manually using pachctl finish transaction or pachctl delete transaction <transaction-id>, respectively. Otherwise, nothing you do afterward will take effect until the transaction is completed/deleted.

Graphing a DAG

There is also a handy utility to graph the DAG based on the pipeline specs in a given folder. In your terminal window, navigate to your local instance of the NEON-IS-data-processing Git repo, and specifically to the utilities folder. For example:

cd ~/R/NEON-IS-data-processing/utilities

From there, run the following:

python3 -B -m dag.graph_dag  --spec_dir=<path to the folder where the pipeline specs are> --end_node_spec=<path to the last pipeline spec in the DAG> 

For example:

python3 -B -m dag.graph_dag --spec_dir=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo --end_node_spec=/home/NEON/csturtevant/R/NEON-IS-data-processing-homeDir/pipe/exo/exo2_named_location_filter.json

This will create a PDF graphic showing how all of the pipelines are connected. You might get an error on Linux saying xdg-open: no method available for opening 'pipeline-graph.pdf'. Just Ctrl+C out of that and open the file from RStudio. The file is saved in the utilities folder of the Git project, so delete it or move it somewhere else before committing your work. Note that the visual layout is generated automatically, so it may not match the linear flow you expect, but it is very useful for showing how the pipelines are connected.

The same notes in the sections above about using absolute paths and installing dependencies also apply here. See the "Standing up a whole DAG" section above.

Putting all pipelines on standby

for pipe in $(pachctl list pipeline --raw | jq -r '.| select(.state=="PIPELINE_RUNNING")|.pipeline.name'); do
echo "Putting pipeline $pipe on standby"
pachctl extract pipeline $pipe -o json | jq -r '.standby = true' | pachctl update pipeline
done
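The jq edit at the heart of that loop can be tried in isolation. Here it is applied to a minimal made-up spec fragment (real pipeline specs have many more fields, which jq passes through untouched):

```shell
# The jq step from the loop above, on a minimal fake spec fragment:
echo '{"pipeline":{"name":"demo"},"standby":false}' | jq '.standby = true'
```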

Deleting commits prior to a certain date

Note: This can only be done on a repo without any upstream pipelines. Earliest commit is deleted first.

#Be sure to edit the repo name and the date string below.
export repo=<your repo name>
for commit in $(pachctl list commit $repo@master --raw | jq -r '.|select(.finished <= "2021-05-01T00:00:00")|.commit.id' | tac); do
echo "Deleting commit $repo@$commit"
pachctl delete commit $repo@$commit
done
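The jq filter in that loop can be tested on its own before pointing it at a real repo. Below it runs over two made-up commit records; only the commit finished before the 2021-05-01 cutoff is selected (in the real loop, tac then reverses the list so the earliest commit is deleted first):

```shell
# The jq date filter from the loop, over two fake commit records:
printf '%s\n' \
  '{"commit":{"id":"aaa"},"finished":"2021-04-15T00:00:00"}' \
  '{"commit":{"id":"bbb"},"finished":"2021-06-01T00:00:00"}' \
  | jq -r '.|select(.finished <= "2021-05-01T00:00:00")|.commit.id'
```

The string comparison works because ISO 8601 timestamps sort lexicographically.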

Deleting pipelines and repos indiscriminately (NEVER DO THIS)

Pachyderm, and common sense, really want you to delete repos 'backwards' from the end of the pipeline to the start. However, you may want to circumvent this and wholesale delete a middle repo using the --force switch.

This will immediately delete the pipeline, even if other pipelines depend on it. DON'T DO THIS. Not only will it break downstream pipelines/repos, it will create a bunch of provenance errors that may not be fixable without wiping away the entire pipeline. I promise, you'll regret using the --force option.
