9.0 Troubleshooting tips

First things first - determine if you need to troubleshoot

The following tips assume you have deployed your pipeline using pachctl create pipeline ... or pachctl update pipeline --reprocess ... and did not get an immediate error at the command line. Unfortunately, that doesn't mean your pipeline ran successfully. There are a few checks you should do to ensure your pipeline created the expected output. If at any point these checks fail, look through the scenarios below for some troubleshooting tips.

  1. Check that the job ran successfully by running pachctl list job --expand and looking for your pipeline near the top of the output. It may take a minute for the job to show up after you deploy the pipeline, and more time for the job to finish. Let's say I deployed the csat3_merge_data_by_location pipeline. When I list the jobs I want to see this (note the STATE = success):
$ pachctl list job --expand

ID                                PIPELINE                        STARTED         DURATION        RESTART    PROGRESS      DL       UL       STATE
77293233518f4f43a34936a735e0a17f  csat3_merge_data_by_location    1 minute ago    44 seconds      0          5 + 0 / 5     72.55MiB 85.66MiB success
  2. If what you see in step 1 looks good, check that you actually have data in your output repository by executing pachctl list repo | grep <your pipeline name>. Note that a job can run successfully but produce no output. Following the above example, I want to see something like this (note the non-zero SIZE):
$ pachctl list repo | grep csat3_merge_data_by_location
NAME                              CREATED         SIZE (MASTER)   DESCRIPTION
csat3_merge_data_by_location      1 minute ago    85.7MiB         Output repo for pipeline csat3_merge_data_by_location.
  3. Check that the structure of the output repo matches your expectations by executing pachctl glob file <your pipeline or repo name>@master:/**. Note that the ** at the end of that command will recursively list everything within the directory. You can get a non-recursive listing by omitting the ** and specifying the exact directory after the : (see the example following this checklist). Continuing the example above, I want to see something like this:
$ pachctl glob file csat3_merge_data_by_location@master:/**
NAME                                                                      TYPE SIZE
/csat3                                                                    dir  85.7MiB
/csat3/2019                                                               dir  85.7MiB
/csat3/2019/01                                                            dir  85.7MiB
/csat3/2019/01/01                                                         dir  16.73MiB
/csat3/2019/01/01/CFGLOC106585                                            dir  16.73MiB
/csat3/2019/01/01/CFGLOC106585/data                                       dir  16.72MiB
/csat3/2019/01/01/CFGLOC106585/data/csat3_CFGLOC106585_2019-01-01.parquet file 16.72MiB
/csat3/2019/01/01/CFGLOC106585/location                                   dir  6.715KiB
/csat3/2019/01/01/CFGLOC106585/location/csat3_46585_locations.json        file 6.715KiB
/csat3/2019/01/02                                                         dir  17.02MiB
/csat3/2019/01/02/CFGLOC106585                                            dir  17.02MiB
/csat3/2019/01/02/CFGLOC106585/data                                       dir  17.01MiB
/csat3/2019/01/02/CFGLOC106585/data/csat3_CFGLOC106585_2019-01-02.parquet file 17.01MiB
/csat3/2019/01/02/CFGLOC106585/location                                   dir  6.715KiB
/csat3/2019/01/02/CFGLOC106585/location/csat3_46585_locations.json        file 6.715KiB
/csat3/2019/01/03                                                         dir  17.27MiB
/csat3/2019/01/03/CFGLOC106585                                            dir  17.27MiB
/csat3/2019/01/03/CFGLOC106585/data                                       dir  17.27MiB
/csat3/2019/01/03/CFGLOC106585/data/csat3_CFGLOC106585_2019-01-03.parquet file 17.27MiB
/csat3/2019/01/03/CFGLOC106585/location                                   dir  6.715KiB
/csat3/2019/01/03/CFGLOC106585/location/csat3_46585_locations.json        file 6.715KiB
  4. If you really want to be thorough, download one or more of those files into your local environment and open them up. Download a file using the pachctl get file command. For example:
$ pachctl get file csat3_merge_data_by_location@master:/csat3/2019/01/01/CFGLOC106585/data/csat3_CFGLOC106585_2019-01-01.parquet -o ~/my_test_data/csat3_CFGLOC106585_2019-01-01.parquet

The path after the -o flag is where you want to put the file in your local environment (assuming Linux). Open it up in an appropriate program like RStudio. For parquet files, use the NEONprocIS.base::def.read.parq function. For example:

> myData <- NEONprocIS.base::def.read.parq(NameFile='~/my_test_data/csat3_CFGLOC106585_2019-01-01.parquet')
> myData
   source_id site_id        readout_time ux_wind_speed uy_wind_speed uz_wind_speed speed_of_sound
1      46585    CPER 2019-01-01 00:00:00      -0.35900      -7.18375      -0.17425        322.658
2      46585    CPER 2019-01-01 00:00:00      -0.40725      -7.17200      -0.09550        322.639
3      46585    CPER 2019-01-01 00:00:00      -0.51550      -7.11800      -0.05375        322.636
4      46585    CPER 2019-01-01 00:00:00      -0.43600      -6.94725      -0.15575        322.637
5      46585    CPER 2019-01-01 00:00:00      -0.83425      -6.97250      -0.08900        322.635
6      46585    CPER 2019-01-01 00:00:00      -0.75225      -7.09700       0.03025        322.635
...
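
As noted in step 3, you can get a non-recursive listing by using a single * instead of **, since * matches only one directory level. For example, this command (using a path from the listing above) would show only the day directories under January 2019:

$ pachctl glob file csat3_merge_data_by_location@master:/csat3/2019/01/*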

If all looks as expected, congratulations! If not, one of the following scenarios probably applies. Read on!

No job is created

If you deployed your pipeline and gave it a minute for the job to show up but you still don't see it in the job list, then Pachyderm could not deploy your pipeline. Find out the reason by executing pachctl inspect pipeline <your pipeline name>. A common reason is that you misspelled the Docker image name or specified an image tag that is no longer available (for example, because it was superseded by a more recent tag). To illustrate, I purposely specified an image tag that I know no longer exists. No job was created, so I inspect the pipeline to find out why (showing only the beginning of the output):

$ pachctl inspect pipeline csat3_merge_data_by_location

Name: csat3_merge_data_by_location
Created: 28 seconds ago
State: crashing
Reason: rpc error: code = Unknown desc = Error response from daemon: manifest for quay.io/battelleecology/neon-is-loc-data-trnc-comb-r:v0.0.1 not found: manifest unknown: manifest unknown
Workers Available: 0/1
Stopped: false
...

Yes, the Reason is a bit cryptic, but you'll learn to recognize these errors.
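
Once you've fixed the image name or tag in your pipeline specification, redeploy it with pachctl update pipeline. For example (the spec file name here is just an illustration; use your own):

$ pachctl update pipeline -f csat3_merge_data_by_location.json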

Another reason why a job may not be created is that the pipeline is stuck in Starting status. See next section.

The pipeline is stuck in Starting status

If no job was created and pachctl inspect pipeline or pachctl list pipeline shows the state as starting, it is typically because a recently deployed upstream pipeline has not run any jobs yet (often because of an error you will find in the Reason field described in the section above). Look at upstream pipelines for the problem, as in the example below.
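
For example, a quick way to walk upstream is to scan the states of all pipelines and then inspect the suspect one for its Reason (here using the upstream pipeline from the earlier example):

$ pachctl list pipeline
$ pachctl inspect pipeline csat3_structure_repo_by_location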

A job completed successfully but you see nothing in the output repo

If your job shows success but you see nothing in the output repo, it's probably because the glob pattern for one of the input repos in your pipeline spec is malformed. A closer look at the job list will probably show something like this:

$ pachctl list job --expand 

ID                                PIPELINE                       STARTED      DURATION            RESTART PROGRESS    DL UL STATE
21466b1836d648a59e57b170642a3cae  csat3_merge_data_by_location   1 second ago Less than a second  0       0 + 0 / 0   0B 0B success

Note the PROGRESS field shows 0 + 0 / 0. The 1st number is the number of datums the job processed, the 2nd is the number of datums it skipped, and the 3rd is the total number of datums identified. If this 3rd number is 0, Pachyderm read your glob pattern and found nothing that matched it. Here's the relevant snippet of the pipeline specification I loaded to generate this scenario:

  "input": {
    "pfs": {
      "name": "DIR_IN",
      "repo": "csat3_structure_repo_by_location",
      "glob": "/prt/*/*/*"
    }
  },

Notice that the input repo name indicates that it holds the csat3 source type, but my glob pattern indicates that datums are located in the /prt directory. While it is theoretically possible for a /prt directory to exist in the csat3_structure_repo_by_location repository, one does not exist in reality, so Pachyderm found no datums. When you see 0 total datums identified in the job listing, closely inspect the glob patterns in your pipeline spec and ensure that the paths actually exist in the input repos. Also see Pachyderm's documentation on Glob patterns.
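
A quick way to check is to list the top level of the input repo and compare what's actually there against your glob pattern. In this example, the listing would show a /csat3 directory rather than /prt:

$ pachctl list file csat3_structure_repo_by_location@master:/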

The job for your pipeline failed

First, it's important that you get to the pipeline that is failing furthest upstream. Once a pipeline fails, ALL downstream pipelines will fail. Usually it is easiest to find which pipeline is responsible by looking at the job list (pachctl list job --expand). Jobs are listed newest first, so scan down the list of failures to the oldest one (the one just above the last successful job); that is the failure furthest upstream.

$ pachctl list job --expand

ID                               PIPELINE                              STARTED            DURATION           RESTART PROGRESS    DL   UL STATE
f5db68e9232343f8bbe6a48ba4e5f16f tempAirTriple_prt                     18 seconds ago     Less than a second 0       0 + 0 / 0   0B   0B failure: inputs failed: IN_PATH
6f524a9fdafe42e1b5d0696959df9136 tempAirTriple_related_location_group  29 seconds ago     Less than a second 0       0 + 0 / 0   0B   0B failure: inputs failed: DATA_PATH
cf4dabc2b1684aa5b570985d5a76adac tempAirTriple_csat3_group_path        37 seconds ago     Less than a second 0       0 + 0 / 0   0B   0B failure: inputs failed: SOURCE_PAT...
83722d4a2c4f4034a6e94a396267eb8e csat3_merge_data_by_location          52 seconds ago     8 seconds          0       0 + 0 / 5   0B   0B failure: datum failed
ba013ba18cec46798b3f6a8b0f4a84ad csat3_structure_repo_by_location      About a minute ago 4 seconds          0       5 + 0 / 5   0B   0B success

Note also that the STATE for the csat3_merge_data_by_location pipeline says failure: datum failed while the downstream pipelines (higher up in the list) say some variation of failure: inputs failed.... The pipeline with a STATE of failure: datum failed is where your problem lies. If the job list is too cluttered with other pipelines, you can also use pachctl list pipeline to find the failing pipeline furthest upstream.
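
If the job list is long, you can also pipe it through grep to show only the failures:

$ pachctl list job --expand | grep failure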

Once you know which pipeline is failing, it's time to look at the logs to see what's going wrong. At this point, the code is encountering an error and hopefully the logs will tell you why. See the Look at Logs section in the Useful Pachyderm Commands page in this Wiki to make sure your pipeline is in a state where you will be able to see the logs. To look at the logs, use pachctl logs --job=<failing pipeline>@<failing job ID>. In the example above, the logs show this:

$ pachctl logs --job=csat3_merge_data_by_location@83722d4a2c4f4034a6e94a396267eb8e 
Fatal error: cannot open file '/flow.loc.data.trnc.comb.bad.R': No such file or directory

In this case, I specified the wrong name for the code to run in the pipeline specification. The code I was supposed to run was /flow.loc.data.trnc.comb.R, which was waiting and ready in the Docker container. Hopefully the logs will show you an error as easy to fix as this one, but they may not. To get the most detailed logs, set the LOG_LEVEL environment variable in the pipeline specification to DEBUG.
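
If the failing job has many datums, it can also help to see which specific datums failed before digging through the logs. In recent Pachyderm versions you can list them by job (check pachctl list datum --help for the exact syntax in your version):

$ pachctl list datum csat3_merge_data_by_location@83722d4a2c4f4034a6e94a396267eb8e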

Some common problems that cause errors in the code are:

  • An output schema that doesn't match the number of columns in the output
  • A bad input parameter to the code (specified in the pipeline spec)
  • Files you created for the empty_files repo that aren't consistent with the actual data files (wrong column names or an incorrect file name; see the quick check after this list)
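
For the last of these, a quick check is to download one of your empty files and an actual data file, read both with the same reader used in the checks above, and compare their columns (the file paths here are hypothetical):

> real <- NEONprocIS.base::def.read.parq(NameFile='~/my_test_data/csat3_CFGLOC106585_2019-01-01.parquet')
> empty <- NEONprocIS.base::def.read.parq(NameFile='~/my_test_data/csat3_empty_file.parquet')
> setdiff(names(real), names(empty))  # any output here means the column names don't match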

If the problem isn't obvious and fixable from the logs, you may have to download some of the data in the input repo to your local environment and possibly step through the code. This is time consuming, so check the logs, input parameters, and input data first and read any documentation in the header of the code that was run. You can find what code is run by looking at the pipeline spec. For example, here's the relevant snippet of the pipeline spec for the csat3_merge_data_by_location pipeline, this time with the correct name of the code:

{
  "pipeline": {
    "name": "csat3_merge_data_by_location"
  },
  "transform": {
    "cmd": [
      "Rscript",
      "/flow.loc.data.trnc.comb.R",
      "DirIn=$DIR_IN",
      "DirOut=/pfs/out",
      "DirSubCombData=data",
      "DirSubCopy=location"
    ],
    "image": "quay.io/battelleecology/neon-is-loc-data-trnc-comb-r:v0.0.23",
    "image_pull_secrets": [
      "battelleecology-quay-read-all-pull-secret"
    ],
    "env": {
      "LOG_LEVEL": "DEBUG",
      "PARALLELIZATION_INTERNAL": "1"
    }
  },
...

The transform:cmd section of the pipeline spec shows that the R script flow.loc.data.trnc.comb.R was run, with several input parameters specified in subsequent lines. You can find this code in the /flow/flow.loc.data.trnc.comb folder of this Git repo. All the science code has detailed descriptions of the valid input parameters. Note also in the pipeline spec above that the LOG_LEVEL (toward the bottom) is set to DEBUG.
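
If you do end up needing to step through the code, you can download one datum's worth of input recursively and run the script against it locally. A rough sketch, assuming you have cloned this repo and installed the NEONprocIS packages locally (the local paths and the exact input path are hypothetical; pass the same parameters the pipeline spec does, pointing DirIn at the downloaded directory):

$ pachctl get file -r csat3_structure_repo_by_location@master:/csat3/2019/01/01 -o ~/my_test_data/input
$ Rscript ~/NEON-IS-data-processing/flow/flow.loc.data.trnc.comb/flow.loc.data.trnc.comb.R DirIn=$HOME/my_test_data/input DirOut=$HOME/my_test_data/output DirSubCombData=data DirSubCopy=location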

The output structure or file contents are not what you expected

If you get output in your output repo but the repo structure or the file contents are wrong, chances are the approach in The job for your pipeline failed section above is the path to figuring out what's wrong. Check the documentation for the code and make sure your glob pattern and input parameters are correct. If you think you've got everything right, you may need to step through the code with test data or seek help.