Troubleshooting the pipeline - NCAR/kcor-pipeline GitHub Wiki

Preliminaries

The following solutions to common issues often require using the production pipeline. The kcor command referenced below is located at:

/hao/acos/sw/pipeline/kcor-pipeline/bin/kcor

This file is a Python script, so you need a Python 3 interpreter in your $PATH. You can use mine by adding the following directory to your $PATH:

/home/mgalloy/anaconda3/bin
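
For a Bash-like shell, adding that directory for the current session might look like the following minimal sketch. The `command -v` check at the end is only an illustrative sanity check, not part of the pipeline.

```shell
# Prepend the Anaconda bin directory named above to PATH for this session only.
export PATH="/home/mgalloy/anaconda3/bin:$PATH"

# Show which python3 the shell now finds first, if any.
command -v python3 || echo "python3 not found in PATH"
```

To make the change permanent, the `export` line could go in your shell startup file, e.g. `~/.bashrc` if you use Bash.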

If you need to run the pipeline (kcor rt, kcor eod, kcor process, etc., as detailed below), you will need a reasonably safe way to keep a terminal session open for a potentially long time, possibly several hours. Running it from its own terminal window on a Linux desktop at CG-1 is probably OK, but running it from a laptop at home is probably not. If the session is interrupted, you will have to clean up after the terminated pipeline in addition to fixing whatever caused you to run the pipeline in the first place. When I am at home, I ssh to a Linux server and use screen there to launch long-running processes.
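
As a rough sketch of the screen approach, the helper below starts a command in a detached screen session so that it keeps running if the ssh connection drops. The function name `run_in_screen` and the session name in the usage note are hypothetical; only the standard `screen -dmS` options are assumed.

```shell
# Start a command in a new, detached screen session.
# $1 is an arbitrary session name; the remaining arguments are the command to run.
run_in_screen() {
    session="$1"
    shift
    # -d -m starts the session detached; -S gives it a name for reattaching later.
    screen -dmS "$session" "$@"
}
```

For example, `run_in_screen kcor-eod kcor eod -f production YYYYMMDD` would launch the end-of-day run, and `screen -r kcor-eod` would reattach to it while it is still running. The session ends when the command exits, so if you want to see the final output, it may be simpler to start an interactive session with `screen -S kcor-eod`, run the command inside it, and detach with Ctrl-a d.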

Realtime issues

The following issues can be addressed, in some cases, on the observing day, but don't reprocess the current day!

The realtime pipeline crashes

If the realtime pipeline crashes in a standard manner, you will receive an email showing the error. You can also check out the realtime log in:

/hao/acos/kcor/logs/2022/YYYYMMDD.realtime.log

You can filter the results to find the error with:

$ kcor log -e /hao/acos/kcor/logs/2022/YYYYMMDD.realtime.log

You might then want to open the log in an editor to examine the context of the error: what was happening right before the crash, or which file was being processed at the time.
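
If you prefer to stay on the command line, a plain grep with leading context can show what preceded each error. This is only a sketch: the pattern `ERROR` is an assumption about how the log marks errors, and the function name `show_error_context` is hypothetical.

```shell
# Print each line matching "ERROR" with its line number, preceded by context lines.
# $1 is the log file; $2 is the number of lines of leading context (default 20).
# NOTE: "ERROR" is an assumed marker; adjust it to match the actual log format.
show_error_context() {
    grep -n -B "${2:-20}" -- "ERROR" "$1"
}
```

For example, `show_error_context /hao/acos/kcor/logs/2022/YYYYMMDD.realtime.log | less` would page through each error along with the 20 lines before it.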

This doesn't happen frequently, so there are no common cases here.

If the error is a crash handling a few files that seem odd for some reason, try renaming the level 0 files to not end in .fts or .fts.gz. This should temporarily allow the pipeline to progress.
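
One way to do the rename, sketched below, is to append an extra suffix so the name no longer ends in .fts or .fts.gz, and to strip it again later. The `.skip` suffix and both function names are arbitrary, hypothetical choices.

```shell
# Hide a level 0 file from the pipeline by appending ".skip" to its name.
hide_level0_file() {
    mv -- "$1" "$1.skip"
}

# Restore a previously hidden file by stripping the trailing ".skip".
restore_level0_file() {
    mv -- "$1" "${1%.skip}"
}
```

For example, `hide_level0_file path/to/odd-file.fts.gz` produces `path/to/odd-file.fts.gz.skip`, and `restore_level0_file path/to/odd-file.fts.gz.skip` puts the original name back once the file should be processed again.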

Validation errors

These are warnings and don't need to be dealt with immediately on the pipeline side. They are meant to alert us to the fact that the raw files have changed in some way the pipeline didn't expect. If you are OK with the change, ignore the warnings. If not, it is an issue for the observation code.

CME detection issues

There aren't any common issues here right now. To troubleshoot issues, look in the CME log file:

/export/data1/Data/KCor/logs/YYYY/YYYYMMDD.cme.log

End-of-day issues

The following issues need to be addressed the day following the observing day. Don't reprocess the current day!

If you have set the pipeline to be run nightly, you should get an email notification of how the production pipeline ran the previous day. Note for HAO users: we run the pipeline at MLSO and in Boulder, so you should get two end-of-day emails.

The end-of-day pipeline fails to run

You will not get an end-of-day email in this case. This happens for two main reasons:

  • the machine log is not present, or
  • the realtime pipeline crashed and left a lock file.

If the machine log is not present, copy it from kodiak as the ldm user: su ldm, perform the copy, and then exit back to your own login. Then run the end-of-day pipeline:

$ kcor eod -f production YYYYMMDD

If the realtime pipeline crashed and left a lock file, you have to:

  • remove the lock file, i.e., /hao/dawn/Data/KCor/raw/YYYYMMDD/.lock,
  • finish the realtime processing, and
  • run the end-of-day processing.

Finishing the realtime processing and running the end-of-day processing are just:

$ kcor rt -f production YYYYMMDD
$ kcor eod -f production YYYYMMDD
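
The three steps can be combined into a small helper, sketched below on the assumption that the realtime pipeline is no longer running when the lock is removed. The function name `recover_from_lock` and the `RAW_ROOT` and `KCOR` override variables are hypothetical conveniences; the lock path and the kcor commands are the ones given above.

```shell
# Remove a stale realtime lock for one day, then finish realtime and run end-of-day.
# $1 is the observing day, e.g. YYYYMMDD.
# RAW_ROOT and KCOR are hypothetical overrides for the raw directory and the kcor command.
recover_from_lock() {
    day="$1"
    lock="${RAW_ROOT:-/hao/dawn/Data/KCor/raw}/$day/.lock"
    kcor_cmd="${KCOR:-kcor}"

    # Remove the lock only if it exists, and stop if removal fails.
    if [ -e "$lock" ]; then
        rm -- "$lock" || return 1
    fi

    # Finish realtime processing first; run end-of-day only if that succeeds.
    "$kcor_cmd" rt -f production "$day" && "$kcor_cmd" eod -f production "$day"
}
```

Calling `recover_from_lock YYYYMMDD` would then perform all three steps in order, skipping the end-of-day run if the realtime run exits with an error.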

There are missing files

The end-of-day email will indicate that there are missing files.

The kcor command can get the missing files for you:

$ cd /hao/dawn/Data/KCor/raw/YYYYMMDD
$ kcor missing -f production -s YYYYMMDD > get_missing.sh
$ chmod +x get_missing.sh
$ su ldm
$ ./get_missing.sh
$ exit

The kcor missing command requires the machine log, so if the machine log is also missing, get it first.

If there were missing files, you now need to reprocess the day and archive the level 0 data, which now includes the previously missing files:

$ kcor process -f reprocess YYYYMMDD
$ kcor archive --level 0 -f production YYYYMMDD

The end-of-day pipeline crashes

Errors will be listed in the end-of-day email. You can also check the end-of-day log at:

/hao/acos/kcor/logs/2022/YYYYMMDD.eod.log

You can filter the results to find the error with:

$ kcor log -e /hao/acos/kcor/logs/2022/YYYYMMDD.eod.log

You might then want to open the log in an editor to examine the context of the error: what was happening right before the crash, or which file was being processed at the time.

This doesn't happen frequently, so there are no common cases here.