An arxiv of grid errors - twongjirad/LArLiteSoftCookBook GitHub Wiki
Running on the grid often involves debugging weird errors by deciphering cryptic error messages. Here is a list of solved issues.
To get error messages, inspect the logs of a job. To fetch them, type
jobsub_fetchlog [email protected]
where [email protected] is an example of a job ID. You'll get a tar file, which you can unpack with tar -zxvf [].tgz. You'll see something like this:
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh_20161030_100952_1555257_0_1_.cmd
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh_20161030_100952_1555257_0_1_wrap.sh
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh_20161030_100952_1555257_0_1_.log
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh_20161030_100952_1555257_0_1_cluster.11390277.0.out
larlite-mcc7_cosmic_extbnb_detsim_to_larlite-v05_08_00.sh_20161030_100952_1555257_0_1_cluster.11390277.0.err
- the .cmd file is the job file sent to condor
- the first script is the script that launches the larsoft job (calls lar)
- _wrap.sh is the script that condor is told to run, which eventually calls the larsoft script
- .log is the condor log file
- .out is the standard output from the job
- .err is the standard error output from the job
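A minimal sketch of a first pass over a fetched log tarball: it unpacks the archive into a scratch directory and prints the last lines of every .out and .err file, which is often where the failure message sits. The function name show_job_logs and the 20-line default are illustrative assumptions and are not part of jobsub.

```shell
# Hypothetical helper (not part of jobsub): unpack a fetched log tarball into
# a scratch directory and print the last lines of every .out and .err file.
show_job_logs() {
    tarball="$1"
    nlines="${2:-20}"    # lines to show per file; 20 is an arbitrary default
    dir="$(mktemp -d)" || return 1
    tar -zxf "$tarball" -C "$dir" || return 1
    # Other files in the tarball (.cmd, .log, scripts) are skipped.
    find "$dir" -type f \( -name '*.out' -o -name '*.err' \) | sort |
    while IFS= read -r f; do
        echo "==== $(basename "$f") ===="
        tail -n "$nlines" "$f"
    done
}
```

Usage would look like show_job_logs my_job_logs.tgz 40, where my_job_logs.tgz stands in for whatever file jobsub_fetchlog returned.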
NOTES OF PAST ERRORS
Trouble determining application name.
This message is found in the .out file. It comes from calling the launch script (.sh) and involves the following block:
export ART_DEBUG_CONFIG=1
APPNAME=`lar -c $FCL 2>&1 > /dev/null | grep process_name: | tr -d '"' | awk '{print $2}'`
if [ $? -ne 0 ]; then
echo "lar -c $FCL failed to run. May be missing a ups product, library, or fcl file."
exit 1
fi
unset ART_DEBUG_CONFIG
if [ x$APPNAME = x ]; then
echo "Trouble determining application name."
exit 1
fi
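What it is trying to do is dump the FCL configuration and then find the name of the job in it. Because of the redirection order (2>&1 > /dev/null), only what lar writes to standard error reaches grep; standard output is discarded. As a test, one can run the same command with the intended fcl file: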
> ART_DEBUG_CONFIG=1 lar -c mcc7_cosmic_extbnb_detsim_to_larlite.fcl 2>&1 > /dev/null | grep process_name: | tr -d '"' | awk '{print $2}'
MCCEXTBNB2LL
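Two shell details in the launch-script block above are easy to miss, and the sketch below demonstrates them with a stand-in command rather than lar itself. The function fake_lar is hypothetical: it only mimics a program that writes to both output streams and then fails.

```shell
# Stand-in for lar (hypothetical): writes one line to stdout, one to stderr,
# then exits with a failure status.
fake_lar() {
    echo "process_name: STDOUT_COPY"
    echo "process_name: STDERR_COPY" >&2
    return 1
}

# 1. With "2>&1 > /dev/null", stderr is first pointed at the pipe and stdout
#    is then discarded, so only the stderr line reaches grep.
name=$(fake_lar 2>&1 > /dev/null | grep process_name: | awk '{print $2}')
echo "captured: $name"

# 2. Unless pipefail is enabled, $? after a pipeline is the exit status of
#    its last command (awk here), not of fake_lar.
fake_lar 2>&1 > /dev/null | grep process_name: | awk '{print $2}' > /dev/null
pipe_status=$?
echo "pipeline status: $pipe_status"
```

Assuming the launch script does not enable pipefail, the second point suggests that the "lar -c $FCL failed to run" branch is rarely reached: when lar fails without printing a process_name line, APPNAME ends up empty and the script reports "Trouble determining application name." instead.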
possible solution: the FCL on the grid isn't working
Note that process_name is usually right at the top of the output. If the command prints nothing, the fcl configuration probably could not be fully assembled, often because a fcl file is missing when running on the grid.
If you modify a fcl file, you need to run mrb i to make sure it gets installed in the correct location. You also need to run make_tar_uboone.sh, which produces the tar file of your uboonecode copy that will run on the grid (see the grid instructions wiki page). For code being run on a worker node, the folder it looks in will be
localProducts_larsoft_v05_08_00_e9_prof/uboonecode/v05_08_00_01/job
Note that v05_08_00_e9_prof and v05_08_00_01 will change with your version of uboonecode.
Make sure that after you run mrb i, all the fcl files you need are there. You can also double-check the tarball from make_tar_uboone.sh: untar the one you think is getting sent to the grid and verify that the fcl file is in the job folder of the tarball output.
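The tarball check can also be done without unpacking anything. The following is a minimal sketch, assuming a tar that detects compression automatically when reading (GNU tar and bsdtar both do); the function name fcl_in_tarball and the tarball name in the usage line are hypothetical.

```shell
# Hypothetical helper: succeed if a file with the given name appears anywhere
# inside a tarball, printing the matching archive paths.
fcl_in_tarball() {
    tarball="$1"
    fcl="$2"
    # "tar -tf" lists archive members without unpacking. awk compares the
    # final path component exactly and exits non-zero if nothing matched.
    tar -tf "$tarball" |
        awk -F/ -v name="$fcl" '$NF == name { print; found = 1 } END { exit !found }'
}
```

For example, fcl_in_tarball my_uboonecode.tar litemc_chstatus.fcl prints the path of the fcl inside the archive, so you can confirm it sits in the job folder; a non-zero exit status means the file never made it into the tarball.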
specific example
I added a new lartify module to produce the data product that contains TPC wire statuses. I made a new fcl, litemc_chstatus.fcl, and included it in my driver fcl file via #include "litemc_chstatus.fcl". Running interactively on the gpvms, it worked just fine. But I failed to get it into the job folder before I tarred up my copy of uboonecode. To fix it, I copied the fcl file into uboone/LiteMaker/job/litemc, ran mrb i, and verified it got installed into the localProducts job folder. I then tarred up the modified copy and tried again. This seemed to fix the problem.
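A quick way to catch this kind of omission before tarring is to compare the #include lines of the driver fcl against the installed job folder. This is a minimal sketch under stated assumptions: the function name check_fcl_includes is hypothetical, includes are assumed to be bare file names without spaces, and includes nested inside other fcl files are not followed.

```shell
# Hypothetical helper: for every #include "..." line in a driver fcl, report
# whether a file of that name exists somewhere under the given job folder.
# Returns non-zero if any include is missing.
check_fcl_includes() {
    driver="$1"
    jobdir="$2"
    missing=0
    # sed extracts the quoted file name from each #include line.
    for inc in $(sed -n 's/^[[:space:]]*#include[[:space:]]*"\([^"]*\)".*/\1/p' "$driver"); do
        if [ -n "$(find "$jobdir" -type f -name "$inc" 2>/dev/null)" ]; then
            echo "ok       $inc"
        else
            echo "MISSING  $inc"
            missing=1
        fi
    done
    return "$missing"
}
```

It would be called with the driver fcl and the localProducts job folder, for example check_fcl_includes mcc7_cosmic_extbnb_detsim_to_larlite.fcl localProducts_larsoft_v05_08_00_e9_prof/uboonecode/v05_08_00_01/job. Anything reported as MISSING still needs to be copied into place and installed with mrb i.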
Signal 1 Hang-up
Possible causes
- Sometimes the code was not built properly.
Possible solutions
- Best practice is to get the code fully debugged so it builds, check the changes into a branch of the uboonecode repo, then do a FRESH install. Make sure during this fresh installation that everything builds properly. If something goes wrong, fix it, blow away the build (mrb z), then make and install (mrb i) again. This has worked for me in the past (hat-tip to Kazu).
Signal 65
This happens when a job attempts to run over files of two different formats together.