Running uboonecode/larsoft on fermigrid

These steps are primarily for MicroBooNE.

Preamble/ramblings

pre-reqs

These instructions assume you have checked out a copy of uboonecode. (Click here to read about how to perform this first step.) If you have made changes to the uboonecode source that you want to run, we also assume you have successfully been able to compile the code. Also, remember to run mrb i -j4 to properly install the code into the localProducts folder of your custom build. The binaries in this localProducts folder are what will be packaged up, sent to the worker nodes, and run.
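
For reference, a typical (re)build-and-install sequence looks roughly like this; I'm assuming the standard mrb environment variables ($MRB_TOP, and $MRB_INSTALL pointing at the localProducts folder) are defined by your setup:

$> cd $MRB_TOP
$> mrbsetenv            # set up the development environment
$> mrb i -j4            # build and install into localProducts
$> ls $MRB_INSTALL      # quick check that localProducts actually got populated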

submission tool: project.py

We use a tool called project.py, which manages the submission of the jobs and the subsequent tracking of a job's success. It is a python script that drives the jobsub tool, Fermilab's interface to condor, which is the job submission system used by Fermigrid.

project xml

project.py is configured using a project xml file. The xml file has to provide the information you might expect:

  • the location of the code to run (our packaged up localProducts folder)
  • the location of the output
  • the input file lists (there are a couple of ways to define this)
  • meta data for controlling job behavior and node requirements

input data and network drives

For the input list, you have basically two choices:

  • a text file containing the full paths of the files
  • the more common choice: a filelist defined by a "samweb definition". samweb is a database tool to register/track files stored on the /pnfs network drive system. Because /pnfs is a network drive system (and a heavily used one at that -- it services all Fermilab experiments as far as I know), you have to be a little careful about how it's used.

One important feature of /pnfs, also known as "dcache", is that files can be stored on the Fermilab tape system. Many of MicroBooNE's official data and MC files are stored there. This fact is, in principle, supposed to be transparent to the user. If you access a file that is on tape, the system will ask a robot to grab the tape the data is on and then copy the file to some location on /pnfs. But it's important to be mindful of this situation, because if you access files on tape in a poor manner (e.g. asking for many individual transfers), you will cause a giant bottleneck for the entire system.

To avoid bottleneck issues when we are trying to process a fairly large data sample, we will chop up the large dataset definition, request to "pre-stage" the files as a block to /pnfs, and then launch our jobs.

output data and network drives

Note that /pnfs is basically the only file system you should access on the worker nodes, other than the /cvmfs areas where the larsoft software packages live. In /pnfs there are two locations we can use:

  • /pnfs/uboone/persistent/users
  • /pnfs/uboone/scratch/users

As the name suggests, scratch is a relatively temporary space. Files that have not been accessed for a couple of weeks will be deleted, leaving room for others to write to it. This is the place you should try to write your files if you know you only need them for a short time. persistent is for more long-term files. It is also a more limited resource, I believe. One must try to avoid storing unneeded files on persistent as much as possible.

tip to avoid relatively frequent error

Note: sometimes you'll see jobs end very early. There are many reasons for this, and you'll have to grab the job logs to figure out what happened (more on this below). But one relatively common problem is that something is messed up in the localProducts build. One tip is to blow away the build and compile/install from scratch. To do this:

$> mrb z
$> cd [to the localProducts folder]
$> rm -r *
$> mrbsetenv
$> mrb i -j4

The first command deletes the contents of the build folder. The second and third commands destroy the existing binaries in the localProducts folder. The last two commands then re-setup the development environment and rebuild/re-install everything from scratch.

Steps in brief

  • re-setup your uboonecode environment (if not already setup). test: does which lar find a binary?

  • tar up your localProducts directory. From your uboonecode's top directory (to go there use: cd $MRB_TOP)

    $> make_tar_uboone.sh larsoft.tar
    

    Note that larsoft.tar can be anything you want.

  • make a work directory. Often, I will make a directory in /uboone/app/users/[username]/

  • in the work directory, we need to setup a project.py xml file

  • either identify a dataset definition you want to use or create one. If the data set is large, chop it up such that about 1000 files are in each subset

  • pre-stage the data set

    $> samweb prestage-dataset --defname=[dataset definition]
    
  • launch the jobs

    $> project.py --xml [project xml] --stage [project stage] --submit
    
  • wait. to check status of your jobs you can run:

    $> jobsub_q --user [username]
    

    (This is a good command to create an alias for. In my ~/.bash_profile I have the following.)

    alias qstat='jobsub_q --user [myusername]'
    
  • when jobs are done, update the project status using

    $> project.py --xml [project xml] --stage [project stage] --check[ana]
    

    The last argument is --check if you are creating larsoft (art) files; it's --checkana if you are making larlite or larcv files.

  • if you have more jobs to submit in your dataset definition run

    $> project.py --xml [project xml] --stage [project stage] --makeup
    
  • repeat this and the previous step until all the jobs are finished, or you are satisfied

Detailed Explanations

Uboonecode environment/setup tarball

We assume you already have a copy of uboonecode, that it's built, and that you have tested that your job runs fine. Make sure to test things first, so you don't waste time on the grid.

If you need files to test with check out:

  /uboone/data/users/tmw/dl_test_files

You'll find a number of different types of files, from MC to data, EXTBNB to BNB events.

If the code is OK, we must make a tarball of the localProducts folder. This will be shipped to the worker node, unpacked, set up, and then run. This is how you get your version to run, instead of using the tagged versions built and installed on /cvmfs.

$> cd $MRB_TOP
$> make_tar_uboone.sh larsoft.tar
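
Before moving on, you can spot-check that the tarball actually contains your localProducts tree (plain tar, nothing uboone-specific):

$> tar tvf larsoft.tar | head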

project.py xml

An example (as of Dec 18) of a project.py xml can be found here. This particular xml file is launching jobs that run the larsoft-to-larcv conversion routines, Supera.

You can get a list of tags and their definitions by typing:

$> project.py -xh

which dumps info that you can find here

Some key tags and notes in the example (a rough skeleton of such a file is sketched after this list):

  • everything in this block defines variables (xml entities) used later in the file, mostly to build file paths for output and log files.

    <!DOCTYPE project [
    ...
    ]>
    
  • when running over events from an input definition, <numevents>1000000</numevents> isn't really being used

  • the info in the <larsoft> block refers to the tarball made in the previous step

  • all fcl files in <fcldir> will take priority over those in the default locations

  • every worker node will run lar -c [fcl file] -s [input]. <stage><fcl> determines the fcl file

  • <inputdef> determines the input. Here, this tells the project to get files from a samweb definition, which will be described later. If using a text file to list the input files, use <inputlist> instead.
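
To make the tags above concrete, here is a rough skeleton of what such an xml might look like. Treat it as a sketch rather than a working configuration: the stage name, paths, and several of the tag choices (<qual>, <numjobs>, <maxfilesperjob>, <datatier>, the out/log/work directories) are placeholders or my assumptions about the usual larbatch tags; the authoritative list is whatever project.py -xh prints for your version.

<?xml version="1.0"?>
<!DOCTYPE project [
<!ENTITY release  "v06_26_01_09">
<!ENTITY name     "my_supera_run">
<!ENTITY username "[username]">
]>

<project name="&name;">

  <numevents>1000000</numevents>

  <!-- which code to run: point <local> at the tarball made with make_tar_uboone.sh -->
  <larsoft>
    <tag>&release;</tag>
    <qual>e10:prof</qual>
    <local>[path to]/larsoft.tar</local>
  </larsoft>

  <stage name="supera">
    <fcl>[your fcl file]</fcl>
    <inputdef>[your chopped-up samweb definition]</inputdef>
    <outdir>/pnfs/uboone/scratch/users/&username;/&name;/out</outdir>
    <logdir>/pnfs/uboone/scratch/users/&username;/&name;/log</logdir>
    <workdir>/pnfs/uboone/scratch/users/&username;/&name;/work</workdir>
    <numjobs>100</numjobs>
    <maxfilesperjob>1</maxfilesperjob>
    <datatier>reconstructed</datatier>
  </stage>

</project>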

Defining input datasets

For now we are assuming we want to process some set of input files. Instructions will differ if you are trying to generate simulation files from scratch. (not covered here.)

One has two options:

  • define an input data set by placing full paths into a text file. The files should be living on /pnfs
  • define an input data set using a samweb definition

The former is straightforward, so we will discuss the latter, which is more common.

Official data and MC data sets are listed on the Analysis Tools website.

Here you'll find a list of data sets, under the header "Data sets". As an example, let's look at the most current MC sample (at the time of this writing); click here for the link. You'll find a number of different datasets for different event samples. The stage to which they have been processed is also indicated.

  • Detsim covers the neutrino (or single particle) generation, particle propagation through the detector (geant4), and then the detector simulation which fills the TPC and photodetector responses.
  • Reco1 is low level reco -- noise filtering, TPC waveform deconvolution, hit finding on the TPC, flash finding on the photodetector waveforms
  • Reco2 is high-level reco -- track/shower finding, vertex proposals
  • anatree -- here the info in larsoft files is translated into simple ntuples
  • larcv (eventually) -- files store the event images
  • larlite (eventually) -- files that store data in larlite format for use for DL analyses

Let's say we are interested in simulated neutrinos+cosmics past reco2; this is prodgenie_bnb_nu_cosmic_uboone. We can get the samweb definition by clicking describe. If one does so, one will go to this page and will find the data set definition listed at the top. Here it's prodgenie_bnb_nu_cosmic_uboone_mcc8.4_reco2.

One can go back to the terminal and use the samweb command line interface to learn more about the definition. Some useful commands:

  • to dump the files in the definition

    samweb list-definition-files [defname]
    
  • to get the location of the file, grab one of the file names and use the following command

    samweb locate-file [filename]
    

    note that you often will not use the file location directly unless troubleshooting a specific file.
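
  • to count the files in a definition without dumping the whole list (assuming the count-definition-files subcommand is available in your samweb version)

    samweb count-definition-files [defname]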

If you were to dump the file list in the definition and count them (samweb list-definition-files [defname] | wc -l) you would find about 10,000 files. Later on, before we launch jobs, we want to "stage" the dataset, i.e. have the tape robots copy the files from tape onto the /pnfs drives. However, it is NOT a good idea to stage so much data at once. You will eat too many resources.

To prevent taking too many disk resources, one should break the dataset definition into more manageable chunks, maybe a list of 1000 or so files each. To do this, we can use this samweb command:

samweb create-definition [username]_[date]_[defname]_p00 "defname: [defname] with stride 10 offset 0"

This command creates a new definition, [username]_[date]_[defname]_p00, that consists of every 10th file from the original definition, starting at offset 0. Note that the name [username]_[date]_[defname]_p00 could be anything, but it is customary to put your username and the date into a definition; the original [defname] is also included here as a reminder of where it came from. To make the other 9 of the 10 definitions and complete the full set, change the offset from 1 through 9.
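
If you are making all ten anyway, a simple shell loop that just repeats the command above for offsets 0 through 9 saves some typing (substitute the bracketed placeholders as before):

for i in $(seq 0 9); do
    samweb create-definition [username]_[date]_[defname]_p0${i} \
        "defname: [defname] with stride 10 offset ${i}"
done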

Now put this definition name into your project.py xml file in the <stage><inputdef> tag.

Choosing the fcl file

As noted earlier, you need to put into the <stage><fcl> the fcl file that will be run. This can be anything. For those interested in DL analysis specific files, refer to this page.

Stage your data set

With the data set definition defined, the fcl file chosen, and both indicated in the xml file, we are about ready to launch jobs. However, before doing this, we need to stage our dataset onto the /pnfs disks. We do this by running:

samweb prestage-dataset --defname=[definition name]

remember to use your small dataset here, not the big dataset.

You'll see a bunch of messages indicating that each file has been staged. After all have been completed, there will be a message indicating so.
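
Prestaging even a 1000-file chunk can take a while, so it can be convenient (this is just a shell trick, nothing samweb-specific) to run it in the background with the output kept in a log file, or to run it inside a screen/tmux session:

nohup samweb prestage-dataset --defname=[definition name] > prestage.log 2>&1 &
tail -f prestage.log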

You might get an error that says you do not have permission to make a dataset. Follow these commands to get access:

setup jobsub_client      # set up the jobsub client tools
kinit [username]         # get a kerberos ticket
kx509                    # convert the kerberos ticket into an x509 certificate
voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/uboone/Role=Analysis   # get a VOMS proxy with the uboone analysis role

launch the first set of jobs

In the xml file, we have tags to indicate the number of jobs to launch and also how many input files are passed to each job. Each worker node has RAM and disk limits. If you wish to pass a lot of input files to the worker node, the disk limit has to be raised accordingly. To make book-keeping a little easier, we often just run one job per file. I like to keep the number of jobs launched to less than 500, in order to not hog the queue -- people's queue etiquette varies.
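
In project.py xml terms, the knobs I'm referring to look something like the fragment below; the tag names are my recollection of the usual larbatch ones (again, check project.py -xh), and the numbers are just placeholders:

<numjobs>250</numjobs>              <!-- how many jobs to launch for this stage -->
<maxfilesperjob>1</maxfilesperjob>  <!-- one input file per job keeps book-keeping simple -->
<memory>2000</memory>               <!-- requested RAM in MB -->
<disk>20GB</disk>                   <!-- requested scratch disk on the worker node -->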

To launch jobs

 $> project.py --xml [xml file] --stage [stage] --submit

[xml file] is, of course, your project.py xml; [stage] is whatever stage you wish to launch. In the above example, it is supera. Note that the ordering of stages in the xml is important. You cannot launch a stage unless there are successful jobs from the previous stage. If you use samweb definitions as input, the completion of jobs is tracked (I believe) and good (larsoft) output files are passed on as input to the subsequent stage. (The author hardly ever runs multi-staged projects.)

You can track your jobs using

 $> jobsub_q --user [username]

It is often useful to run this command at least once a few minutes after launching your job. This way you have a record of the job IDs launched. With those IDs, you can grab job logs which are useful in understanding what went wrong if the job returns with an error.

 $> jobsub_q --user [username] >& myjobs.txt

If the number of jobs submitted was less than the number of files in the definition, one must first check the job output by running

 $> project.py --xml [xml file] --stage [stage] --check[ana]

The last argument is --check if you are creating larsoft (art) files; it's --checkana if you are making larlite or larcv files.

If you have more jobs to submit in your dataset definition run

  $> project.py --xml [project xml] --stage [project stage] --makeup

repeat this and the previous step until all the jobs are finished, or you are satisfied.

Grid hiccups

Jobs running on the grid don't always go smoothly. Here are some common issues:

  • Sometimes you'll check the status of your jobs and find that a job is labeled "H" for "held". This usually comes about when a resource allocation limit has been reached and the job is booted. Resource limits include running time, RAM usage, and disk usage. For larcv conversion jobs, there is a known bug -- which has not been fixed -- that causes jobs to be held. Right now, we just remove such a job with $> jobsub_rm --jobid=[jobid] when it enters the held state.

  • Sometimes a job will return without logs shortly after being launched. This is often due to a bad build or a bad fcl configuration. Make sure your job can run when called on the command line: lar -c [your fcl file] -s [test larsoft file]. If it can, then try clearing the build (instructions near the top) and remaking with mrb i -j4.

  • For whatever reason, sometimes the fcl file seedservice.fcl cannot be found when running on the grid even though it is found when running on the command line. One fix is to copy it from the nutools package directory into your localProducts fcl directory:

    cp $NUTOOLS_DIR/fcl/seedservice.fcl $MRB_TOP/localProducts_larsoft_v06_26_01_08_e10_prof/uboonecode/v06_26_01_09/job/
    

    note that the uboonecode version here is just an example and will depend on the version you set up. Remember to re-tar your code (make_tar_uboone.sh) after copying the fcl file.

Probably not a good idea to kill the condor_dagman

If you have held jobs, it is tempting to remove all jobs at once using jobsub_rm --user=[username]. But it is better to remove held jobs one by one. This is because a crude job removal will also kill the job running condor_dagman, which is the process orchestrating which input files go to which jobs. It keeps track of which input files have been "consumed" and, when you run check[ana], will track whether the consuming job was successful and relaunch it if needed (I believe). If this job is killed, it won't properly talk to the samweb database to indicate which jobs have finished processing, and you risk messing up your project.
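
A minimal sketch of the "one by one" approach (relying on the fact, mentioned above, that held jobs show up with status "H" in the jobsub_q listing):

 $> jobsub_q --user [username] | grep " H "    # find the job ids of the held jobs
 $> jobsub_rm --jobid=[jobid]                  # remove only that job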

Getting job logs to understand what kind of grid errors occurred

Logs are your best friend in troubleshooting errors. You can find logs in two places:

  • the log folder as specified in the project.py xml file
  • logs requested from the submission system itself

The former is just the latter, automatically downloaded for you. But often, errors prevent those logs from ever being copied back. If they were copied back, you can find them in the following places:

  • log folder for each specific worker node/job
  • the project submission as a whole

However, you will often have to fetch the logs yourself:

 jobsub_fetchlog --jobid=[jobid]
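
My understanding (worth double-checking) is that this drops the logs as a tarball into the current directory, so something along these lines gets you to the actual log files; the exact archive name is an assumption on my part:

 jobsub_fetchlog --jobid=[jobid]
 tar -xzf [jobid].tgz     # assumption: the fetched archive is named after the job id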