Running uboonecode on fermigrid
These steps are primarily for MicroBooNE.
These instructions assume you have checked out a copy of uboonecode. (Click here to read about how to perform this first step.) If you have made changes to the uboonecode source that you want to run, we also assume you have successfully compiled the code. Also, remember to run `mrb i -j4` to properly install the code into the `localProducts` folder of your custom build. The binaries in this `localProducts` folder are what will be packaged up, sent to the worker nodes, and run.
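A quick sanity check (assuming a standard mrb setup, in which the `$MRB_INSTALL` variable points at the `localProducts` folder) that the install step actually populated it:
$> echo $MRB_INSTALL   # should point at your localProducts folder
$> ls $MRB_INSTALL     # should contain uboonecode (and any other products you checked out)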
We use a tool called `project.py`, which manages the submission of the jobs and the subsequent tracking of their success. It is a python script that works together with the `jobsub` tool, Fermilab's interface to Condor, the job submission system used by Fermigrid.
`project.py` is configured using a project xml file. We have to give the xml file information you might expect:
- the location of the code to run (our packaged-up `localProducts` folder)
- the location of the output
- the input file lists (there are a couple of ways to define these)
- metadata for controlling job behavior and node requirements
For the input list, you basically have two choices:
- a text file containing the full paths of the files
- the more common choice: a file list defined by a "samweb definition"

samweb is a database tool used to register and track files stored on the `/pnfs` network drive system. As a network drive system (and a heavily used one at that; it services all Fermilab experiments as far as I know), you have to be a little careful about how it is used.
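Before going further, it is worth checking that the samweb client is available in your session. The `sam_web_client` ups product is the usual source, though this is an assumption about your environment (the standard uboonecode setup may already provide it):
$> setup sam_web_client
$> which samweb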
One important feature of `/pnfs`, also known as "dcache", is that files can be stored on the Fermilab tape system. Many of MicroBooNE's official data and MC files are stored there. This fact is, in principle, supposed to be transparent to the user: if you access a file that is on tape, the system will ask a robot to grab the tape the data is on and then copy it to some location on `/pnfs`. But it is important to be mindful of this situation, because if you try to access files on tape in a poor manner (e.g. asking for many individual transfers), you will cause a giant bottleneck for the entire system.
To avoid bottleneck issues when we are trying to process a fairly large data sample, we will chop up the large dataset definition, request to "pre-stage" the files as a block to `/pnfs`, and then launch our jobs.
Note that `/pnfs` is basically the only place you should access on the worker nodes, other than the `/cvmfs` drives where the larsoft software packages can be accessed. In `/pnfs` there are two locations we can use:
- `/pnfs/uboone/persistent/users`
- `/pnfs/uboone/scratch/users`

As the name suggests, `scratch` is a relatively temporary space. Files that have not been accessed for a couple of weeks will be deleted, leaving room for others to write to it. This is the place you should try to write your files if you know you only need them for a short time. `persistent` is for more long-term files. It is also a more limited resource, I believe, so one should avoid storing unneeded files on `persistent` as much as possible.
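When you get to writing the project xml (below), the output and log directories typically point at one of these two areas. For example (the path pieces after the username are purely illustrative):
$> mkdir -p /pnfs/uboone/scratch/users/[username]/myproject/out
$> mkdir -p /pnfs/uboone/scratch/users/[username]/myproject/log
Making the directories by hand like this is also a quick way to confirm that you can write to `/pnfs`.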
Note: sometimes you'll see jobs end very early. There are many reasons for this, and you'll have to grab the job logs to figure out what happened (more on this below). But one relatively common problem is that something is messed up in the `localProducts` build. One tip is to blow away the build and compile/install from scratch. To do this:
$> mrb z
$> cd [to the localProducts folder]
$> rm -r *
$> mrbsetenv
$> mrb i -j4
The first command deletes the build folder. The second and third commands destroy the existing binaries in the `localProducts` folder. The last two commands set up the development environment again and rebuild/reinstall the code.
In summary, the steps are:
- re-setup your uboonecode environment (if not already set up). Test: does `which lar` find a binary?
- tar up your `localProducts` directory. From your uboonecode top directory (to go there, use `cd $MRB_TOP`):
$> make_tar_uboone.sh larsoft.tar
Note that `larsoft.tar` can be named anything you want.
- make a work directory. Often, I will make a directory in /uboone/app/users/[username]/
- in the work directory, set up a project.py xml file
- either identify a dataset definition you want to use or create one. If the data set is large, chop it up such that about 1000 files are in each subset
- pre-stage the data set:
$> samweb prestage-dataset --defname=[dataset definition]
- launch the jobs:
$> project.py --xml [project xml] --stage [project stage] --submit
- wait. To check the status of your jobs you can run:
$> jobsub_q --user [username]
(This is a good command to create an alias for; in my ~/.bash_profile I have `alias qstat='jobsub_q --user [myusername]'`.)
- when jobs are done, update the project status using
$> project.py --xml [project xml] --stage [project stage] --check[ana]
The last argument is `check` if you are creating larsoft (art) files; it's `checkana` if you are making `larlite` or `larcv` files.
- if you have more jobs to submit in your dataset definition, run
$> project.py --xml [project xml] --stage [project stage] --makeup
- repeat this and the previous step until all the jobs are finished, or you are satisfied

Each of these steps is described in more detail below.
We assume you already have a copy of uboonecode, that it's built, and that you have tested that your job runs fine. Make sure to test things first, so you don't waste time on the grid.
If you need files to test with, check out:
/uboone/data/users/tmw/dl_test_files
You'll find a number of different types of files, from MC to data, EXTBNB to BNB events.
If the code is OK, we must make a tarball of the `localProducts` folder. This will be shipped to the worker node, unpacked, set up, and then run. This is how you get your version to run, instead of using the tagged versions built and installed on `/cvmfs`.
$> cd $MRB_TOP
$> make_tar_uboone.sh larsoft.tar
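A quick sanity check that the tarball was actually produced and contains your `localProducts` tree (plain tar, nothing uboone-specific):
$> tar tf larsoft.tar | head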
An example (as of Dec 18) of a project.py xml can be found here. This particular xml file is launching jobs that run the larsoft-to-larcv conversion routines, Supera.
You can get a list of tags and their definitions by typing:
$> project.py -xh
which dumps info that you can find here
Some key tags and notes in the example (a minimal skeleton pulling these together is sketched after this list):
- everything in the `<!DOCTYPE project [ ... ]>` block is being used as variables later in the script, mostly to do with building file paths for output and log files
- when running events from an input def, `<numevents>1000000</numevents>` isn't really being used
- the info in the `<larsoft>` block refers to the tarball made in the previous step
- all fcl files in `<fcldir>` will take priority over those in the default locations
- every worker node will run `lar -c [fcl file] -s [input]`. `<stage><fcl>` determines the fcl file
- `<inputdef>` determines the input. Here, this tells the project to get files from a samweb definition, which will be described later. If using a text file to list files, use `<inputlist>` instead.
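To make the structure concrete, here is a minimal, hypothetical skeleton showing how these tags fit together. All of the names, paths, versions, and numbers are placeholders for illustration only, and a few commonly used stage tags (`<outdir>`, `<logdir>`, `<workdir>`, `<numjobs>`, `<memory>`, `<disk>`) are included as assumptions; copy the real values from the example xml and from your own build.
<!DOCTYPE project [
<!ENTITY user_name "myusername">
<!ENTITY proj_name "supera_test">
]>
<project name="&proj_name;">
  <numevents>1000000</numevents>
  <larsoft>
    <tag>v06_26_01_08</tag>
    <qual>e10:prof</qual>
    <!-- path to the tarball made with make_tar_uboone.sh (placeholder) -->
    <local>/uboone/app/users/&user_name;/larsoft.tar</local>
  </larsoft>
  <!-- fcl files placed here take priority over the default locations -->
  <fcldir>/uboone/app/users/&user_name;/&proj_name;/fcl</fcldir>
  <stage name="supera">
    <fcl>my_supera.fcl</fcl>
    <!-- samweb definition to process (placeholder name) -->
    <inputdef>myusername_20171218_mydef_p00</inputdef>
    <outdir>/pnfs/uboone/scratch/users/&user_name;/&proj_name;/out</outdir>
    <logdir>/pnfs/uboone/scratch/users/&user_name;/&proj_name;/log</logdir>
    <workdir>/pnfs/uboone/scratch/users/&user_name;/&proj_name;/work</workdir>
    <numjobs>100</numjobs>
    <memory>2000</memory>
    <disk>20GB</disk>
  </stage>
</project>
Again, this is only a sketch of the layout; the authoritative list of tags and their meanings comes from `project.py -xh`.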
For now we are assuming we want to process some set of input files. Instructions will differ if you are trying to generate simulation files from scratch (not covered here).
One has two options:
- define an input data set by placing full paths into a text file. The files should live on `/pnfs`
- define an input data set using a samweb definition
The former is straightforward, so we will discuss the latter, which is more common.
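For completeness, the former boils down to writing one full `/pnfs` path per line into a text file and pointing `<stage><inputlist>` at it. A minimal sketch (the directory and file pattern here are hypothetical):
$> find /pnfs/uboone/scratch/users/[username]/my_input_files -name "*.root" > myfiles.list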
Official data and MC data sets are listed on the Analysis Tools website.
Here you'll find a list of data sets, under the header "Data sets". As an example, let's look at the most current MC sample (at the time of this writing). Click here to link. You'll find a number of different datasets for different event samples. Also, the stage at which they have been processed is also indicated.
- Detsim covers the neutrino (or single particle) generation, particle propagation through the detector (geant4), and then the detector simulation which fills the TPC and photodetector responses.
- Reco1 is low level reco -- noise filtering, TPC waveform deconvolution, hit finding on the TPC, flash finding on the photodetector waveforms
- Reco2 is high-level reco -- track/shower finding, vertex proposals
- anatree -- here the info in larsoft files is translated into simple ntuples
- larcv (eventually) -- files that store the event images
- larlite (eventually) -- files that store data in larlite format for use in DL analyses
Let's say we are interested in simulated neutrinos+cosmics past reco2; this is `prodgenie_bnb_nu_cosmic_uboone`. We can get the samweb definition by clicking `describe`. If one does so, one will go to this page and will find the data set definition listed at the top. Here it's `prodgenie_bnb_nu_cosmic_uboone_mcc8.4_reco2`.
One can go back to the terminal and use the samweb command line interface to learn more about the definition. Some useful commands:
- to dump the files in the definition:
samweb list-definition-files [defname]
- to get the location of a file, grab one of the file names and use the following command:
samweb locate-file [filename]

Note that you often will not use the file location directly unless troubleshooting a specific file.
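Two more commands that can be handy here, assuming your version of the samweb client provides them, are to print a definition's dimensions and to count its files without listing them:
$> samweb describe-definition [defname]
$> samweb count-definition-files [defname]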
If you were to dump the file list in the definition and count the files (`samweb list-definition-files [defname] | wc -l`), you would find about 10,000 of them. Later on, before we launch jobs, we want to "stage" the dataset, i.e. have the tape robots copy the files from tape onto the `/pnfs` drives. However, it is NOT a good idea to stage so much data at once: you will eat up too many resources.
To prevent taking up too many disk resources, one should break the dataset definition into more manageable chunks, maybe 1000 or so files each. To do this, we can use this samweb command:
samweb create-definition [username]_[date]_[defname]_p00 "defname: [defname] with stride 10 offset 0"
This command creates a new definition, `[username]_[date]_[defname]_p00`, that consists of every 10th file from the original definition, starting at offset 0. Note that the argument `[username]_[date]_[defname]_p00` could be any name, but it is customary to put your username and the date into a definition name. The original [defname] is also included here to remind one where it came from. To make the other 9 of the 10 definitions and complete the full set, change the offset from 1 through 9; a small loop that creates all ten is sketched below.
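For example, a minimal shell loop that creates all ten sub-definitions at once (the naming scheme simply follows the convention described above):
$> for i in 0 1 2 3 4 5 6 7 8 9; do
     samweb create-definition [username]_[date]_[defname]_p0${i} "defname: [defname] with stride 10 offset ${i}"
   done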
Now put this definition name into your project.py xml file in the `<stage><inputdef>` tag.
As noted earlier, you need to put into `<stage><fcl>` the fcl file that will be run. This can be anything. For those interested in DL analysis specific files, refer to this page.
With the data set definition defined, the fcl file chosen, and both indicated in the xml file, we are about ready to launch jobs. However, before doing this, we need to stage our dataset onto the `/pnfs` disks. We do this with:
samweb prestage-dataset --defname=[definition name]
Remember to use your small dataset here, not the big one.
You'll see a bunch of messages indicating that each file has been staged. After all have been completed, there will be a message indicating so.
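Prestaging even a ~1000-file subset can take a while, so it can be convenient (just a suggestion, not part of the official workflow) to run it in the background and keep a log of the output:
$> nohup samweb prestage-dataset --defname=[definition name] >& prestage.log &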
You might get an error that says you do not have permission to make a dataset. Follow these commands to get access:
setup jobsub_client
kinit [username]
kx509
voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/uboone/Role=Analysis
In the xml file, we have tags to indicate the number of jobs to launch and also how many input files are passed to each job. Each worker node has RAM and disk limits. If you wish to pass a lot of input files to the worker node, the disk limit has to be raised accordingly. To make book-keeping a little easier, we often just run one job per file. I like to keep the number of jobs launched to less than 500, in order to not hog the queue -- people's queue etiquette varies.
To launch jobs
$> project.py --xml [xml file] --stage [stage] --submit
`[xml file]` is, of course, your project.py xml; `[stage]` is whatever stage you wish to launch. In the above example, it is `supera`. Note that the ordering of stages in the xml is important. You cannot launch a stage unless there are successful jobs from the previous stage. If you use samweb definitions as input, the completion of jobs is tracked (I believe) and good (larsoft) output files are passed on as input to the subsequent stage. (The author hardly ever runs multi-staged projects.)
You can track your jobs using
$> jobsub_q --user [username]
It is often useful to run this command at least once a few minutes after launching your job. This way you have a record of the job IDs launched. With those IDs, you can grab job logs which are useful in understanding what went wrong if the job returns with an error.
$> jobsub_q --user [username] >& myjobs.txt
If the number of jobs submitted was less than the number of files in the definition, one must first check the jobs' output by running
$> project.py --xml [xml file] --stage [stage] --check[ana]
The last argument is `check` if you are creating larsoft (art) files; it's `checkana` if you are making `larlite` or `larcv` files.
If you have more jobs to submit in your dataset definition, run
$> project.py --xml [project xml] --stage [project stage] --makeup
Repeat this and the previous step until all the jobs are finished, or you are satisfied.
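Putting the cycle together, a typical round of bookkeeping looks something like the following. The xml and stage names are placeholders, and `--status`, which prints a summary of the project, is an assumption about your version of project.py:
$> project.py --xml myproject.xml --stage supera --checkana
$> project.py --xml myproject.xml --stage supera --makeup
$> project.py --xml myproject.xml --stage supera --status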
Jobs running on the grid don't always go smoothly. Here are some common issues:
- Sometimes you'll check the status of your jobs and find one labeled "H" for "held". This usually comes about when a resource allocation limit has been reached and the job is booted. Resource limits include running time, RAM usage, and disk usage. For larcv conversion jobs, there is a known bug -- which has not been fixed -- that causes jobs to be held. Right now, we just remove such a job
$> jobsub_rm --jobid=[jobid]
when it enters the held state.
- Sometimes a job will return without logs shortly after being launched. This is often due to a bad build or a bad fcl configuration. Make sure your job can run when called on the command line: `lar -c [your fcl file] -s [test larsoft file]`. If it can, then try clearing the build (instructions near the top) and remaking with `mrb i -j4`.
- For whatever reason, sometimes the fcl file `seedservice.fcl` cannot be found when running on the grid even though it is found when running on the command line. One fix is to copy it from the nutools package directory into your local products fcl directory:
cp $NUTOOLS_DIR/fcl/seedservice.fcl $MRB_TOP/localProducts_larsoft_v06_26_01_08_e10_prof/uboonecode/v06_26_01_09/job/
Note that the uboonecode version here is just an example and will depend on the version you set up. Remember to re-tar your code (`make_tar_uboone.sh`) after copying the fcl file.
If you have held jobs, it is tempting to remove all jobs at once using `jobsub_rm --user=[username]`, but it is better to remove held jobs one by one. This is because a crude job removal will also kill the job running `condor_dagman`, which is the node orchestrating which input files go to which jobs. It keeps track of which input files have been "consumed" and, when you run `check[ana]`, will track whether the consuming job was successful and relaunch it if needed (I believe). However, if this job is killed, it won't properly talk to the samweb database to indicate which jobs have finished processing, and you risk messing up your project.
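One way to remove only the held jobs (just a convenience sketch; the grep pattern relies on the "H" status flag appearing in the default jobsub_q listing) is to list them first and then remove them by id:
$> jobsub_q --user [username] | grep " H "
$> jobsub_rm --jobid=[held job id]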
Logs are your best friend in troubleshooting errors. You can find logs in two places:
- the log folder specified in the project.py xml file
- logs requested from the submission system itself

The former is just the latter, automatically downloaded for you, but errors often prevent the former from being retrieved. When it is retrieved, the log folder contains:
- a log folder for each specific worker node/job
- logs for the project submission as a whole
However, you will often have to fetch the logs yourself:
jobsub_fetchlog --jobid=[jobid]