How to Run Supera on uboone gpvm's and fermigrid (i.e. conversion of larlite to larcv files) - twongjirad/LArLiteSoftCookBook GitHub Wiki
Go here for instructions on how to set up the new Supera. Then use this to set up a project.py xml script and run on the grid.
Supera is the name of the program in LArCV that converts larlite data into the LArCV file format used in DL training and analysis.
This assumes you have set up the DL branch of larsoft (either v5 or v6) and have already run litemaker to produce larlite files on the grid. For more info on this, go here.
The steps are a bit involved and are as follows:
- make symlinks to the larlite files
- set up a process ID file which maps the job process ID to a set of larlite files
- make a set of input filelists which provide the paths to the larlite files a job will run on
- copy the process ID file to pnfs, so that it can be transferred to a worker node
- copy the input filelists to pnfs, so that they can be transferred to a worker node
- modify a bash script that your worker nodes on the grid will run
- set up your project.py xml file to run your bash script
It might be helpful to describe what we are trying to achieve here. FermiGrid runs Condor, a software framework for managing jobs run over a cluster. project.py, written by Herb, is the tool we use to submit larsoft jobs (i.e. run the executable lar) on the grid. project.py takes care of setting up the LArSoft environment we specify and handles the transfer of input files, specified either by a text file list or through a samweb definition, to the worker nodes for processing. Instead of figuring out how to talk to FermiGrid directly, we're just going to hijack the project.py routines to run our executable, supera.
To do this, we take advantage of the fact that project.py can run an initialization script before it calls lar. This is specified by the <initscript> tag in a project.py xml file. In this script, we will have the worker node
- download the larlite files to run over, and then
- run supera.
Once it's done, we'll run a dummy larsoft program so that project.py finishes successfully. At the end of the job, any output files will be transferred from the worker node back to wherever we specify in the project.py xml file, typically somewhere on /pnfs.
Therefore, we need to tell the jobs which files to load and what the output file should be called. We also need to keep track of all the jobs that worked. That is the role of the various scripts we'll call.
To set up the input, we will use the link_maker.py script to read in the "good" files from the project.py output files and logs, and make a directory where the files are symlinked with sequentially numbered names. These will be the input files to process. Each file is numbered by some FILEID, and it is through this number that we keep track of what needs to be run and which worker node downloads which files.
How will we tell which files a given worker node should download? First we collect all the FILEIDs that need to be run in a file. This is the job of the make_input_filelists.py script.
For each worker node job, condor provides a process ID via the environment variable PROCESS. We will use the PROCESS variable to get the PROCESS-th line (counting from zero) of the file ID list, which gives us the FILEID to process on that worker node.
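As a minimal sketch (not the actual grid script), and assuming a local copy of the file ID list named procs.txt, the lookup on the worker node boils down to something like:
# sed prints line PROCESS+1 of procs.txt, i.e. the PROCESS-th line counting from zero
fileid=$(sed -n "$((PROCESS+1))p" procs.txt)
echo "PROCESS=${PROCESS} maps to FILEID=${fileid}"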
After the worker nodes are done, we'll check which FILEID files were made successfully, update our FILEID list, and submit more jobs. We repeat this until we are finished.
First, make sure you've set up LArCV on the gpvms. There are now versions of LArCV available on the gpvms via UPS. At the time of this writing, the available versions are:
> ups list -aK+ larcv
"larcv" "v00_01_00" "Linux64bit+2.6-2.12" "e9:prof" ""
"larcv" "v06_09_00" "Linux64bit+2.6-2.12" "e10:prof" ""
"larcv" "v05_09_00" "Linux64bit+2.6-2.12" "e9:prof" ""
If you set up the DL branch of uboonecode, larcv should also be set up in the process. To test it, try
> supera
***********************
RUN SUPERA
***********************
usage: supera [cfg file] [output file] [input 1] [input 2] [input 3] [input 4]
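For reference, a by-hand invocation would look something like the following (the config file name comes from the examples further down this page; the output and input file names here are purely illustrative):
supera supera_bnb_cosmic.cfg supera_out.root larlite_wire_0000.root larlite_opreco_0000.root larlite_opdigit_0000.root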
The following assumes you've set things up via the uboonecode DL branches:
remotes/origin/v05_08_00_01_dl_v00
remotes/origin/v06_11_00_dl_v01
In the fcl folder of the DL uboonecode branch, you'll find a python script, srcs/uboonecode/fcl/deeplearn/dl_production_v00/link_maker.py. The usage is:
> python link_maker.py
usage: link_maker.py [xmlfile] [folder where simlinks should be made]
special: if folder is simply kazu, links made at: '/uboone/data/users/kterao/dl_production_symlink_v00/%s' % UNIQUE_NAME
As the above says, provide the xml file you used to larlitify your larsoft files, and then provide a folder where symlinks to the output of your grid jobs should go. Note: you should put that folder somewhere on PNFS (/pnfs/uboone/scratch/users/ or /pnfs/uboone/persistent/users/).
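For example (the xml file name here is just illustrative):
python link_maker.py litemaker_mcc7_bnb_cosmic.xml /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00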
After you run it, you should find symlinks in the output folder, ordered by number. As an example:
> ifdh ls /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00
...
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_mcinfo_0018.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_mcinfo_0019.root
...
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0018.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0019.root
...
Next we set up the process ID file, which will help the worker node map its process ID to a set of larlite files. To start, say you have N sets of files to get through. We need to make a text file with the sequence of ID numbers from 0 to N-1:
> seq 0 $((N-1)) > original_processid_set.txt
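For example, if you have 20 sets of larlite files (FILEIDs 0 through 19):
> seq 0 19 > original_processid_set.txt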
Open it, and you'll see a bunch of numbers in order:
> cat original_processid_set.txt
0
1
2
3
...
The basic idea is that we will map a process ID to a file ID. The process ID selects a line of the text file (counting from zero), and the number on that line is the file ID. In the above, a process ID of 2 will map to file ID 2, which means supera will process files that look like
larlite_opreco_0002.root
larlite_opdigit_0002.root
larlite_wire_0002.root
...
Before we run supera, we will need to transfer the set of larlite files to the worker node. We can do this through a text file that lists the paths of the files we want. There is a script to build such file lists: in the DL branch, the folder srcs/uboonecode/fcl/deeplearn/gpvmsuperatools contains the script make_input_filelists.py.
For now, Taritree was lazy and there are no command line inputs.
In the script, you need to define the following variables:
variable | meaning |
---|---|
original_procs_file | original process ID file made in the previous step |
out_procfile | output process file where finished file IDs are removed |
tmp_inputlist_dir | folder where the input filelists are written |
larlite_simlink_dir | folder with the larlite symlinks made in a previous step |
supera_outdir | where the supera output files (or symlinks to them) live; this is where the script looks to see which file IDs have been finished |
ismc | whether or not the files are MC; if MC, there are additional files we give to supera |
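After running the script, each file in tmp_inputlist_dir simply lists, one path per line, the larlite files for one FILEID. The list file name and exact contents below are illustrative:
> cat jobfilelists/inputlist_0002.txt
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0002.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_opreco_0002.root
...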
Put the process ID file somewhere on PNFS. We will ask the worker node to copy it from there. For example:
ifdh cp supera_processids.txt [location on PNFS somewhere]
Put the folder of input filelists somewhere on PNFS as well. They will get copied to the worker nodes. Example:
ifdh cp -r jobfilelists /pnfs/uboone/persistent/users/tmw/dl_thrumu/jobfilelists/data_bnb_v00_p00
Note: if you need to delete the directory in order to update the input file lists, try to use ifdh commands. See here.
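For instance, a minimal sketch of clearing out an old filelist directory with ifdh before re-copying an updated one (the path is illustrative, and ifdh rmdir only works once the directory is empty):
listdir=/pnfs/uboone/persistent/users/tmw/dl_thrumu/jobfilelists/data_bnb_v00_p00
for f in $(ifdh ls ${listdir} | grep '\.txt$'); do ifdh rm ${f}; done
ifdh rmdir ${listdir}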
Example supera configuration files can be found at:
- data: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/supera_bnb_cosmic.cfg
- MC: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/supera_mcc7_bnb_cosmic.cfg
In srcs/uboonecode/fcl/deeplearn/gpvmsuperatools, make a copy of either run_grid_supera_example_mc.sh or run_grid_supera_example_data.sh, depending on whether your files are data or MC. We'll have to make some modifications to the variables it defines:
# process ID file
procfile=/pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt
# supera config file path
superacfgpath=/pnfs/uboone/persistent/users/tmw/supera_bnb_cosmic.cfg
# supera cfg file
superacfgname=supera_bnb_cosmic.cfg
# input file list directory
inputlistdir=/pnfs/uboone/persistent/users/tmw/dl_thrumu/jobfilelists/data_extbnb_v00_p00
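Roughly speaking, the script uses these variables along the following lines (this is only a sketch of the idea, not the actual contents of the run_grid_supera_example scripts):
# copy the process ID file and the supera configuration to the worker node
ifdh cp ${procfile} procs.txt
ifdh cp ${superacfgpath} ${superacfgname}
# copy the input filelists, pick this job's FILEID using the condor PROCESS
# number, pull down the larlite files listed for that FILEID, and run supera
ifdh cp -r ${inputlistdir} jobfilelists
fileid=$(sed -n "$((PROCESS+1))p" procs.txt)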
Example project.py xml files can be found at:
- data: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/project_larv5_supera_extbnb_v00_p00.xml
- MC: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/project_larv5_supera_mcc7_bnb_cosmic_v00_p00.xml
Then submit with:
project.py --xml [xml file] --stage supera --submit
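You can keep an eye on the submitted jobs with the usual jobsub tools, e.g.:
jobsub_q --user $USER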
These are the steps for resubmitting. Warning: it will get a little wonky, as we have to work around some aspects of project.py, which wasn't built to do what we're doing.
project.py --xml [xml file] --stage supera --checkana
python link_maker.py [project.py xml file] [directory for your supera simlinks]
This will check whether a supera file with a given FILEID was made AND whether the file makes sense. Then re-run make_input_filelists.py; you'll see something like:
>python make_input_filelists.py
proc file made: procs.txt
processes finished: 143
processes remaining: 150
This indicates that 150 jobs didn't return properly. The list of remaining FILEIDs will be in procs.txt, or whatever output file you told the make_input_filelists.py script to write your FILEIDs to.
Remove your old process file on PNFS and replace it with the updated one:
ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt
ifdh cp procs.txt /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt
To be safe, we should use ifdh rm to get rid of these files, which live on /pnfs.
cat badlist.txt | xargs -n 1 readlink | xargs -n 1 ifdh rm
Notes on what this command does:
- cat prints the contents of the badlist file to standard out
- xargs captures all those entries and sends them, one at a time (-n 1), to readlink, which is a command that prints the target of a symlink to standard out
- xargs then captures all of the symlink targets and, one at a time, sends them to ifdh rm
When you re-run the check, you should see that it finds the same number of missing files. For example:
...
Adding layer two for path /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/events.list.
149 processes with errors.
150 missing files.
This way we do not launch more jobs than we need to.
(Note: of course change the path to your output folder for the examples below.)
There will be an important difference when you check (using checkana) the second and later batches of jobs. When running in file generation mode, project.py uses the process ID to label the jobs. This will cause it to think that you've duplicated files the second time around (and in later rounds). To get around this, go into the log folder and make the bad.list file empty:
ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/bad.list
touch bad.list
ifdh cp bad.list /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/bad.list
This will prevent project.py from deleting good runs just because the process ID is the same (even though the FILEID, which is our important label, is not). You also need to tell it the number of missing jobs. Do this by copying the process list from make_input_filelists.py to the missing.list file:
ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/missing.list
ifdh cp procs.txt /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/missing.list
Now run project.py again, this time with the --makeup option (i.e. project.py --xml [xml file] --stage supera --makeup). It might fail the first time. If so, just run the command again.
Repeat this check-and-resubmit cycle until all of the files have been produced, or you find that there are some stubborn files that probably have problems.