How to Run Supera on uboone GPVMs and FermiGrid (i.e. conversion of larlite to LArCV files)

THESE INSTRUCTIONS ARE DEPRECATED

Go here for instructions on how to set up the new Supera. Then use this to set up a project.py xml script and run on the grid.

How to run LArCV's Supera on GPVMs

Supera is the name of the program in LArCV that converts larlite data into the LArCV file format used in DL training and analysis.

This assumes you have set up the DL branch of larsoft (either v5 or v6) and have already run litemaker to produce larlite files on the grid. For more info on this, go here.

The steps are a bit involved and are as follows:

  • make symlinks to the larlite files
  • set up a process ID file which maps the job process ID to a set of larlite files
  • make a set of input filelists which provide the paths to the larlite files a job will run on
  • copy the process ID file to pnfs, so that it can be transferred to a worker node
  • copy the input filelists to pnfs, so that they can be transferred to a worker node
  • modify a bash script that your worker nodes on the grid will run
  • set up your project.py xml file to run your bash script

A few intro notes

It might be helpful to describe what we are trying to achieve here. FermiGrid runs Condor, a software framework for managing jobs run over a cluster. project.py, written by Herb, is the tool we use to submit larsoft jobs (i.e. run the lar executable) on the grid. project.py takes care of setting up the LArSoft environment we specify and handles the transfer of input files, specified either by a text file list or through a samweb definition, to the worker nodes for processing. Instead of figuring out how to talk to FermiGrid directly, we're just going to hijack the project.py routines to run our executable, supera.

To do this, we take advantage of the fact that project.py can run an initialization script before it calls lar. This is specified by the <initscript> tag in a project.py xml file. In this script, we will have the worker node

  • download larlite files to run and then
  • run supera.

And once it's done, we'll run a dummy larsoft program so that project.py finishes successfully. At the end of the job, any output files will be transferred from the worker node back to wherever we specify in the project.py xml file, typically somewhere on /pnfs.

Therefore, we need to tell the jobs which files to load and what the output file should be called. We also need to keep track of all the jobs that worked. That's the role of the various scripts we'll call.
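
Schematically, the script each worker node runs will do something like the following. This is only a sketch with illustrative file names and paths; the real logic lives in the run_grid_supera_example_*.sh scripts described below.

# sketch of the worker-side flow (file names and paths are illustrative)
ifdh cp /pnfs/path/to/jobfilelists/list_for_this_job.txt filelist.txt   # get the list of larlite files for this job
while read f; do ifdh cp ${f} $(basename ${f}); done < filelist.txt     # pull those larlite files onto the worker
supera supera_bnb_cosmic.cfg supera_out.root larlite_*.root             # run supera on them (cfg, output, inputs)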

To set up the input, we will use the link_maker.py script to read in the "good" files from the project.py output files and logs, and make a directory where the files are symlinked with names that are numbered sequentially. These will be the input files to process. Each file will be numbered by some FILEID, and it is through this number that we will keep track of what needs to be run and which worker node downloads which files.

How will we tell which files a given worker node should download? First, we will collect all the FILEIDs that need to be run into a single file. This is the job of the make_input_filelists.py script.

For each worker node job, condor provides a process ID via the environment variable PROCESS. We will use the PROCESS variable to get the PROCESSth line from the file ID list, which gives us the FILEID to process on that worker node.

After the worker nodes are done, we'll check which FILEID files were made successfully, update our FILEID list, and submit more jobs. We repeat this until we are finished.
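
In bash, that lookup amounts to pulling out line number PROCESS+1 from the list. A minimal sketch, assuming the process ID file has already been copied to the worker as procs.txt:

# PROCESS is set by condor and starts at 0; sed line numbers start at 1
fileid=$(sed -n "$((PROCESS+1))p" procs.txt)
echo "this worker will process FILEID ${fileid}"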

Pre-reqs

First, make sure you've set up LArCV on the gpvms. There are now versions of LArCV available on the gpvms via UPS. At the time of this writing, the available versions are:

> ups list -aK+ larcv
"larcv" "v00_01_00" "Linux64bit+2.6-2.12" "e9:prof" "" 
"larcv" "v06_09_00" "Linux64bit+2.6-2.12" "e10:prof" "" 
"larcv" "v05_09_00" "Linux64bit+2.6-2.12" "e9:prof" "" 

If you set up the DL branch of uboonecode, larcv should also be set up in the process. To test it, try:

> supera
***********************
RUN SUPERA
***********************
usage: supera [cfg file] [output file] [input 1] [input 2] [input 3] [input 4]

The following assumes you've set things up via the uboonecode DL branches:

 remotes/origin/v05_08_00_01_dl_v00
 remotes/origin/v06_11_00_dl_v01

Make symlinks

In the fcl folder of the DL uboonecode branch, you'll find a python script srcs/uboonecode/fcl/deeplearn/dl_production_v00/link_maker.py. The usage is:

> python link_maker.py
usage: link_maker.py [xmlfile] [folder where simlinks should be made]
special: if folder is simply kazu, links made at: '/uboone/data/users/kterao/dl_production_symlink_v00/%s' % UNIQUE_NAME

As the above says, provide the xml file you used to larlitify your larsoft files, and then provide a folder where symlinks to the output of your grid jobs should go. Note: you should put that folder somewhere on PNFS (/pnfs/uboone/scratch/users/ or /pnfs/uboone/persistent/users/).

After you run it, you should find in the output folder symlinks ordered by a number. As an example:

> ifdh ls /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00
...
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_mcinfo_0018.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_mcinfo_0019.root
...
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0018.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0019.root
...

Set up a process ID file

This file will help the worker node map its process ID to a set of larlite files. To start, you should have some number N of file IDs to get through. We need to make a text file with the sequence of ID numbers from 0 to N-1:

> seq 0 $((N-1)) > original_processid_set.txt

Open it, and you'll see a bunch of numbers in order:

> cat original_processid_set.txt
0
1
2
3
...

The basic idea is that we will map a process ID to a file ID. The process ID is the line number in the text file. In the above, process ID of 2 will map to file ID 2, which means supera will process files that look like

larlite_opreco_0002.root
larlite_opdigit_0002.root
larlite_wire_0002.root
...
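
Since the FILEID appears zero-padded to four digits in the file names, a quick way to see which files a given ID maps to (run from inside the symlink directory) is something like:

# illustrative: list the larlite files for FILEID 2
fileid=2
ls larlite_*_$(printf "%04d" ${fileid}).root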

Define input filelists

Before we run supera, we will need to transfer the set of larlite files to the worker node. We can do this through a txt file that has the paths to the files we want. There is a script to build such filelists: in the DL branch, the folder srcs/uboonecode/fcl/deeplearn/gpvmsuperatools contains the script make_input_filelists.py.

For now, Taritree was lazy and there are no command line inputs.

In the script, you need to define the following variables:

  • original_procs_file: the original process ID file made in the previous step
  • out_procfile: the output process file, with finished file IDs removed
  • tmp_inputlist_dir: the folder where the input filelists are written
  • larlite_simlink_dir: the folder with the larlite symlinks made in a previous step
  • supera_outdir: where the supera output files (or symlinks to them) live; this is where the script looks to see which file IDs have already been finished
  • ismc: whether the input is MC or not; if MC, there are additional files we give to supera
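
The script writes the input filelists into tmp_inputlist_dir, roughly one per remaining FILEID. Conceptually, each list is just the full paths of the larlite files for that ID; the filelist name below is only illustrative:

> cat jobfilelists/inputlist_0018.txt
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_mcinfo_0018.root
/pnfs/fnal.gov/usr/uboone/persistent/users/tmw/dl_thrumu/simlinks/mcc7_bnb_cosmic_v00_p00/larlite_wire_0018.root
...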

Copy the process ID file to pnfs

Put it somewhere on PNFS. We will ask the worker node to copy it. For example:

 ifdh cp supera_processids.txt [location on PNFS somewhere]
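
For example, using the same location that procfile points to in the bash script below:

ifdh cp supera_processids.txt /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt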

Copy the input filelists to pnfs

Put the folder of input files somewhere on PNFS. They will get copied to the worker nodes. Example:

ifdh cp -r jobfilelists /pnfs/uboone/persistent/users/tmw/dl_thrumu/jobfilelists/data_bnb_v00_p00

Note: if you need to delete the directory in order to update the input filelists, try to use ifdh commands. See here.

Copy the Supera config file you want to use to PNFS

Examples for

  • data: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/supera_bnb_cosmic.cfg
  • MC: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/supera_mcc7_bnb_cosmic.cfg
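
Copy whichever one you need; for example, with a destination matching superacfgpath in the bash script below:

ifdh cp srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/supera_bnb_cosmic.cfg /pnfs/uboone/persistent/users/tmw/supera_bnb_cosmic.cfg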

Modify a bash script that your worker nodes on the grid will run

In srcs/uboonecode/fcl/deeplearn/gpvmsuperatools, make a copy of either run_grid_supera_example_mc.sh or run_grid_supera_example_data.sh, depending on whether your files are data or MC. We'll have to make some modifications.

set the path to your current process ID file

# process ID file
procfile=/pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt

set the path to your supera cfg file and the cfg file name

# supera config file path
superacfgpath=/pnfs/uboone/persistent/users/tmw/supera_bnb_cosmic.cfg

# supera cfg file
superacfgname=supera_bnb_cosmic.cfg

set the path to your input file lists

# input file list directory
inputlistdir=/pnfs/uboone/persistent/users/tmw/dl_thrumu/jobfilelists/data_extbnb_v00_p00

Set up your XML file

Examples can be found at:

  • data: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/project_larv5_supera_extbnb_v00_p00.xml
  • MC: srcs/uboonecode/fcl/deeplearn/gpvmsuperatools/project_larv5_supera_mcc7_bnb_cosmic_v00_p00.xml

Submit

project.py --xml [xml file] --stage supera --submit
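
You can keep an eye on the submitted jobs with jobsub_q:

jobsub_q --user $USER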

When jobs are done: checking for failed jobs and resubmitting

These are the steps for resubmitting. Warning: it will get a little wonky, as we have to work around some aspects of project.py, which wasn't built to do what we're doing.

Let project.py check if the job ran properly

project.py --xml [xml file] --stage supera --checkana

Make symlinks for the supera files

python link_maker.py [project.py xml file] [directory for your supera symlinks]

Run make_input_filelists.py

This will check if a supera file with a given FILEID was made AND if the file makes sense. You'll see something like:

> python make_input_filelists.py
proc file made:  procs.txt
processes finished:  143
processes remaining:  150

This indicates that 150 jobs didn't return properly. The list of remaining FILEIDs will be in procs.txt, or whatever output file you told make_input_filelists.py to write.
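
As a quick sanity check, the number of lines in the new process file should match the "processes remaining" count:

> wc -l procs.txt
150 procs.txt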

Update the process file

Remove your old process file and update it

ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt
ifdh cp procs.txt /pnfs/uboone/persistent/users/tmw/dl_thrumu/simlinks/procs.txt

Remove your bad files

To be safe, we should use ifdh rm to get rid of these files which live on /pnfs.

cat badlist.txt | xargs -n 1 readlink | xargs -n 1 ifdh rm

Notes on what this command does:

  • cat prints the contents of the badlist file to standard out
  • xargs captures all those entries and sends them, one at a time (-n 1), to readlink, which prints the target of a symlink to standard out
  • xargs then captures all of the symlink targets and, one at a time, sends them to ifdh rm

Call --checkana again

You should see that it finds the same number of missing files. For example:

...
Adding layer two for path /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/events.list.
149 processes with errors.
150 missing files.

Change the number of jobs in your project.py xml file

Set the number of jobs for the stage (the <numjobs> element) to the number of remaining FILEIDs, so that we do not launch more jobs than we need to.

Call --makeup
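
project.py --xml [xml file] --stage supera --makeup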

Repeat (mostly) this last section until all files OK

(Note: of course change the path to your output folder for the examples below.)

There will be an important difference when you check (using --checkana) the second and later batches of jobs. When running in file-generation mode, project.py uses the process ID to label the jobs. This will cause it to think that you've duplicated files the second time around (and later). To get around this, go into the log folder and make the bad.list file empty.

ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/bad.list
touch bad.list
ifdh cp bad.list /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/bad.list

This will prevent project.py from deleting good runs just because the process ID is the same (but not the FILEID, which is our important label). You also need to tell it the number of missing jobs. Do this by copying the process list from make_input_filelists.py to the missing.list file.

ifdh rm /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/missing.list
ifdh cp procs.txt /pnfs/uboone/persistent/users/tmw/dl_thrumu/larv5_supera_mcc7_bnbcosmic/log/v05_08_00/missing.list

Run --makeup again. It might fail the first time. If so, just run the command again.

...or until you find that there are some stubborn files that probably have real problems.
