
Running the Analysis Chain using the DLLEE Singularity Containers

These instructions guide you through running any of the steps of the DLLEE Analysis Chain using Singularity containers. So far, the systems with Singularity set up to run them are the Tufts cluster and the McCaffrey workstation at Michigan.

The chain consists of the following steps:

  • Supera: convert LArSoft to LArCV/larlite formats
  • Tagger
  • SSNet
  • Vertex
  • Tagger Ana Metrics
  • Vertex Ana Metrics

The goal is to have a set of events we will use for weekly integration tests. You are also free to use them to see how a change improves the analysis. We plan to have the following samples, each with roughly 1,000 or more events, available for development:

  • MC: BNB nue intrinsics with a 10 cm FV cut and a 1e1p final state cut
  • MC: BNB numu events with a 10 cm FV cut, a 10 cm containment cut on the muon endpoint, and a 1mu1p final state cut
  • MC: CORSIKA cosmic-only events
  • MC: BNB intrinsic nue only (i.e. no cosmics) using the same cuts as the above BNB nue intrinsics sample
  • MC: NCpi0 sample
  • MC: full BNB cocktail sample
  • DATA: EXT-BNB events

Making a container

To build a Singularity container you will need a machine where you can run sudo commands in Linux. This will usually mean your laptop. Mac OS X users will probably have to install a Linux VM, e.g. with Vagrant.

You can follow the installation instructions for Singularity here.

Instructions for OS X are here. WARNING: the instructions use Homebrew to install some dependencies. If you use MacPorts, do not install Homebrew as well; mixing the two can make your system unstable if you are not careful.
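If you go the VM route, a minimal Vagrant sketch looks like the following. The ubuntu/xenial64 box name is only an assumption (any recent Ubuntu box works), and you still need to install Singularity inside the VM following the instructions linked above.

mkdir singularity-vm && cd singularity-vm
vagrant init ubuntu/xenial64   # write a Vagrantfile for an Ubuntu 16.04 box (assumed choice)
vagrant up                     # boot the VM
vagrant ssh                    # log in; install Singularity inside, then build images there with sudo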

DLLEE Unified container: for tagger, vertex, vertex analysis, tagger analysis stages

To make the container, first clone the repo containing the bootstrap file (i.e. the file that tells Singularity how to build the container), then go into the folder.

git clone https://github.com/twongjirad/singularity-dllee-ubuntu
cd singularity-dllee-ubuntu

Make an empty image with an allocated space of 6 GB. The last argument is the name of the image. You can call it whatever you want. (One convention is to keep the date at the end of the filename for bookkeeping purposes.)

singularity create --size 6000 singularity-dllee-unified-070417.img

Next, build the image using the bootstrap file

sudo singularity bootstrap singularity-dllee-unified-070417.img Singularity

Now we have to go into the container and copy the MicroBooNE photon library data into it. First, place the file in /tmp (make that folder on your computer or VM if you don't have it). If you don't have a copy of the library, just ask someone where you can get one.

cp uboone_photon_library_v6_efield.root /tmp/

We copy the file to this location because when we load the image, this folder will be mounted in the container. So, now go into the container. You'll first get a prompt that looks like Singularity>. Type bash to start a bash shell. These commands go something like this:

sudo singularity shell --writable singularity-dllee-unified-070417.img
Singularity> bash

Now copy the photon library into the right place.

cp /tmp/uboone_photon_library_v6_efield.root /usr/local/share/dllee_unified/larlite/UserDev/SelectionTool/OpT0Finder/PhotonLibrary/dat/

Finally, exit the container. You'll have to type exit twice (the first time to leave the container's bash shell and the second time to leave the container itself).
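That is:

exit   # leave the container's bash shell
exit   # leave the container itself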

That's it. Now transfer the container to either Tufts or Michigan.
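For example, with scp; the hostname and destination directory below are placeholders, so use your own login node and a directory you own in the lab area:

scp singularity-dllee-unified-070417.img username@<tufts-login-node>:/cluster/tufts/wongjiradlab/username/images/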

SSNet container: for ssnet stage

Instructions to come later. We don't anticipate this container changing much.

SUPERA

Instructions for this step are on the main page of this wiki. You can follow this link.

COSMIC TAGGER/CROI SELECTION

On Tufts

Go to your home directory. This directory will look something like this

/cluster/home/username

I'll distinguish this area, which can only hold about 3 GB, from the wongjiradlab space, which is tens of TB in size and is located at

 /cluster/tufts/wongjiradlab

or, equivalently, at:

/cluster/kappa/90-days-archive/wongjiradlab

Note that the former, /cluster/tufts, cannot be seen from inside the containers, so when referring to this folder in job scripts it is better to use the latter, /cluster/kappa.
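If you want to confirm this for yourself, a quick check from a Tufts node (substituting your own image path) would be something like:

module load singularity
singularity exec /path/to/your/container.img ls /cluster/kappa/90-days-archive/wongjiradlab   # visible inside the container
singularity exec /path/to/your/container.img ls /cluster/tufts                                # should fail: not visible inside the container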

Because the /home directories are so limited in capacity, it is recommended to create a workspace in the wongjiradlab folders. Just go to

/cluster/tufts/wongjiradlab

and make a folder with your username, for example

mkdir /cluster/tufts/wongjiradlab/obama

Go into your working folder and clone the scripts repository for the tagger

cd /cluster/tufts/wongjiradlab/obama
git clone https://github.com/twongjirad/dllee-tufts-slurm.git

Go into this folder. The first thing to do is to prepare the list of jobs to run. To do so, open make_inputlists.py and choose the sample you want to run by uncommenting the two variables that indicate where the source files are. Edit the location of the larcv or larlite files if needed, or, if you made your own, point to the location of your files. For example, this will look like

...
# MCC8.1 Samples                                                                                                                                                                                            
# --------------                                                                                                                                                                                            

# MCC8.1 nue+MC cosmic: Tufts                                                                                                                                                                               
#LARCV_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_1eNpfiltered/supera"                                                                                                      
#LARLITE_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_1eNpfiltered/larlite"                                                                                                   

# MCC8.1 nue+MC cosmics: mccaffrey                                                                                                                                                                          
#LARCV_SOURCE="/home/taritree/larbys/data/mcc8.1/nue_1eNpfiltered/supera"                                                                                                                                   
#LARLITE_SOURCE="/home/taritree/larbys/data/mcc8.1/nue_1eNpfiltered/larlite"                                                                                                                                

# MCC8.1 nue-only: Tufts                                                                                                                                                                                    
#LARCV_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_nocosmic_1eNpfiltered/supera"                                                                                             
#LARLITE_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_nocosmic_1eNpfiltered/larlite"                                                                                          

# MCC8.1 numu+cosmic: Tufts                                                                                                                                                                                 
LARCV_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/numu_1muNpfiltered/supera"
LARLITE_SOURCE="/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/numu_1muNpfiltered/larlite"

# MCC8.1 NCpi0: Tufts                                                                                                                                                                                       
# NOT MADE YET                                                                                                                                                                                              

# MCC8.1 Cocktail: Tufts                                                                                                                                                                                    
# NOT MADE YET  
...        

Here, we're going to run the numu+cosmic sample on the Tufts cluster.

Now, make the various input files by running

python make_inputlists.py

This will make two things. One is a folder with a bunch of text files. They are given names based on a FileID number that has been arbitrarily assigned to each file. Inside these text files are the paths to either the larcv or larlite files. The other file that is made is jobidlist.txt. This has a list of the different FileIDs. Make a copy of this and call it rerunlist.txt.

cp jobidlist.txt rerunlist.txt

As we process events, we'll remove finished FileIDs from rerunlist.txt until it is empty. But we need the jobidlist.txt file intact so we know the full set of FileIDs and how many are left to run.
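Assuming one FileID per line in these lists, a quick way to gauge progress is to compare their line counts:

wc -l jobidlist.txt rerunlist.txt   # total FileIDs vs. FileIDs still left to run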

Now we have to modify the submission script to point to the right locations. Open (or create) submit.sh. You'll see the following

#!/bin/bash                                                                                                                                                                                                 
#                                                                                                                                                                                                           
#SBATCH --job-name=tagger_numu                                                                                                                                                                              
#SBATCH --output=tagger_numu_log.txt                                                                                                                                                                        
#SBATCH --ntasks=100                                                                                                                                                                                        
#SBATCH --time=4:00:00                                                                                                                                                                                      
#SBATCH --mem-per-cpu=4000                                                                                                                                                                                  

WORKDIR=/cluster/kappa/90-days-archive/wongjiradlab/grid_jobs/dllee-tufts-slurm
CONTAINER=/cluster/kappa/90-days-archive/wongjiradlab/larbys/images/dllee_unified/singularity-dllee-unified-071017.img
CONFIG=${WORKDIR}/tagger.cfg
INPUTLISTDIR=${WORKDIR}/inputlists
OUTPUTDIR=/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/numu_1muNpfiltered/out_week071017/tagger                                                                                          
#OUTPUTDIR=/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_1eNpfiltered/out_week071017/tagger
#OUTPUTDIR=/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/corsika_mc2/out_week071017/tagger                                                                                                 
JOBIDLIST=${WORKDIR}/rerunlist.txt

module load singularity

srun singularity exec ${CONTAINER} bash -c "cd ${WORKDIR} && source run_job.sh ${CONFIG} ${INPUTLISTDIR} ${OUTPUTDIR} ${JOBIDLIST}"

Note that the lines with #SBATCH are special. These are arguments to the batch submission system, slurm.

You'll most likely have to change the path stored in WORKDIR; this should be the location of the repository you cloned. The other thing to change is OUTPUTDIR, the location where the output files should go. Take care not to put your files in the wrong location or overwrite other people's files. If you updated the container or want to run your own, modify CONTAINER as well.

Note that the tagger configuration is slightly different for MC and data, and in the submit.sh file you need to pick between them. (The example below selects the data configuration.)

# For MC
#CONFIG=${WORKDIR}/tagger.cfg
# For Data
CONFIG=${WORKDIR}/tagger_data.cfg

The parameters for the slurm configuration:

  • output: job messages go into this log file; name it whatever you want.
  • ntasks: the number of jobs to run; one FileID will be processed per job.
  • time: the maximum time the job may run, in HH:MM:SS format. At the end of this time the job will be killed, but the lower the number, the better the priority (I think). Tagger jobs take about 20-30 seconds per event.
  • mem-per-cpu: the amount of RAM requested per CPU (in MB).
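As a rough, illustrative example of picking these values: at 20-30 seconds per event, a file with about 100 events (an assumed size; check your own files) takes roughly 50 minutes, so directives like the following leave comfortable headroom.

#SBATCH --ntasks=100       # one FileID processed per job
#SBATCH --time=2:00:00     # ~100 events x 30 s/event ≈ 50 min; rounded up generously
#SBATCH --mem-per-cpu=4000 # 4 GB of RAM per CPU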

Once this is all correct, you can submit the job. Do this by running,

sbatch submit.sh

To see information about your jobs, type

squeue -u [username]

For example, you might see something like this

          JOBID    PARTITION    NAME   USER     ST       TIME  NODES NODELIST(REASON)
          14973370     batch    tagger twongj01  R      34:08     10 alpha[008-017]

One thing this command will show (in the leftmost column) is the ID number for this job.

To kill your jobs

scancel [job ID number]

Once the job finishes (squeue returns nothing for your username), you should do a basic check to see if all the files were made. First, edit the script check_jobs.py by setting TAGGER_FOLDER to point to the output folder. Then run

./singularity_check_jobs.sh

The script will report the total number of jobs, how many of the launched jobs need to be rerun, and how many jobs remain to be run. Once the latter two numbers are zero, you can proceed to the next step.

On McCaffrey

The same idea as running on the Tufts cluster, but here we're running on a single, many-cored workstation.

Go into the make_inputlists.py file and choose the paths for your sample, this time pointing at the McCaffrey locations.

Make the input files in the same way as above.

Now edit run_mccaffrey_job.sh and change the values of the various variables.

You should now open a screen session; this way you can leave your SSH session and the jobs won't quit (a short sketch follows below). Launch a bunch of jobs using:

run_mccaffrey_job.sh N

where N is the number of jobs. In principle, it's that easy. However, there have been some problems with permissions in the past, and an attempt to fix this has been made. Make sure that your account is part of the larbys group; if it isn't, or you're having trouble, contact TMW.
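A minimal screen workflow might look like this; the session name and job count are just illustrations, and the ./ prefix assumes you are in the cloned scripts directory:

screen -S tagger              # start a named screen session
./run_mccaffrey_job.sh 8      # launch 8 jobs (illustrative number)
# detach with Ctrl-a d, log out if you like, then later reattach with:
screen -r tagger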

SSNET

The basic structure of the scripts for SSNet (and for the remaining steps) is similar to that of the tagger.

On Tufts

First, note that SSNet produces a number of large temporary files, and the home folders on the Tufts cluster are not very large, only about 3 GB. You should make a workspace in the common area, /cluster/tufts/wongjiradlab, and clone the SSNet scripts repo into it using

git clone https://github.com/LArbys/ssnet-tuftscluster-scripts.git

Go into the folder.

There you should edit the file make_inputlists.py and point to the source files, which are now the tagger output. Then run the script:

python make_inputlists.py

When you run the python script, the same types of products emerge: a folder, inputlists, containing the input file lists for SSNet for each FileID, and a jobidlist.txt with the list of FileIDs for all the tagger files. As with the tagger, make a copy of jobidlist.txt and call it rerunlist.txt.
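Just as before:

cp jobidlist.txt rerunlist.txt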

Next, change the job submission script, submit.sh, to point to the right locations. Note that if you prepared your own container using the instructions above, it will not by default contain the information necessary to run SSNet. So for running SSNet keep the Singularity image location shown below and change only WORKDIR, INPUTLISTDIR, OUTDIR, and JOBLIST:

#!/bin/bash                                                                                                                                                                                                 
#                                                                                                                                                                                                           
#SBATCH --job-name=ssnet                                                                                                                                                                                    
#SBATCH --output=ssnet_log.txt                                                                                                                                                                              
#                                                                                                                                                                                                           
#SBATCH --ntasks=100                                                                                                                                                                                        
#SBATCH --time=8:00:00                                                                                                                                                                                      
#SBATCH --mem-per-cpu=4000                                                                                                                                                                                  

CONTAINER=/cluster/kappa/90-days-archive/wongjiradlab/larbys/images/singularity-dllee-ubuntu/singularity-dllee-ssnet.img
WORKDIR=/cluster/kappa/90-days-archive/wongjiradlab/grid_jobs/ssnet-tuftscluster-scripts
INPUTLISTDIR=${WORKDIR}/inputlists
JOBLIST=${WORKDIR}/rerunlist.txt

OUTDIR=/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/numu_1muNpfiltered/out_week071017/ssnet                                                                                              
#OUTDIR=/cluster/kappa/90-days-archive/wongjiradlab/larbys/data/mcc8.1/nue_1eNpfiltered/out_week071017/ssnet

module load singularity
srun singularity exec ${CONTAINER} bash -c "cd ${WORKDIR} && source run_job.sh ${WORKDIR} ${INPUTLISTDIR} ${OUTDIR} ${JOBLIST}"

Remember to set the number of jobs to run by adjusting the --ntasks=N argument, and the time allotted for each job with --time=HH:MM:SS. SSNet is the slowest step by far. One rule of thumb is that 60 events take a little under 2 hours. I typically set things to 5 or 8 hours for such a sample, just in case there are a number of events where a whole image is covered in cROIs.
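Annotating the same two directives with that rule of thumb (the per-file event count here is assumed, not measured):

#SBATCH --ntasks=100      # one FileID processed per job
#SBATCH --time=8:00:00    # ~60 events x 2 min/event ≈ 2 h of processing; 5-8 h leaves margin for cROI-heavy events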

With submit.sh set, run the job using

sbatch submit.sh

When the jobs are done, check them by pointing the script check_jobs.py at the SSNet output folder. Then run

./singularity_check_jobs.sh

As with the tagger, you want to iterate until the number of remaining jobs is 0.

On McCaffrey

[To do.]

Vertex

The pattern of the steps will be very familiar to you by now.

On Tufts

Clone the scripts repository somewhere in your wongjiradlab user folder.

git clone https://github.com/LArbys/ssnet-tuftscluster-scripts.git

Update the make_inputlist.py file to generate input lists and a job ID list text file.

Edit submit.sh to point to the right container, work directory, and output directory.

Run using

 sbatch submit.sh

Check the output using

 ./singularity_check_jobs.sh

Repeat until all the files are processed.

On McCaffrey

[to do]

Tagger Analysis Metrics

Vertex Efficiency Metrics