# Running for H->bb Tagging
Producing the training datasets is complicated by the need to reweight data samples, where the reweighting depends on some information about the input files. Specifically, we need to know how many events each file contains before we write out the final training data.
This means there's an additional `xsec` (ATLAS jargon for cross-section) file that keeps track of the "metadata" needed to calculate the weight we give to a given jet. You can look at this file in `data/xsec.txt`; it contains records of the form:

```
RS_G_hh_bbbb_c10_M300 301488 1.3181E-03 1.0000E+00 79800.0
```

Here the entries are: a "short name" for the dataset (which physicists usually understand), a "dataset ID" (a universal identifier), the physical cross-section for the process, the "filter efficiency" (don't worry about it), and the number of events stored in the local file. The last entry is something we have to calculate ourselves; see below.
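The exact weight formula isn't spelled out on this page, but these columns map onto the usual per-event weight of cross-section times filter efficiency divided by the number of stored events (optionally scaled by a luminosity factor). A minimal sketch, assuming that convention:

```python
# Minimal sketch: turn one xsec.txt record into a per-event weight.
# Assumes the conventional weight = xsec * filter_eff / n_events; the
# exact formula used by the dumper may differ (e.g. a luminosity factor).
record = "RS_G_hh_bbbb_c10_M300 301488 1.3181E-03 1.0000E+00 79800.0"
short_name, dsid, xsec, filter_eff, n_events = record.split()
weight = float(xsec) * float(filter_eff) / float(n_events)
print(f"{short_name} ({dsid}): per-event weight = {weight:.3e}")
```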
## Running over a subset of the files
Note that files which aren't listed here can't be used for training, since we need this metadata to properly reweight the samples. To run over only a subset of all the datasets, run the batch submit script in "test mode", which will build a list of inputs without submitting any jobs:

```
btag-batch-submit.sh -r inputs/ -t
```

where `inputs/` is the directory where the ROOT files are stored.
This should produce a text file called `root-files.txt` in your current directory. You can thin this out with standard unix utilities, e.g. `grep` or `awk`. For example, to select only RS graviton (signal) and dijet (background) samples, you can do

```
cat root-files.txt | egrep '(JZ.W|RS_G)' > root-files.txt.tmp
mv root-files.txt.tmp root-files.txt
```
Alternatively, you can create a new directory containing symlinks to the datasets you want to use and run the batch submission scripts on that.
Producing "Metadata"
Here "metadata" refers to the number of events in each file, along with other information about the data files. To collect this information we run a batch job that produces some diagnostic information alongside the event counts:

```
btag-batch-submit.sh -s btag-batch-run-fatdists.sh
```

This will take the existing list of ROOT files and submit batch jobs that produce histograms and metadata for all of them. Note that the outputs are written to the directory you call this script from, so make sure you're currently on a disk with lots of space!
After these jobs have run, you should see a directory called `runs/output/results` which will contain the output histograms. Now you want to collect the metadata stored in these files:

```
btag-metadata-collect.py <list of files to collect metadata from>
```
This will produce a file called `meta.json`, which is just a lookup table mapping from the dataset ID to the information we need. To merge this with the table that we'll use to produce the final dataset, use the `btag-metadata-to-xsec.py` script:

```
btag-metadata-to-xsec.py -m meta.json -x <path-to-xsec-file>
```
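If you want a quick look at what was collected before merging, `meta.json` is plain JSON keyed by dataset ID. A minimal sketch for inspecting it (the fields inside each record aren't documented here, so they're just printed as-is):

```python
# Minimal sketch: dump the collected metadata, keyed by dataset ID.
# The contents of each record aren't documented here, so we just print them.
import json

with open('meta.json') as meta_file:
    metadata = json.load(meta_file)

for dsid, info in sorted(metadata.items()):
    print(dsid, info)
```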
Collecting metadata is a bit labor-intensive, so we check the "xsec" file into the main repository under `data/`. This file is read by the dataset dumper to calculate the weight given to each jet.
## Checking the output
First of all, you should check the log files to make sure there aren't any errors. The error logs are stored in `runs/output/logs/stderr-*`, where the number in place of `*` corresponds to a line number in the ROOT file list. Some errors seem to be harmless. To filter out only the harmful ones, you can use something like:

```
egrep -v "unknown branch" runs/output/logs/stderr-*
```
## Making Diagnostic Plots
First you'll need to combine some of the HDF5 histogram files. Start by running

```
btag-hadd.py runs/output/results/*.h5 -d hists
```

This will combine all the histograms that come from the same physical process (given by the dataset ID) and store the results in `hists/`. Next you can combine them for plotting, using
```
btag-merge-fatjet-hists.sh -i hists -o hists.h5
```

The combined histograms will be stored in `hists.h5`. Note that if datasets are missing this will throw an error; to get around it, you can use `-f`.
Now you can produce plots from this file with `btag-draw-fatjet-hists.py hists.h5` and `btag-draw-images.py hists.h5`.
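If you want to check what actually made it into the merged file (for example, whether a given process is missing), it's an ordinary HDF5 file and can be browsed with `h5py`. A minimal sketch that makes no assumption about the internal layout:

```python
# Minimal sketch: list every group and dataset path in the merged file.
import h5py

with h5py.File('hists.h5', 'r') as hist_file:
    hist_file.visit(print)
```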
## Dumping Training Data
Datasets are written to `runs/output/results`. Note that if `runs/output` already exists it will be moved to `runs/output-X`, where `X` is the smallest integer which hasn't been taken.
We can dump training data as text or HDF5. We'd like to phase the text format out at some point.
### For the text file format
Dumping the training data is now just a matter of running
```
btag-batch-submit.sh -s btag-batch-run-fatdump.sh -r inputs/
```

which is identical to the command used to build the metadata histograms, except that `fatdist` has been replaced with `fatdump`.
### For the HDF5 format
You can write the training data to HDF5 by running
```
btag-batch-submit-write.sh -r inputs/
```

This time no `-s` is needed because there's only one script that's run internally. The jobs are launched one per dataset (unlike the text file version of this command, which launches multiple jobs per dataset), so some of the larger jobs will take a while (an hour or two).
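Once the jobs finish, the resulting HDF5 files can be inspected with `h5py`. A minimal sketch; the file name below is a placeholder, and nothing is assumed about the dataset names inside:

```python
# Minimal sketch: summarise the datasets in one HDF5 training output.
# The file name is a placeholder; pick any file from runs/output/results.
import h5py

def describe(name, obj):
    # Print the path, shape, and dtype of every dataset in the file.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File('runs/output/results/some-dataset.h5', 'r') as training_file:
    training_file.visititems(describe)
```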