
Running for single b-tagging

For a list of ongoing projects see this Google doc.

Preprocessing for single b-tagging is much simpler than preprocessing for H->bb tagging. All you have to do is submit one job per dataset and wait for the result, which should take a few hours (maximum).

Note: all the scripts described here have a help flag (-h). Use it if you get confused!

Submitting the jobs

It should be sufficient to download the dataset you need, go to a clean directory, and run

btag-batch-submit.sh -s -r <path-to-directory-with-datasets>

where the path should point to the directory containing all the dataset directories (i.e. the directories that look like group.perf.*_Akt4EMTo/). This will submit one job for each dataset. Relative to the directory you submit from, the output should be in ./runs/output/results/ while the log files should be in ./runs/output/logs/.

As it stands we write one HDF5 file per dataset. If this is a problem we could consider splitting it.

Examining the Output

First check the log files. In general cat runs/output/logs/stderr* should print nothing (there should be no errors) and cat runs/output/logs/stdout* should say something about submitting the job and end with "done".

Assuming the logs are fine, you can check the contents of the output file with h5ls -v runs/output/results/<something>.h5, where <something> is the DSID. This should print the contents of the file. There's also a tab-completion script that you can source from your ~/.bashrc to make this command a bit more friendly.
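
If you'd rather inspect the file from Python, something like the following should also work (a minimal sketch assuming the h5py package; the file name is a placeholder for your DSID):

import h5py

# open one of the output files and print each dataset's shape and variables
with h5py.File('runs/output/results/SOMETHING.h5', 'r') as h5file:  # your DSID here
    for name, dataset in h5file.items():
        print(name, dataset.shape)
        if dataset.dtype.names:  # compound datasets carry one field per variable
            print('  variables:', ', '.join(dataset.dtype.names))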

Producing Text Files

I'd recommend working with the HDF5 files directly. If text files are useful for some reason, though, you can produce them in the "one JSON entry per line" format that we've used in the past by running

./scripts/btag-dump-hdf2text.py <hdf5-file> --batch-size 7

This will dump all the variables for the first 7 jets. You can also dump a subset of the variables with the -t and -j flags, and list all the variables in the file with -d. Without the --batch-size argument it will dump all jets.

This script might also serve as a useful example of how to read in the information in the dataset.
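
For instance, a minimal sketch of reading the two datasets directly (assuming the h5py package; the dataset and variable names are described in the tables below):

import h5py

with h5py.File('jets.h5', 'r') as h5file:
    jets = h5file['jets'][:1000]      # structured array, one entry per jet
    tracks = h5file['tracks'][:1000]  # matching rows of zero-padded tracks

print(jets['pt'])          # jet pt, shape (1000,)
print(jets['LabDr_HadF'])  # truth flavor labels
print(tracks['d0sig'])     # track d0 significance, shape (1000, 40)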

Running in batches

In practice you probably want to dump in smaller batches and gzip the output. Assuming your HDF5 file is called jets.h5, it would be something like this:

btag-dump-hdf2text.py jets.h5 --batch-size 2000 --offset ${BNUM} | gzip - > data${BNUM}.gz

where ${BNUM} should be an integer which should be incremented by your batch system. One way to do this is to write a shell script wrapper, something like

#!/bin/bash

# print some information to keep track of what happened
echo "submit from $SLURM_SUBMIT_DIR, array index $SLURM_ARRAY_TASK_ID"

# move to the directory you submitted from
cd "$SLURM_SUBMIT_DIR"

# make a directory for the outputs (if there isn't one already)
mkdir -p outputs

BNUM=$SLURM_ARRAY_TASK_ID
btag-dump-hdf2text.py jets.h5 -b 2000 -o ${BNUM} | gzip - > outputs/data-${BNUM}.gz

Assuming this is called run.sh, this can be submitted with

sbatch -a 1-10000 -p atlas_all -c 2 -t 1:00:00 run.sh

Here -a submits an array of jobs, which run in parallel with different $SLURM_ARRAY_TASK_IDs assigned. This example will submit 10000 jobs numbered 1 to 10000. Note that the total number of jets processed will be at most (n jobs) * (batch size); trying to process more will just launch a lot of jobs that end immediately. The -p argument specifies the batch queue, and -t gives the maximum run time (walltime). A longer walltime will give your job more time to finish but will also lower its priority in the queue. The -c argument specifies how many cores to reserve: technically we're only using one core, but we reserve two because of memory and IO limits on the nodes.
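
If you want the array size to match the number of jets in the file, you can count them first; a rough sketch (assuming the h5py package):

import math
import h5py

# count the jets and work out how many batches of 2000 are needed
with h5py.File('jets.h5', 'r') as h5file:
    n_jets = h5file['jets'].shape[0]

batch_size = 2000
n_jobs = math.ceil(n_jets / batch_size)
print('sbatch -a 1-{} -p atlas_all -c 2 -t 1:00:00 run.sh'.format(n_jobs))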

Once all the jobs have run you can use cat to combine them (concatenated gzip files form a valid gzip stream), i.e.

cat outputs/data-*.gz > all-data.gz

What are all these variables?

The output file contains two datasets: jets and tracks. The jets dataset contains one entry per jet, i.e. the format is suitable for simple feed-forward networks and other more "traditional" discriminants. The tracks dataset is 2D, with the first index referencing the jet number and the second referencing the track number, where tracks are sorted by absolute d0 significance (more on this later).

The track dataset is zero-padded out to 40 tracks (in practice we'll almost never have this many). Each track has a boolean variable called mask which is True if the track is padding.
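
A minimal sketch of using the mask to pick out the real tracks for one jet (assuming the file is read with the h5py package):

import h5py
import numpy as np

with h5py.File('jets.h5', 'r') as h5file:
    tracks = h5file['tracks'][:100]

jet_number = 0
padding = np.asarray(tracks['mask'][jet_number], dtype=bool)
real_d0sig = tracks['d0sig'][jet_number][~padding]  # drop the padded entries
print(real_d0sig)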

The variables themselves are described below.

Jet Properties (in jets)

The jet properties described here are relevant to every jet we classify in some way. They describe the location of the jet in our detector:

  • pt: jet momentum transverse to the beam line
  • eta: "pseudorapidity": something like a momentum along the beam line

Truth Labels (in jets)

The single b-tagging samples include both signal and background jets mixed into a common file. The way to distinguish them is by the "flavor label". There are several such variables:

  • LabDr_HadF: the name is cryptic because there are a number of ways to assign a flavor label, but as a rough rule, use this one as the label.
  • truthflav: older version of LabDr_HadF. The differences are minor, but don't use this one.

Note that neither of the truth flavor variables should be used in classification since this would be cheating and we don't have labels in real data.

There are several flavors of jets we're interested in, but the simplest case is a binary classifier where flavor == 5 is "signal" and everything else is background. The more flexible approach is to create one output node for each flavor, which can be any of the labels in the table below (see the sketch after the table):

label  name        abb  typical fraction  typical categorization
0      light jet   u    65%               background
4      charm jet   c    5%                background (unless charm tagging)
5      bottom jet  b    26%               signal (unless charm tagging)
15     tau jet     tau  3%                background (tau discriminants are more complicated)
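
A minimal sketch of both labeling schemes (the helper names here are made up; assumes numpy):

import numpy as np

def to_class_index(flavor_labels):
    # map LabDr_HadF values {0, 4, 5, 15} to dense indices {0, 1, 2, 3}
    lookup = np.zeros(16, dtype=int)
    for index, flavor in enumerate([0, 4, 5, 15]):
        lookup[flavor] = index
    return lookup[np.asarray(flavor_labels)]

def to_binary(flavor_labels):
    # the simple case: b-jets are signal, everything else is background
    return (np.asarray(flavor_labels) == 5).astype(int)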

After training a multi-class discriminant, one common approach has been to combine these outputs into a single discriminant using something like

discriminant = log( p_signal / sum_i (f_i * p_i) )

where the sum runs over all the backgrounds and the f_i factors can be adjusted according to the expected background composition in whatever data sample we're interested in. Some example fractions are given in the table above, but note that these are very dependent on the other selection criteria we apply to our data.
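
As a concrete sketch, a b-tagging discriminant against light and charm backgrounds might look like this (assumes numpy; the default fraction values are illustrative, not official):

import numpy as np

def b_discriminant(p_b, p_c, p_u, f_c=0.1, f_u=0.9):
    # log( p_signal / sum_i f_i * p_i ), with b as signal and
    # charm and light jets as the backgrounds
    return np.log(p_b / (f_c * p_c + f_u * p_u))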

Top Level Variables (in jets)

The top level variables are the highest level of discriminant that we want to compare our final tagger with. At the moment there's only one of these:

  • mv2c10: This is a BDT which is based on all the "high level" variables described below.

NOTE: we do not want to use these as inputs except for sanity checks. We should be able to beat this with a sufficiently well tuned NN or BDT.

High Level Variables (in jets)

Some of these "high level" variables should be used as inputs, while there are others that we'd like to replace (with low level variables).

  • ip{2,3}d_p*: these are likelihoods from a naive Bayes classifier (IP3D or IP2D; the latter isn't used much any more). In general they serve as a benchmark which we should be able to beat using track information. They can also be used as inputs to the top level algorithms.
  • ip3d_ntrk: number of tracks used by IP3D.
  • rnnip_p{u,c,b,tau}: outputs from a recurrent neural network based on the IP3D inputs. Using only track variables, we should be able to beat this. We call this network RNNIP.
  • mu_*: "soft" muon variables. ATLAS has a specific subsystem just for muons, and muons can be useful for finding b-hadrons. We should use these as inputs to a top level algorithm.
  • jf_*: these are high-level outputs from "JetFitter" which reconstructs several vertices in a b-hadron decay. These variables summarize things like invariant mass (m) and displacement of the reconstructed vertices, and the topology of the reconstructed decay.
  • sv1_*: these are similar to the JetFitter variables, but SV1 only fits a single secondary vertex with a simpler algorithm.

Low Level Variables (in tracks)

The low level variables are per-track. Some of them are used as inputs to IP3D and RNNIP. Note that "modeling" (how well simulation represents data) is an important consideration here: variables which are badly modeled should either be avoided, or we'll have to come up with a good way to mitigate the mismodeling. A sketch computing the derived kinematic quantities appears after the list below.

  • pt: the track pt (momentum transverse to the beam line). This is not used in IP3D or RNNIP over concerns about how well simulation describes this variable in data.
  • deta: the difference in eta between the track and the jet, i.e. (track eta) - (jet eta)
  • dphi: the difference in the phi between track and jet (phi is the azimuthal angle around the beam axis)
  • dr: hypot(dphi, deta). This is used as an input for RNNIP. Tends to be well modeled by simulation.
  • ptfrac: (track pt) / (jet pt). This is an input for RNNIP, but isn't particularly well modeled.
  • grade: the IP3D track "category". This is a way of summarizing a lot of track quality information into a single integer that runs between 0 and 13. Used in IP3D. Also used in RNNIP but isn't particularly well modeled.
  • d0: the amount by which the track "misses" the interaction point, in the transverse to beam axis direction. We're interested in this quantity because tracks that miss the interaction point are more likely to come from a displaced vertex.
  • z0: same as d0, but along the beam axis.
  • d0sig and z0sig: the significance of d0 or z0, i.e. the parameter divided by its uncertainty. For example, if d0sig = 1 the track is one sigma (1 standard deviation) away from hitting the collision point, assuming the only source of spread is imprecisely measured tracks. Note that this spread only accounts for tracking precision: we have many tracks from displaced secondary vertices which are many sigma from the collision point. These are used in RNNIP and IP3D.
  • *_ls: "lifetime-signed" versions of the {d,z}0* variables. If the point of closest approach to the interaction point is in front of the interaction point with respect to the jet axis these are positive; otherwise they are negative. In general, positively lifetime-signed tracks are more likely to come from a b-decay, but this signing convention may also wash out other information, since the non-lifetime-signed versions also have a sign which is physically meaningful.
  • chi2: the chi2 for the track fit to the "hits" along the track. Bigger values mean a badly measured track.
  • ndf: number of degrees of freedom for the final track fit. ATLAS fits tracks using an iterative chi2 fitter where additional hits are masked off with each successive fit.
  • numberOf* and expect*: these are "track quality" variables which are summarized in grade. We probably don't need to use grade since these contain all the same information (and potentially more).
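
As a reference for the derived kinematic quantities above, here is a minimal sketch of how they could be computed from raw track and jet kinematics (assumes numpy; this is an illustration, not the code used to produce the files):

import numpy as np

def derived_track_variables(trk_pt, trk_eta, trk_phi, jet_pt, jet_eta, jet_phi):
    deta = trk_eta - jet_eta
    # wrap the phi difference into [-pi, pi]
    dphi = np.arctan2(np.sin(trk_phi - jet_phi), np.cos(trk_phi - jet_phi))
    dr = np.hypot(dphi, deta)
    ptfrac = trk_pt / jet_pt
    return deta, dphi, dr, ptfrac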