Reconstruction & lessons learned from Retro - philippeller/freeDOM GitHub Wiki

Priors

  • A general implementation of priors encompasses the concept of enforcing boundaries on the parameter space the optimizer explores
    • Assume the optimizer operates on an N-dimensional hypercube, and each dimension's domain is [0, 1]
    • Any practical "shape" of boundaries can be parameterized by this
  • Can be defined relative to / based upon an existing (or a to-be-done) reconstruction of the event
  • Can be absolute
  • Should make it easy to construct arbitrary prior shapes based on MC studies
  • Prior transformation allows changing the space, e.g. energy is sampled in log-space (optimizer always works in linear [0, 1], but a prior transform can map that onto a logarithmic space in, say, [0.1, 1000] GeV)
  • Possibly want to make prior transformation Numba gufuncs, as these can operate just as well on 1 set of params or N sets of params (in the batch-LLH-request case), and can be called without overhead from other numba nopython code (and, in turn, nopython functions can be told to not hold the GIL)
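A minimal sketch of the prior-transform idea above: the optimizer works in linear [0, 1], and a transform maps that onto a log-uniform energy range. The bounds and function name are illustrative; in practice this could be compiled as a Numba gufunc as suggested above.

```python
import numpy as np

# Hypothetical bounds for a log-uniform energy prior (GeV)
E_MIN, E_MAX = 0.1, 1000.0

def energy_prior_transform(u):
    """Map unit-cube samples u in [0, 1] to energies in [E_MIN, E_MAX],
    uniform in log-space. Works on a scalar or an array of samples."""
    log_lo, log_hi = np.log10(E_MIN), np.log10(E_MAX)
    return 10.0 ** (log_lo + np.asarray(u) * (log_hi - log_lo))
```

With Numba, the same body could be wrapped in `@numba.guvectorize` so it handles 1 or N parameter sets with no Python-level overhead.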

Optimization

  • Make it easy to swap out different optimizers
    • Problem is that almost all take different inputs, produce different outputs and metadata
  • 1-LLH-at-a-time optimizers
    • crs
    • multinest
    • scipy
    • nlopt
    • pyminuit
    • etc.
  • Batches-of-LLH's-at-a-time optimizers
    • gridsearch
    • pseudo-/quasi-random sampling
    • simulated annealing
    • genetic algorithms
    • etc.
  • Make it possible to nest optimizers -- divide the parameter space into nested optimization problems. E.g., an outer optimizer chooses parameters (p0, p1). An inner optimizer then chooses (p2, p3) with (p0, p1) fixed to the values picked by the outer optimizer. An inner-inner optimizer then chooses p4 with (p0, p1, p2, p3) set to the values chosen by the other optimizers, etc.
    • Generalizes the idea of Pegleg-ing track segments to reduce dimensionality "load" on any one optimizer.
    • Note that justification for Pegleg-ing is not just dimensionality reduction, but also its speedup because it reuses what's already been computed from previous track segments; this won't be the case for freeDOM
  • All of this has to be quite fast in order to get the most out of our fast LLH evaluation!
    • Numba now has a context manager which allows running object-mode code within otherwise nopython code. I haven't used this before, but it might make it possible to write "fast" code that is also able to use arbitrary optimizers with pure-Python interfaces...
  • We may evaluate tons of LLH's, so keeping all of them might be a memory hog, plus having too many (useless) points makes post-processing take longer
    • Define the (LLH+parameters)-storage pruning strategy up front
      • keep all points
      • keep at most N best LLH's
      • keep only LLH's within X LLH-units of peak LLH
      • other strategies?
  • We have a global LLH offset to keep LLH's negative (otherwise things break). We can define such a thing up front, as a parameter we can supply for any reco.
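A minimal pure-Python sketch of the nested-optimizer idea above, using grid searches as stand-ins for real optimizers (the function name and the toy objective are made up for illustration):

```python
import itertools

def nested_grid_minimize(neg_llh, outer_grid, inner_grid):
    """Toy nested optimization: for each outer point (p0, p1), an inner
    grid search picks the best p2 with (p0, p1) held fixed. Returns the
    overall best (p0, p1, p2) and its negative LLH."""
    best = (None, float("inf"))
    for p0, p1 in outer_grid:
        # The inner optimizer sees only p2; the outer params are frozen
        p2 = min(inner_grid, key=lambda p2: neg_llh(p0, p1, p2))
        val = neg_llh(p0, p1, p2)
        if val < best[1]:
            best = ((p0, p1, p2), val)
    return best

# Toy "negative LLH" with its minimum at (1, 2, 3); purely illustrative
toy = lambda p0, p1, p2: (p0 - 1) ** 2 + (p1 - 2) ** 2 + (p2 - 3) ** 2
grid = [0, 1, 2, 3]
params, val = nested_grid_minimize(toy, itertools.product(grid, grid), grid)
```

Real inner/outer optimizers would of course be smarter than grids, but the wiring -- outer params frozen while the inner problem is solved -- is the same.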
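One of the pruning strategies listed above -- keep at most N best LLH's -- can be sketched with a min-heap (class name hypothetical):

```python
import heapq

class BestNStore:
    """Keep only the N best (highest-LLH) points seen during a fit.
    A min-heap of (llh, params) tuples lets us evict the worst kept
    point in O(log N) whenever a better one arrives."""
    def __init__(self, n):
        self.n = n
        self._heap = []  # heap[0] is the worst point currently kept

    def push(self, llh, params):
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (llh, params))
        elif llh > self._heap[0][0]:
            heapq.heapreplace(self._heap, (llh, params))

    def best(self):
        return sorted(self._heap, reverse=True)  # best LLH first
```

The keep-within-X-of-peak strategy is trickier to do online, since the peak moves as the fit progresses; it may be easier to apply as a final filter.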

Analysis of a fit

  • Note that time spent doing this post-processing can be quite significant, and might actually dominate! Generally, the more carefully-done / more-useful the analysis, the longer it takes to run.
    • If we're using GPU's, can we get the GPU to store the (llh, parameters) points and perform posterior analysis there?
  • Max-LLH value found (this should apply for all optimizers)
  • Average LLH of final set of "live" points (or best-N points) for a reco (applies to e.g., MultiNest, CRS)
    • Expect this to be more robust than the max-LLH, and to provide better / less noisy comparisons, both data-vs-MC and reco1(random seed x) vs. reco2(random seed y)
  • Analyze distribution of (parameters, LLH)-space
    • Mean, median, lower bound, upper bound in each dimension (or treat connected spherical dimensions simultaneously) of all points within some Delta-LLH of max-LLH
    • Paraboloid fit?
  • For some optimizers, analyzing the distribution of LLH's is not useful at all
  • Some optimizers return covariance matrices, Hessians, etc. which we should keep for each event
    • Some of these are useful, some apparently are less so... but we should keep them around in either case
  • E.g. MultiNest can return some of its own posterior analysis stats, but has still been useful to do our own
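The per-dimension summary described above (mean / median / bounds of all points within some Delta-LLH of the max) can be sketched with plain NumPy. The function name and the delta default are illustrative, and connected spherical dimensions would need circular statistics instead of these naive per-axis stats:

```python
import numpy as np

def summarize_within_delta(llhs, params, delta=15.0):
    """Summarize the parameter distribution over all sampled points whose
    LLH lies within `delta` of the best LLH found. `llhs` has shape (M,);
    `params` has shape (M, D)."""
    mask = llhs >= llhs.max() - delta   # points near the peak
    sel = params[mask]
    return {
        "max_llh": llhs.max(),
        "mean": sel.mean(axis=0),
        "median": np.median(sel, axis=0),
        "lower": sel.min(axis=0),
        "upper": sel.max(axis=0),
    }
```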

Input, output data and storage

  • Via i3cols, it is easy to divide up events arbitrarily and have multiple processes populate the same output array (so long as the output array isn't created ~simultaneously by multiple workers)
  • Extract events from i3 files once, and then we can rsync only the data needed for performing recos (any existing reco(s) that priors are based upon, the pulse series used, ...?)
  • Define numpy dtypes of all things that will be written ahead of time (except e.g. JSON outputs), such that properly memory-mapped array(s) can be initialized on disk for a set of events at the outset
    • "Global" metadata about reco (i.e., information that pertains to all reconstructed events in the column)
      • What is the "generic" priors specification used?
        • e.g.: "shape A and relative to X reco if it is present, else shape B relative to Y reco if it is present, else shape C defined in absolute coordinates"
      • What optimizer was used?
      • What "global" settings were used for that optimizer (specify even "defaults," as sometimes those are changed over time)
      • What are all relevant software versions used?
        • Python
        • numpy
        • numba
        • llvmlite
        • freeDOM
        • optimizer
      • What host(s) were recos run on?
        • What CPU make/model is used?
        • OS
      • What "global" LLH setup is used?
        • Identifier for how LLH is obtained
          • Leave it general enough that we could say "tables"?
          • Which NN model(s) were used?
            • Is there a good way to summarize these?
              • Can we start building a "library" of models, so we can uniquely identify them (and not just rely on file names)?
          • LLH's evaluated on CPU(s) or GPU(s)?
          • What CPU architecture? (always record)
          • What GPU architecture(s), CUDA versions, etc? (only record if running NN's on GPU)
          • It's possible that some events will be reco'd on GPU and others on CPU. Recording this per-event seems onerous, but until we know the results are identical for both, we might want to store it per-event
          • etc.
    • Per-event metadata
      • Number of LLH evaluations
      • Optimizer run time
      • Optimizer-specific metadata
        • Number of steps taken
        • Convergence criteria
        • ...
    • Posterior analysis / fit information
      • some will be optimizer specific
      • See discussion above
    • i3cols does not yet support--but should support--writing metadata about a column. This would allow everything that's common about how the reco was set up and run to be recorded just once, and not repeated in the fields of each event's reco (which would be extremely inefficient and too slow)
      • There should always be universal metadata that applies to all events
      • There might also be some metadata that applies to some subset of the events, and metadata that applies to a bunch of other events. How should we handle this?
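A sketch of the define-dtypes-up-front idea: a structured dtype is declared once, and a memory-mapped output array is initialized on disk at the outset, so worker processes can later open it in `r+` mode and fill disjoint event slices. The record fields, shapes, and paths are hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical per-event reco output record; all fields defined up front
reco_dtype = np.dtype([
    ("max_llh", np.float32),
    ("params", np.float32, 5),      # e.g. (x, y, z, t, log10_energy)
    ("n_llh_evals", np.int32),
    ("run_time_s", np.float32),
])

n_events = 1000
path = os.path.join(tempfile.mkdtemp(), "reco.npy")

# Create the memory-mapped output array once; workers then open the same
# file with mode="r+" and each write only their own range of events
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=reco_dtype, shape=(n_events,)
)
out["max_llh"][:10] = -123.0  # e.g. one worker filling its slice
out.flush()
```

Because the file is a standard `.npy`, any process (including post-processing) can read it back with `np.load`, with no reco-time code required.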