Preparing config file - jniedzie/SVJanalysis_wiki GitHub Wiki

The information about training and evaluation of the ML models is stored in a Python config file. In most cases you only need to make a copy of the default config closest to your use case and adjust it. You can see all existing config files here. The code inside the config should be self-explanatory.

Here is a detailed description of the config file:

TODO!

Inside the config you will find settings that are mandatory and used by the general Trainer and Evaluator classes, as well as additional settings that will be passed to your specialized Trainer/Evaluator class. Below you'll find an explanation of the most important parts of the config file.

  • Some models are trained on both background and signal, some on background data only. In order for the training/evaluation scripts to build the correct input and output paths, set this flag to true if signals are used for training:
train_on_signal = True
  • Next, we need to specify a few general settings for the Trainer class, which don't depend on the architecture. Don't change the name of this variable or its keys. Point to the specialized Trainer class you want to use, specify the validation and test data fractions, whether jets' high-level features and EFP variables should be used, and which high-level features should be excluded from the training:
training_general_settings = {
    "model_trainer_path": "module/architectures/TrainerBdt.py",
    "validation_data_fraction": 0.0,
    "test_data_fraction": 0.2,
}
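The "model_trainer_path" entry points at a Python file rather than a module name, which suggests the class is loaded dynamically. A minimal sketch of how such a path could be resolved to a class with the standard-library importlib (the framework's actual loading mechanism may differ; `load_trainer_class` and the throwaway demo module are hypothetical):

```python
import importlib.util
import os
import tempfile

def load_trainer_class(path, class_name):
    """Dynamically load `class_name` from the Python file at `path`."""
    spec = importlib.util.spec_from_file_location("model_trainer", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)

# Demo with a throwaway trainer module written to a temporary directory:
demo_path = os.path.join(tempfile.mkdtemp(), "TrainerBdt.py")
with open(demo_path, "w") as f:
    f.write("class TrainerBdt:\n    pass\n")

trainer_cls = load_trainer_class(demo_path, "TrainerBdt")
```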
  • Select the number of models to be trained when the training script is run:
n_models = 100
  • Set all the input and output paths. output_path is the base directory for the results. summary_path is where text files with information about each training run are stored. results_path is where models and model weights are stored, together with any other training output, such as the loss in each epoch if a corresponding callback is specified for the model. AUCs_path is used to store the calculated areas under ROC curves. Finally, stat_hists_path is where the ROOT file containing histograms used for the statistical analysis will be put:
qcd_path = "../../data/backgrounds/qcd/h5_no_lepton_veto_fat_jets_dr0p8/base_{}/*.h5".format(efp_base)
signals_base_path = "../../data/s_channel_delphes/h5_no_lepton_veto_fat_jets_dr0p8/"

output_path = "trainingResults_test/bdt/"
summary_path = output_path+"summary/"
results_path = output_path+"trainingRuns/"
AUCs_path = output_path+"aucs/"
stat_hists_path = output_path+"stat_hists.root"
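Whether the scripts create these directories for you may depend on the version of the code; if they don't, the layout implied by the path variables above can be prepared up front with the standard library (sketch below uses a temporary base directory instead of the literal "trainingResults_test/bdt/" so it doesn't clutter the repo):

```python
import os
import tempfile

# Stand-in for output_path from the config:
output_path = os.path.join(tempfile.mkdtemp(), "bdt/")
summary_path = output_path + "summary/"
results_path = output_path + "trainingRuns/"
AUCs_path = output_path + "aucs/"

# Create the directory tree; exist_ok avoids errors on reruns.
for directory in (summary_path, results_path, AUCs_path):
    os.makedirs(directory, exist_ok=True)
```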
  • training_params is a dictionary that should contain any parameters used to build/compile/fit the model:
training_params = {
    "algorithm": "SAMME",
    "n_estimators": 800,
    "learning_rate": 0.5,
}
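The keys in training_params line up with the constructor arguments of the model the specialized trainer builds (for a BDT they match, e.g., scikit-learn's AdaBoostClassifier), so the dict is typically unpacked with **. A minimal stand-in sketch (DummyModel is hypothetical, used here only to show the unpacking):

```python
# The keys of training_params must match the model constructor's
# argument names, because the dict is unpacked with ** at build time.
class DummyModel:
    def __init__(self, algorithm, n_estimators, learning_rate):
        self.algorithm = algorithm
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

training_params = {
    "algorithm": "SAMME",
    "n_estimators": 800,
    "learning_rate": 0.5,
}
model = DummyModel(**training_params)
```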
  • Use the norm_type field to specify which scaler should be used to normalize the input data. Possible options and parameters of the different scalers (which can be modified) are listed below the norm_type field, in the normalizations dict:
norm_type = "StandardScaler"
normalizations = {...}
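A possible shape of the normalizations dict, with entries mirroring scikit-learn scaler arguments (the keys and defaults shown here are illustrative; the real options live in the config), and how the chosen entry becomes norm_args:

```python
# Illustrative: maps each scaler name to the keyword arguments it takes,
# e.g. StandardScaler(**norm_args) when sklearn scalers are used.
normalizations = {
    "StandardScaler": {"with_mean": True, "with_std": True},
    "MinMaxScaler": {"feature_range": (0, 1)},
}

norm_type = "StandardScaler"
norm_args = normalizations[norm_type]
```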
  • Next, specify the base output file name. In case several models are trained, a version number will be appended automatically, but it may be useful to include some information about the training type in the file name, for instance the EFP base if you want to compare results for different bases. Alternatively, you can store them in different directories (see output_path above):
file_name = "hlf_eflow_{}".format(efp_base)
  • The training_settings dict will be passed (and unwrapped) to your specialized Trainer class, so put here any information you want to use for loading and normalizing the data, and for building and fitting the model. Some values will be added to this dict by the training/evaluation scripts, e.g. the path to the signal if you specified that you want to use it:
training_settings = {
    "qcd_path": qcd_path,
    "training_params": training_params,
    "EFP_base": efp_base,
    "norm_type": norm_type,
    "norm_args": normalizations[norm_type],
}
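To illustrate the unwrapping, here is a minimal stand-in Trainer (hypothetical; the real signature lives in module/architectures/TrainerBdt.py) receiving the dict plus an extra value of the kind the scripts add, such as a signal path:

```python
# Stand-in showing that training_settings keys must match the Trainer's
# constructor arguments; **extra catches values added by the scripts.
class Trainer:
    def __init__(self, qcd_path, training_params, EFP_base,
                 norm_type, norm_args, **extra):
        self.qcd_path = qcd_path
        self.training_params = training_params
        self.extra = extra

# Placeholder values standing in for the config variables above:
training_settings = {
    "qcd_path": "qcd/*.h5",
    "training_params": {"n_estimators": 800},
    "EFP_base": 3,
    "norm_type": "StandardScaler",
    "norm_args": {},
}
trainer = Trainer(**training_settings, signal_path="signal/*.h5")
```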