Thoughts on possible improvements

Training

  • Train again on the Hss data for much longer. To use this data for training you need to set limit_train_batches in train_lightning1 to 3200*100*0.99/40, i.e. (number_of_files * number_events_per_file * train-val-split) / (number_gpus * batch_size):
    python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/011024_Hcard/pf_tree_{1..3200}.root --data-config config_files/config_hits_track_v2_noise.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_noise.py --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v9_dr01/ --num-workers 0 --gpus 0,1,2,3 --batch-size 10 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-displayname drlog_alltracksbeta1 --wandb-projectname mlpf_debug --wandb-entity ml4hep --frac_cluster_loss 0 --qmin 3 --use-average-cc-pos 0.99 --tracks --train-val-split 0.99 --prefetch-factor 16
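As a sanity check for the formula above, here is a minimal sketch of the limit_train_batches computation; the numbers are the ones from this Hss run (3200 files, 100 events per file, 0.99 train/val split, 4 GPUs, batch size 10) and should be adjusted for other runs.

```python
# Sanity check: limit_train_batches = (files * events_per_file * split) / (gpus * batch_size)
number_of_files = 3200
events_per_file = 100
train_val_split = 0.99
number_of_gpus = 4
batch_size = 10

limit_train_batches = int(number_of_files * events_per_file * train_val_split
                          / (number_of_gpus * batch_size))
print(limit_train_batches)  # 7920 training batches
```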

To test the improvement, evaluate on the eval dataset with the command below and then run src.evaluation.refactor.plot_results.py:
    python -m src.train_lightning1 --data-test /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/011024_Hcard_eval/pf_tree_{4001..4200}.root --data-config config_files/config_hits_track_v4.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_Hss/ --wandb-displayname eval_gun_drlog --num-workers 0 --gpus 3 --batch-size 5 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.1 --condensation --log-wandb --wandb-projectname mlpf_debug_eval --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --load-model-weights /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/E_cor_Hss/E_PID_02122024_dr05_s6500_3layer_pid_GTClusters_all_classes_PID_epoch0step4500.ckpt --freeze-clustering --predict --regress-unit-p --PID-4-class --n-layers-PID-head 3 --separate-PID-GATr

  • Add an objectness score to tell apart fakes; maybe the model we have, with the fakes removed, could already give better performance? The best performance achievable by removing all fakes is shown in /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/gun_drlog_v9_99500_hbdscan__v3_130225_v1_400_15_12_005_RemovedFakes/, which reduced the mass resolution from 0.0529 to 0.0504 (maybe not that much). This is tested with the plot_results.py script using --preprocess remove_fakes; a sketch of how an objectness cut could be applied at evaluation time is shown below.
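A rough sketch of how an objectness working point could be applied at evaluation time. The dataframe layout and the column names (objectness, matched) are assumptions for illustration, not the actual schema used by plot_results.py.

```python
import pandas as pd

def apply_objectness_cut(pred: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep only reconstructed candidates whose objectness score passes the cut.

    Hypothetical columns: 'objectness' = per-candidate real-vs-fake score.
    """
    return pred[pred["objectness"] >= threshold].copy()

def remove_all_fakes(pred: pd.DataFrame) -> pd.DataFrame:
    """Upper bound analogous to --preprocess remove_fakes: drop every candidate
    that is not matched to a truth particle ('matched' is a hypothetical column)."""
    return pred[pred["matched"]].copy()
```

Comparing the mass resolution after apply_objectness_cut against the remove_all_fakes upper bound (the 0.0529 vs 0.0504 numbers above) would show how much of that gap a learned objectness score actually recovers.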

  • Increase the number of events. The current models are trained with ~300k events and I am trying a model with ~800k events, but the models proposed in MLPF that improve over the baseline only start improving at around 500k events, and the original paper was trained with millions of samples. The results of the dr 0.5 dataset on the training with 800k events are here: /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/gun_drlog_v9_99500_hbdscan__v3_130225_v1_400/ The results for the training with 300k events (evaluated with /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v7/_epoch=5_step=42000.ckpt) are here: /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/drlog_v7_5_42000_hdbscan_gun_05_130225_400_8_8_01 The mass resolution as well as the other metrics look very similar, so I am not sure training with more data helped very much. What does seem to help is training for a longer time: /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/_gun_drlog_v9_dr01_4_50000_hdbscan_Hss_400_8_8_01/

  • Train with a dataset that is more collimated. I also tried this by creating a gun with dr 0.1-0.2 and training together with data from dr 0.25-0.5, but the results also look very similar (evaluated on the 0.5 data):
    0.25-0.5: /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/gun_drlog_v9_99500_hbdscan__v3_130225_v1_400/
    0.1-0.2 + 0.25-0.5: /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/eval_comp_05/drlog_v9_01_31500_hdbscan_gun_05_130225_400_8_8_01/s

  • Increase the model size: our model has 1.3M parameters while the MLPF model has 25M. A larger model is harder to train, but it should also be more expressive.
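A quick way to check the parameter count when scaling the model up; the build_model call is only a placeholder for however the network-config wrapper instantiates the network.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameters in millions (the 1.3M vs 25M comparison above)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Usage (model construction depends on the wrapper, shown here as an assumption):
# model = build_model(config)
# print(f"{count_parameters(model):.1f}M parameters")
```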

  • Memory handling and related optimizations. MLPF manage to train on 1 GPU with a batch size of 256 and 22M events in 80 h on an 80 GB A100. This corresponds to a batch size of 128 on a 40 GB card; we are currently at around 20 events per batch, but we work on hits. Their events have on the order of 100 particles, so ~200 nodes per graph and ~25600 nodes per batch.
    Our events have on the order of 6000 nodes, giving ~60000 nodes per batch, so the totals are actually not that different; the small size of their events is what lets them reach the higher batch sizes. Still, the model could be larger.
    So 1M events would take them about 8 h for a full training (at the 40 GB batch size),
    while for us it should take around 12 times longer because our data is bigger, i.e. around 4 days (which is what a training takes at the moment); see the back-of-the-envelope arithmetic below.
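Back-of-the-envelope arithmetic for the numbers above. The per-GPU batch size of 10 is taken from the training command at the top of this page and is the main assumption here.

```python
# Nodes per batch: theirs vs ours (approximate figures from the bullet above).
mlpf_nodes_per_event = 200     # ~100 particles -> ~200 nodes per graph
mlpf_batch_size_40gb = 128     # half of the 256 they fit on an 80 GB A100
our_nodes_per_event = 6000     # hit-level graphs
our_batch_size = 10            # per-GPU batch size from the training command above

print(mlpf_nodes_per_event * mlpf_batch_size_40gb)  # 25600 nodes per batch
print(our_nodes_per_event * our_batch_size)         # 60000 nodes per batch, same order of magnitude

# If the time per batch is comparable (similar nodes per batch), total training time
# scales with the number of optimizer steps needed for the same number of events:
events = 1_000_000
step_ratio = (events / our_batch_size) / (events / mlpf_batch_size_40gb)
print(step_ratio)           # ~12.8, consistent with the "around 12 times more" estimate
print(8 * step_ratio / 24)  # 8 h * 12.8 -> ~4.3 days, roughly what a full training takes now
```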

  • Can we achieve the same performance with a transformer?
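A hypothetical baseline to answer that question: a vanilla transformer encoder over padded per-hit features. The input dimension, depth, and 3-dimensional clustering output are assumptions; this is not the repo's GATr model, only a sketch of what a comparison run could look like.

```python
import torch
import torch.nn as nn

class TransformerBaseline(nn.Module):
    """Plain transformer encoder over per-hit features, as a baseline sketch."""

    def __init__(self, in_dim=16, d_model=128, n_heads=8, n_layers=6, out_dim=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, out_dim)  # e.g. 3-dim clustering coordinates

    def forward(self, hits, padding_mask=None):
        # hits: (batch, n_hits, in_dim); padding_mask: (batch, n_hits), True where padded
        x = self.embed(hits)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.head(x)
```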

Evaluation

  • Investigate working points for PID; this might also impact the resolution of key variables.
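A minimal sketch of a working-point scan, assuming the evaluation stores per-particle softmax scores and integer truth labels; the array names and shapes are placeholders, not the repo's actual output format.

```python
import numpy as np

def pid_working_points(scores, truth, cls, thresholds=np.linspace(0.1, 0.9, 9)):
    """Efficiency and purity for selecting class `cls` as a function of the score threshold.

    scores: (n, n_classes) softmax outputs, truth: (n,) integer labels; both are
    placeholders for whatever the evaluation code actually stores.
    """
    points = []
    for t in thresholds:
        selected = scores[:, cls] > t
        true_positives = np.sum(selected & (truth == cls))
        efficiency = true_positives / max(np.sum(truth == cls), 1)
        purity = true_positives / max(np.sum(selected), 1)
        points.append((t, efficiency, purity))
    return points
```

Re-running the downstream resolution plots at a few of these working points would show how strongly the PID choice feeds into the key variables.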

Plotting

  • Define a set of plots to easily assess performance and highlight critical outcomes.

  • Implement an option to compare the performance of two models (see the sketch below).
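A minimal sketch of what the two-model comparison could look like: overlay the same evaluation distribution for both models on one set of axes. Function and argument names are placeholders, not part of the existing plotting code.

```python
import matplotlib.pyplot as plt

def compare_models(values_a, values_b, label_a="model A", label_b="model B",
                   xlabel="mass resolution", bins=50, out="comparison.png"):
    """Overlay the same evaluation distribution for two models."""
    fig, ax = plt.subplots()
    ax.hist(values_a, bins=bins, histtype="step", density=True, label=label_a)
    ax.hist(values_b, bins=bins, histtype="step", density=True, label=label_b)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("normalized entries")
    ax.legend()
    fig.savefig(out)
    plt.close(fig)
```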