Training and models

Model checkpoints

Clustering only

| Trained on dataset | Notes | Checkpoint | Wandb link |
| --- | --- | --- | --- |
| dR=0.5 | | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/_epoch=0_step=6500.ckpt | |
| gun_dr_log_logE_v0_061224 | BEST so far (current eval for the note); very similar to /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v9_dr01/ | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v7/_epoch=5_step=42000.ckpt | https://wandb.ai/ml4hep/mlpf_debug/runs/3vs03stt/overview |
| Hss | model evaluation was done on this checkpoint | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/181024_Hss/ | https://wandb.ai/ml4hep/mlpf_debug/runs/imme1iwf/overview |

Clustering from above, training of energy correction + PID

| Trained on dataset | Notes | Checkpoint | Wandb link |
| --- | --- | --- | --- |
| dR=0.5 | GT clusters | | https://wandb.ai/fcc_ml/mlpf_debug/runs/8tx8cprl |
| dR=0.5 | | | https://wandb.ai/fcc_ml/mlpf_debug/runs/bwfdy7sq?nw=nwusergregorkrz |
| Hss | | /eos/user/g/gkrzmanc/results/2024/051124_fixPID_train_E_PID_Hss_clustering_CONT2/_epoch=2_step=1500.ckpt | |

Training guidelines

Training of clustering

Trains with 320k events:

python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{1..3200}.root --data-config config_files/config_hits_track_v2_noise.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_noise.py --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/ --num-workers 0 --gpus 1,2,3 --batch-size 10 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-displayname Gatr_qmin3_Vlocal10_gun_logdr_ecalhcalweightsLA --wandb-projectname mlpf_debug --wandb-entity ml4hep --frac_cluster_loss 0 --qmin 3 --use-average-cc-pos 0.99 --tracks --train-val-split 0.99
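The {1..3200} in --data-train is ordinary bash brace expansion: the shell builds the explicit file list before Python is invoked. A minimal illustration:

```bash
# bash expands the braces into an explicit list of files before python sees the argument
echo pf_tree_{1..3}.root
# -> pf_tree_1.root pf_tree_2.root pf_tree_3.root
```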

Energy correction and PID training

First, the deep neural network (DNN) part is pre-trained quickly on a small dataset. Empirically, this is much faster and lets us see results from training on the full dataset sooner.

Step 1: Save the high-level features for training the DNN head

The high-level features first need to be exported by adding the --save-features flag:

See example command

python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{0..10}.root --data-config config_files/config_hits_track_v4.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /eos/user/path/to/the/prefix --wandb-displayname export_features --num-workers 1 --prefetch-factor 16 --gpus 0 --batch-size 16 --start-lr 1e-3 --num-epochs 1 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --freeze-clustering --regress-unit-p --PID-4-class --restrict_PID_charge --separate-PID-GATr --save-features

Step 2: Train the energy correction head

python src/train_energy_correction_head.py --prefix /eos/user/path/to/the/prefix/ --wandb_name 27Nov_Muons_GT_Clusters --loss L1 --PIDs 130,2112 --dataset-path /eos/user/path/to/the/prefix/cluster_features --batch-size 16 --gnn-features-placeholders 16

The exported particles are automatically split into a train and an eval set. The performance on the eval set can be monitored in wandb. Pick a good step (usually around 10k steps is enough), then load the head into the training loop by adding --ckpt-neutral path/to/model.ckpt. Important: --ckpt-neutral does NOT overwrite the model loaded with --load-model-weights, so the model loaded with --load-model-weights should only be the clustering model.
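A minimal sketch of how the two flags are combined (paths here are placeholders, not real checkpoints; "..." stands for the Step 3 arguments shown below):

```bash
# Sketch only: placeholder paths; "..." = the remaining Step 3 arguments
python -m src.train_lightning1 ... \
  --freeze-clustering \
  --load-model-weights /path/to/clustering_only_checkpoint.ckpt \
  --ckpt-neutral /path/to/training_energy_correction_head/model_step_10000_pid_2112.pkl
```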

Step 3: Energy correction and PID - train on the full dataset

python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{0..1000}.root --data-config config_files/config_hits_track_v4.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /eos/user/g/gkrzmanc/results/2024/E_PID_02122024_dr05_GT_clusters --wandb-displayname Train_Energy_corr_and_PID --num-workers 1 --prefetch-factor 16 --gpus 0 --batch-size 16 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --freeze-clustering --regress-unit-p --PID-4-class --restrict_PID_charge --separate-PID-GATr --ckpt-neutral /eos/user/g/gkrzmanc/results/2024/PID_muons_fix_features/training_energy_correction_head/model_step_8000_pid_2112.pkl

(change --ckpt-neutral to point to the appropriate model)

Notes on parallelization

--num-workers 1 --prefetch-factor 16 usually works well: it is beneficial to have a worker fetching data in the background while the model is being optimized.
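Assuming these flags are passed through to the underlying PyTorch DataLoader (an assumption, not verified against the code), the usual choices look like this:

```bash
# one background worker, prefetching up to 16 batches while the GPU is busy (recommended default)
--num-workers 1 --prefetch-factor 16

# load data in the main process instead, e.g. when debugging the data pipeline;
# --prefetch-factor only applies when workers are used
--num-workers 0
```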

Additional arguments

  • --separate-PID-GATr: turning this flag on gives the energy correction and PID heads separate GATrs.
  • --restrict_PID_charge: only train the PID head on cases that make sense given the charge (e.g. photons).
  • --n-layers-PID-head: default 1. Adds more layers to the PID head, which by default is just a linear probe. Around 3 layers seems to work slightly better than 1.
  • --use-gt-clusters: train on ground-truth clusters. Useful for debugging issues, or when clustering for a dataset is not ready yet. An example fragment combining these flags is shown below.
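A hedged sketch of how these flags are appended to the Step 3 command ("..." stands for the arguments already shown above; the values are only illustrative):

```bash
# fragment only: "..." = the Step 3 arguments from the example above
python -m src.train_lightning1 ... \
  --separate-PID-GATr \
  --restrict_PID_charge \
  --n-layers-PID-head 3 \
  --use-gt-clusters
```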

[archived] Energy calibration (DNN head)

With the codebase as of 15 October 2024: Model: /eos/user/g/gkrzmanc/2024/train/export_f_10_09_testset_300_files_avg_pos_reprod/intermediate_plots/model_step_23000_pid_130.pkl

10-15 particles dataset:

  • --ckpt-neutral /eos/user/g/gkrzmanc/2024/export_clusters_1015_pxyz/model_DNN_neutral_pos/intermediate_plots/model_step_10000_pid_2112.pkl (neutrals training: https://wandb.ai/fcc_ml/mlpf_debug_energy_corr/runs/nm26pzxq?nw=nwusergregorkrz)
  • --ckpt-charged /eos/user/g/gkrzmanc/2024/export_clusters_1015_pxyz/model_DNN_charged/intermediate_plots/model_step_30000_pid_211.pkl (charged training: https://wandb.ai/fcc_ml/mlpf_debug_energy_corr/runs/a5uqwbs2?nw=nwusergregorkrz)

[archived] 4-vector

Training at https://wandb.ai/fcc_ml/mlpf_debug/runs/akwtssbw -> Full model /eos/user/g/gkrzmanc/2024/eval_1015/FT_Ep_reg_1015/FT_Ep_reg_1015/_epoch=19_step=24000.ckpt

Training with regression of unit p vectors: https://wandb.ai/fcc_ml/mlpf_debug/runs/kf2qugxs/overview

Regression of unit p vector + simple PID (gamma/electron/CH/NH): https://wandb.ai/fcc_ml/mlpf_debug/runs/28loqimm/overview

Evaluation command:

python -m src.train_lightning1 --data-test /mnt/home/jkieseler/mlpf_energy_correction/datasets/10_15_v1_pxyz/pf_tree_{500..550}.root --data-config config_files/config_hits_track_v1.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /mnt/home/jkieseler/mlpf_energy_correction/results/eval_FT_Ep_reg_1015 --wandb-displayname eval_FT_Ep_reg_1015 --num-workers 0 --gpus 0 --batch-size 8 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.1 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --load-model-weights /mnt/home/jkieseler/mlpf_energy_correction/results/FT_Ep_reg_1015/_epoch\=19_step\=24000.ckpt --freeze-clustering --predict --ckpt-neutral /mnt/home/jkieseler/mlpf_energy_correction/models/ckpt_EC_DNN/neutral_10000_250624.ckpt --ckpt-charged /mnt/home/jkieseler/mlpf_energy_correction/models/ckpt_EC_DNN/charged_10000_250624.ckpt

Evaluation with ML PID: https://wandb.ai/fcc_ml/mlpf_debug/runs/yryih59e/overview

Training with p vector + ML PID regression: https://wandb.ai/fcc_ml/mlpf_debug/runs/28loqimm?nw=nwusergregorkrz