Training and models
Model checkpoints
Clustering only
Trained on dataset | Notes | Checkpoint | Wandb link
---|---|---|---
dR=0.5 | | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/_epoch=0_step=6500.ckpt |
gun_dr_log_logE_v0_061224 | BEST so far (current eval for the note) | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v7/_epoch=5_step=42000.ckpt (very similar to /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v9_dr01/) | https://wandb.ai/ml4hep/mlpf_debug/runs/3vs03stt/overview
Hss | model evaluation was done on this | /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/181024_Hss/ | https://wandb.ai/ml4hep/mlpf_debug/runs/imme1iwf/overview
Clustering from above, plus training of energy correction + PID
Trained on dataset | Notes | Checkpoint | Wandb link
---|---|---|---
dR=0.5 GT clusters | | | https://wandb.ai/fcc_ml/mlpf_debug/runs/8tx8cprl
dR=0.5 | | | https://wandb.ai/fcc_ml/mlpf_debug/runs/bwfdy7sq?nw=nwusergregorkrz
Hss | | /eos/user/g/gkrzmanc/results/2024/051124_fixPID_train_E_PID_Hss_clustering_CONT2/_epoch=2_step=1500.ckpt |
Training guidelines
Training of clustering
Train with 320k events:

```bash
python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{1..3200}.root --data-config config_files/config_hits_track_v2_noise.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_noise.py --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/ --num-workers 0 --gpus 1,2,3 --batch-size 10 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-displayname Gatr_qmin3_Vlocal10_gun_logdr_ecalhcalweightsLA --wandb-projectname mlpf_debug --wandb-entity ml4hep --frac_cluster_loss 0 --qmin 3 --use-average-cc-pos 0.99 --tracks --train-val-split 0.99
```
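The `pf_tree_{1..3200}.root` pattern is expanded by the shell before the training script starts, so it can be useful to check that all expected files actually exist on EOS. A minimal sketch, assuming the same dataset path as above:

```bash
# Count how many of the 3200 expected ROOT files are present (errors for
# missing files go to stderr, which is suppressed here).
ls /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{1..3200}.root 2>/dev/null | wc -l
# should print 3200 if nothing is missing
```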
Energy correction and PID training
First, the deep neural network (DNN) head is pre-trained on a small dataset. Empirically, this is much faster and lets us see results from training on the full dataset sooner.
Step 1: Save the high-level features for training the DNN head
The high-level features first need to be exported by adding the `--save-features` flag. Example command:

```bash
python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{0..10}.root --data-config config_files/config_hits_track_v4.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /eos/user/path/to/the/prefix --wandb-displayname export_features --num-workers 1 --prefetch-factor 16 --gpus 0 --batch-size 16 --start-lr 1e-3 --num-epochs 1 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --freeze-clustering --regress-unit-p --PID-4-class --restrict_PID_charge --separate-PID-GATr --save-features
```
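Once the export has run, the features should appear under the model prefix directory; Step 2 below reads them from a `cluster_features` subdirectory, so a quick check could look like this (the exact layout is an assumption based on the `--dataset-path` used in Step 2):

```bash
# List a few of the exported feature files (directory name assumed to match
# the --dataset-path passed to the Step 2 script).
ls /eos/user/path/to/the/prefix/cluster_features | head
```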
Step 2: Train the energy correction head
```bash
python src/train_energy_correction_head.py --prefix /eos/user/path/to/the/prefix/ --wandb_name 27Nov_Muons_GT_Clusters --loss L1 --PIDs 130,2112 --dataset-path /eos/user/path/to/the/prefix/cluster_features --batch-size 16 --gnn-features-placeholders 16
```
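For orientation, the head trained here is conceptually a small MLP that regresses an energy correction from the exported high-level cluster features and is optimized with an L1 loss. The sketch below is an illustration only, not the repo's implementation in `src/train_energy_correction_head.py`; the feature count and layer sizes are assumptions, and the 16 placeholder inputs mirror `--gnn-features-placeholders 16`:

```python
# Illustrative sketch of a DNN energy-correction head (not the actual repo code).
import torch
import torch.nn as nn

N_CLUSTER_FEATURES = 14   # assumed number of exported high-level features
N_GNN_PLACEHOLDERS = 16   # mirrors --gnn-features-placeholders 16

class EnergyCorrectionHead(nn.Module):
    """Small MLP predicting one energy-correction factor per cluster."""
    def __init__(self, n_in: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

head = EnergyCorrectionHead(N_CLUSTER_FEATURES + N_GNN_PLACEHOLDERS)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()     # mirrors --loss L1

# Dummy batch standing in for the features exported in Step 1.
feats = torch.randn(16, N_CLUSTER_FEATURES + N_GNN_PLACEHOLDERS)
target = torch.rand(16) + 0.5   # e.g. a true/reconstructed energy ratio
opt.zero_grad()
loss = loss_fn(head(feats), target)
loss.backward()
opt.step()
```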
The exported particles are automatically split into a train and an eval set; the performance on the eval set can be monitored in Wandb. Pick a good step (usually around 10k steps is enough), then load this head into the full training loop by adding `--ckpt-neutral path/to/model.ckpt`. Important: `--ckpt-neutral` does NOT overwrite the model loaded with `--load-model-weights` (so the model loaded with `--load-model-weights` should only be the clustering model).
Step 3: Energy correction and PID - train on the full dataset
```bash
python -m src.train_lightning1 --data-train /eos/experiment/fcc/ee/datasets/mlpf/CLD/train/gun_dr_050_v1_271124/pf_tree_{0..1000}.root --data-config config_files/config_hits_track_v4.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /eos/user/g/gkrzmanc/results/2024/E_PID_02122024_dr05_GT_clusters --wandb-displayname Train_Energy_corr_and_PID --num-workers 1 --prefetch-factor 16 --gpus 0 --batch-size 16 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.01 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --freeze-clustering --regress-unit-p --PID-4-class --restrict_PID_charge --separate-PID-GATr --ckpt-neutral /eos/user/g/gkrzmanc/results/2024/PID_muons_fix_features/training_energy_correction_head/model_step_8000_pid_2112.pkl
```

(Change `--ckpt-neutral` to the appropriate model.)
Notes on parallelization
`--num-workers 1 --prefetch-factor 16` usually works well: it is beneficial to have a worker fetching data in the background while the model is being optimized.
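These flags map onto PyTorch `DataLoader` semantics: with one worker process the next batches are read and collated in the background while the GPU runs the current step. A minimal illustration (the repo wires this up inside the training script; the dataset here is a stand-in):

```python
# Illustration of num_workers / prefetch_factor; not the repo's data pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16))  # stand-in for the hit/track data
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=1,       # one background worker reads and collates batches
    prefetch_factor=16,  # the worker keeps up to 16 batches queued ahead
)

for (batch,) in loader:
    pass  # the optimization step would run here while the worker prefetches
```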
Additional arguments
- `--separate-PID-GATr`: With this flag on, separate GATr backbones are used for energy correction and for the PID head.
- `--restrict_PID_charge`: Only train PID on cases that make sense (e.g. photons etc.).
- `--n-layers-PID-head`: default `1`. Adds more layers to the PID head, which is by default just a linear probe; around 3 layers seem to work slightly better than 1 (see the sketch after this list).
- `--use-gt-clusters`: Train on ground-truth clusters. Useful for debugging issues, or when the clustering for a dataset is not ready yet.
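As referenced in the `--n-layers-PID-head` item above, the PID head is by default a linear probe; the sketch below only illustrates the difference between the default single layer and a deeper head with four output classes (as in `--PID-4-class`). It is an assumption for illustration, not the repo's actual module:

```python
# Illustration only: linear probe vs. a deeper PID classification head.
import torch.nn as nn

def pid_head(n_in: int, n_layers: int = 1, hidden: int = 64, n_classes: int = 4) -> nn.Module:
    """Build a PID head; n_layers=1 reproduces the default linear probe."""
    if n_layers == 1:
        return nn.Linear(n_in, n_classes)
    layers, d = [], n_in
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

linear_probe = pid_head(30)             # --n-layers-PID-head 1 (default)
deeper_head = pid_head(30, n_layers=3)  # ~3 layers reported to work slightly better
```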
[archived] Energy calibration (DNN head)
With the codebase as of 15 October 2024: Model: /eos/user/g/gkrzmanc/2024/train/export_f_10_09_testset_300_files_avg_pos_reprod/intermediate_plots/model_step_23000_pid_130.pkl
10-15 particles dataset: --ckpt-neutral /eos/user/g/gkrzmanc/2024/export_clusters_1015_pxyz/model_DNN_neutral_pos/intermediate_plots/model_step_10000_pid_2112.pkl --ckpt-charged /eos/user/g/gkrzmanc/2024/export_clusters_1015_pxyz/model_DNN_charged/intermediate_plots/model_step_30000_pid_211.pkl
(neutrals training: https://wandb.ai/fcc_ml/mlpf_debug_energy_corr/runs/nm26pzxq?nw=nwusergregorkrz, charged training: https://wandb.ai/fcc_ml/mlpf_debug_energy_corr/runs/a5uqwbs2?nw=nwusergregorkrz)
[archived] 4-vector
Training at https://wandb.ai/fcc_ml/mlpf_debug/runs/akwtssbw -> Full model /eos/user/g/gkrzmanc/2024/eval_1015/FT_Ep_reg_1015/FT_Ep_reg_1015/_epoch=19_step=24000.ckpt
Training with regression of unit p vectors: https://wandb.ai/fcc_ml/mlpf_debug/runs/kf2qugxs/overview
Regression of unit p vector + simple PID (gamma/electron/CH/NH): https://wandb.ai/fcc_ml/mlpf_debug/runs/28loqimm/overview
Evaluation command
```bash
python -m src.train_lightning1 --data-test /mnt/home/jkieseler/mlpf_energy_correction/datasets/10_15_v1_pxyz/pf_tree_{500..550}.root --data-config config_files/config_hits_track_v1.yaml -clust -clust_dim 3 --network-config src/models/wrapper/example_mode_gatr_e.py --model-prefix /mnt/home/jkieseler/mlpf_energy_correction/results/eval_FT_Ep_reg_1015 --wandb-displayname eval_FT_Ep_reg_1015 --num-workers 0 --gpus 0 --batch-size 8 --start-lr 1e-3 --num-epochs 100 --optimizer ranger --fetch-step 0.1 --condensation --log-wandb --wandb-projectname mlpf_debug --wandb-entity fcc_ml --frac_cluster_loss 0 --qmin 1 --use-average-cc-pos 0.99 --lr-scheduler reduceplateau --tracks --correction --ec-model gatr-neutrals --regress-pos --add-track-chis --load-model-weights /mnt/home/jkieseler/mlpf_energy_correction/results/FT_Ep_reg_1015/_epoch\=19_step\=24000.ckpt --freeze-clustering --predict --ckpt-neutral /mnt/home/jkieseler/mlpf_energy_correction/models/ckpt_EC_DNN/neutral_10000_250624.ckpt --ckpt-charged /mnt/home/jkieseler/mlpf_energy_correction/models/ckpt_EC_DNN/charged_10000_250624.ckpt
```
Evaluation with ML PID: https://wandb.ai/fcc_ml/mlpf_debug/runs/yryih59e/overview
Training with p vector + ML PID regression: https://wandb.ai/fcc_ml/mlpf_debug/runs/28loqimm?nw=nwusergregorkrz