Training and models - selvaggi/mlpf GitHub Wiki

Model checkpoints

Clustering only

| Trained on dataset | Notes | Checkpoint | Wandb link |
| --- | --- | --- | --- |
| dR=0.5 | | `/eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/_epoch=0_step=6500.ckpt` | |
| gun_dr_log_logE_v0_061224 | BEST so far (current eval for the note) | `/eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v7/_epoch=5_step=42000.ckpt` (very similar to `/eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_drlog_v9_dr01/`) | https://wandb.ai/ml4hep/mlpf_debug/runs/3vs03stt/overview |
| Hss | Model evaluation was done on this checkpoint | `/eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/181024_Hss/` | https://wandb.ai/ml4hep/mlpf_debug/runs/imme1iwf/overview |
| Zuds | Model with new GT | `/eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/Zuds_2025_10_03/` | https://wandb.ai/ml4hep/mlpf_debug/runs/zmti364a?nw=nwuserdoloresg382 |

Clustering from above, training of energy correction + PID

| Trained on dataset | Notes | Checkpoint | Wandb link |
| --- | --- | --- | --- |
| dR=0.5 | GT clusters | | https://wandb.ai/fcc_ml/mlpf_debug/runs/8tx8cprl |
| dR=0.5 | | | https://wandb.ai/fcc_ml/mlpf_debug/runs/bwfdy7sq?nw=nwusergregorkrz |
| Hss | | `/eos/user/g/gkrzmanc/results/2024/051124_fixPID_train_E_PID_Hss_clustering_CONT2/_epoch=2_step=1500.ckpt` | |
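
Several entries above point to a directory rather than a single file, while the file entries follow the `_epoch=<E>_step=<S>.ckpt` naming. The sketch below shows one possible way to read the epoch and step from such names and pick the most advanced checkpoint from a list of paths. The helper names `parse_checkpoint` and `latest_checkpoint` are hypothetical and not part of the mlpf repository, and the sketch assumes every checkpoint of interest follows the naming seen in the tables.

```python
import re
from pathlib import PurePosixPath

# Matches file names such as "_epoch=5_step=42000.ckpt" (naming taken from the tables above).
CKPT_PATTERN = re.compile(r"_epoch=(\d+)_step=(\d+)\.ckpt$")


def parse_checkpoint(path):
    """Return (epoch, step) for a checkpoint path, or None if the name does not match."""
    match = CKPT_PATTERN.search(PurePosixPath(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def latest_checkpoint(paths):
    """Return the path with the highest (epoch, step); non-matching names are ignored."""
    parsed = [(parse_checkpoint(p), p) for p in paths]
    parsed = [(key, p) for key, p in parsed if key is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda item: item[0])[1]
```

The list of paths could come from a directory listing of one of the model folders; the most advanced checkpoint is not necessarily the best one, so the choice should still be checked against the Wandb run.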

Training guidelines

Training of clustering

Important! To train properly, the `limit_train_batches` value in `train_lightning1` must be set to:

    limit_train_batches = number of files (10000) * events per file (100) * train-val-split (0.98) / (batch_size * number_of_gpus)

The following command trains with 1M events:

    python -m src.train_lightning1 \
        --data-train /tmp/Zuds_2025_09_29_key4hep_20250529_CLD_r20250526/pf_tree_{1..10000}.parquet \
        --data-config config_files/config_hits_track_v4.yaml \
        -clust -clust_dim 3 \
        --network-config src/models/wrapper/example_mode_gatr_noise.py \
        --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/Zuds_2025_10_01/ \
        --num-workers 16 \
        --gpus 0,1,2,3 \
        --batch-size 20 \
        --start-lr 1e-3 \
        --num-epochs 100 \
        --optimizer ranger \
        --fetch-step 4 \
        --condensation \
        --log-wandb \
        --wandb-displayname drlog_alltracksbeta1 \
        --wandb-projectname mlpf_debug \
        --wandb-entity ml4hep \
        --frac_cluster_loss 0 \
        --qmin 3 \
        --use-average-cc-pos 0.98 \
        --tracks \
        --train-val-split 0.98 \
        --fetch-by-files
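
As a worked example of the `limit_train_batches` formula, the sketch below evaluates it for the clustering command (10000 files, 100 events per file, `--train-val-split 0.98`, `--batch-size 20`, and four GPUs from `--gpus 0,1,2,3`). The function and argument names are hypothetical and not part of `train_lightning1`. Rounding down to a whole number of batches is an assumption, since the formula itself does not say how a fractional result should be handled.

```python
def limit_train_batches(n_files, events_per_file, train_val_split, batch_size, n_gpus):
    """Evaluate the wiki formula for the number of training batches.

    Assumption: a fractional result is rounded down to a whole number of batches.
    """
    # Number of events that end up in the training split.
    n_train_events = round(n_files * events_per_file * train_val_split)
    # Events consumed per step across all GPUs.
    events_per_step = batch_size * n_gpus
    return n_train_events // events_per_step


# Values taken from the clustering training command.
value = limit_train_batches(
    n_files=10000,
    events_per_file=100,
    train_val_split=0.98,
    batch_size=20,
    n_gpus=4,
)
print(value)
```

With these inputs the result is 12250. The value needs to be recomputed whenever the number of files, the batch size, or the number of GPUs changes.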

Energy correction and PID training:

This trains the energy correction for neutrals and the PID for both charged and neutral particles, 3 'heads' in total:

    python -m src.train_lightning1 \
        --data-train /eos/experiment/fcc/users/m/mgarciam/mlpf/CLD/train/Zuds_2025_09_29_key4hep_20250529_CLD_r20250526/ \
        --data-config config_files/config_hits_track_v4.yaml \
        -clust -clust_dim 3 \
        --network-config src/models/wrapper/example_mode_gatr_noise.py \
        --model-prefix /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/E_PID_Ecor_3layer_pid_GTClusters_restricpid_v4/ \
        --wandb-displayname E_PID_Ecorneutral_GTClusters \
        --gpus 1 \
        --batch-size 20 \
        --start-lr 1e-3 \
        --num-epochs 100 \
        --optimizer ranger \
        --fetch-step 1 \
        --condensation \
        --log-wandb \
        --wandb-projectname mlpf_debug \
        --wandb-entity ml4hep \
        --frac_cluster_loss 0 \
        --qmin 1 \
        --use-average-cc-pos 0.99 \
        --lr-scheduler reduceplateau \
        --tracks \
        --correction \
        --ec-model gatr-neutrals \
        --regress-pos \
        --add-track-chis \
        --freeze-clustering \
        --regress-unit-p \
        --separate-PID-GATr \
        --load-model-weights /eos/user/m/mgarciam/datasets_mlpf/models_trained_CLD/gun_dr_025_v3/_epoch=0_step=6500.ckpt \
        --n-layers-PID-head 3 \
        --use-gt-clusters \
        --fetch-by-files \
        --train-val-split 0.98 \
        --restrict_PID_charge \
        --PID-4-class \
        --balance-pid-classes