Dataset creation - selvaggi/mlpf GitHub Wiki

To create datasets, look in the condor_CLD folder. The script script_create_dataset_train.sh calls submit_jobs_train.py, which submits the condor jobs that run the chain defined in run_sequence_CLD_train_*.py.

It is important to keep the CLDConfig version up to date and to use the geometry file (CLD_o2_v0X.xml) that matches the reconstruction in CLDConfig. The key4hep release should also be updated to the latest stable version available.

Another useful condor trick is to bump your schedd to the most available one: myschedd bump. If you change your schedd, previously submitted jobs stay on the schedd where they were started. You can check which one you are using with myschedd show, and go back to another one with myschedd set bigbirdXX.cern.ch.
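If the schedd needs to be checked or changed from a submission script, the three commands above can be wrapped as in the sketch below. The helper names myschedd_command and run_myschedd are hypothetical and not part of the repository. Only the subcommands named above (show, bump, set) are used, and the sketch assumes they run non-interactively.

```python
import shutil
import subprocess


def myschedd_command(action, host=None):
    """Build the argument list for one of the myschedd subcommands named above."""
    if action not in ("show", "bump", "set"):
        raise ValueError(f"unsupported action: {action}")
    if action == "set":
        if host is None:
            raise ValueError("'set' needs a schedd host name")
        return ["myschedd", "set", host]
    return ["myschedd", action]


def run_myschedd(action, host=None):
    """Run the command if myschedd is installed; otherwise only report it."""
    cmd = myschedd_command(action, host)
    if shutil.which(cmd[0]) is None:
        print("myschedd not found; would run:", " ".join(cmd))
        return None
    # Assumption: the tool returns without asking for input
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, run_myschedd("show") before submitting records which schedd the new jobs will land on.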

Important remarks: your condor jobs can fail because:

  1. You did not create a directory in advance
  2. The disk is full
  3. The disk has a quota on the number of files and it has been reached
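These three failure modes can be checked before submitting. The sketch below is a minimal pre-flight check, not part of the repository: the function name preflight_check and its thresholds are hypothetical. On network filesystems the numbers reported by shutil.disk_usage may not reflect the user quota, so treat the result as a hint only.

```python
import os
import shutil


def preflight_check(output_dir, min_free_bytes=1_000_000_000, max_files=None):
    """Return a list of problems that would likely make a job fail."""
    problems = []

    # 1. The directory must exist before the job writes to it
    if not os.path.isdir(output_dir):
        problems.append(f"missing directory: {output_dir}")
        return problems  # the remaining checks need an existing directory

    # 2. Free space on the filesystem that holds output_dir
    free = shutil.disk_usage(output_dir).free
    if free < min_free_bytes:
        problems.append(f"low disk space: {free} bytes free")

    # 3. File count below output_dir only; a real quota may cover a larger area
    if max_files is not None:
        n_files = sum(len(files) for _, _, files in os.walk(output_dir))
        if n_files >= max_files:
            problems.append(f"file count {n_files} reached limit {max_files}")

    return problems
```

An empty list means none of the three checks found a problem; anything else is worth fixing before the jobs are submitted.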

The branches that are absolutely needed to create the output are:

    out.outputCommands = [
        "drop *",
        "keep *MCParticles*",
        "keep *Pandora*",
        "keep *SiTracks*",
        "keep *MCTruthLink*",
        "keep *Clusters*",
        "keep *RecoParticles*",
        "keep *MUON*",
        "keep *ECALBarrel*",
        "keep *ECALEndcap*",
        "keep *HCALBarrel*",
        "keep *HCALEndcap*",
        "keep *HCALOther*",
        "keep *TrackerHit*",
        "keep *Vertices*",
    ]

The following branches are needed as well: SiTracks_Refitted, SiTracksMCTruthLink, PandoraPFOs, ECALOther, CalohitMCTruthLink, PandoraClusters.
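To see which of the explicitly named branches are already covered by the keep patterns, the rules can be emulated in plain Python. This is a sketch under stated assumptions: commands are applied in order, the last matching command wins, and * is a shell-style wildcard. The helper is_kept is hypothetical and does not call the real output module.

```python
from fnmatch import fnmatchcase

output_commands = [
    "drop *",
    "keep *MCParticles*", "keep *Pandora*", "keep *SiTracks*",
    "keep *MCTruthLink*", "keep *Clusters*", "keep *RecoParticles*",
    "keep *MUON*", "keep *ECALBarrel*", "keep *ECALEndcap*",
    "keep *HCALBarrel*", "keep *HCALEndcap*", "keep *HCALOther*",
    "keep *TrackerHit*", "keep *Vertices*",
]

required = [
    "SiTracks_Refitted", "SiTracksMCTruthLink", "PandoraPFOs",
    "ECALOther", "CalohitMCTruthLink", "PandoraClusters",
]


def is_kept(collection, commands):
    """Apply the commands in order; the last matching one decides."""
    kept = True  # assumption: a collection is kept unless a command drops it
    for command in commands:
        action, pattern = command.split(maxsplit=1)
        if fnmatchcase(collection, pattern):
            kept = action == "keep"
    return kept


# Required branches that no keep pattern matches
missing = [name for name in required if not is_kept(name, output_commands)]
print(missing)
```

Under these assumptions the list above does not match ECALOther (there is no *ECALOther* pattern), so it may need its own keep line. Check this against the configuration actually used in the run_sequence_CLD_train_*.py files.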

The model uses parquet files. The conversion from edm4hep to the required format is done using the scripts in this repo: https://github.com/doloresgarcia/MLPF_datageneration