Deterministic Pileup

Introduction

This reference page explains how the concept of deterministic pileup is implemented in the WMAgent system.

Summary

The latest modification to deterministic pileup occurred here: https://github.com/dmwm/WMCore/pull/5954/files

Since pre-mixing uses "data" mode for pileup (the same mode as if you were mixing in actual minimum bias events [do we support this for real?]), the behavior is as follows:

  1. Initialize a random number generator seeded with the task name (which contains the workflow name).
  2. Shuffle the order in which the pileup files will be read using this random number generator (see the sketch after this list).
    • Each workflow therefore has its own unique order of files read (and hence of events)
  3. Each job in the workflow reads the events from the files in this order, globally
    • Job 2 picks up right where job 1 left off
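
A minimal sketch of the idea (not the actual WMCore code; names and details are illustrative):

import random

def orderedPileupFiles(taskName, pileupFiles):
  # Illustrative only: seed a generator with the task name so every job in
  # the workflow computes the same, workflow-specific file order.
  rng = random.Random(taskName)
  shuffled = list(pileupFiles)
  rng.shuffle(shuffled)
  return shuffled

# Jobs then read this list sequentially: job N starts reading events where
# job N-1 stopped, so within one workflow no pileup event is read twice
# unless the list wraps around.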

So unless a workflow has more events to generate than exist in the pre-mixed minimum bias sample, no event will be reused within a workflow. When comparing between workflows (output datasets) there will be overlap if the workflows used the same files. However, having one file in common with another dataset does not mean you have more in common. (The files are shuffled, not rotated by a random offset.)

For ACDCs this has some complications. ACDCs do not re-create the exact jobs that were not completed in the original workflow, so you cannot "fill in the gaps". ACDCs are new workflows, so statistically there will be the same amount of overlap as between two unrelated workflows. There are ways to mitigate this:

  1. Don't use ACDCs. Request enough additional events that an ACDC is not needed (resubmit from scratch any workflow that falls far short of what's needed).
  2. Don't worry about duplicated events in the combined WF + ACDC output (make the premixed sample large enough that it's very rare)
  3. We could, potentially, do additional development to remove any premixed file used in the first step of the workflow from consideration in the ACDC.

See the possible pitfalls at the end of this document as well.

Details

How does non-deterministic pileup work?

In a nutshell, when a request specifies an MCPileup or DataPileup dataset, the workload definition includes this information in the processing task so that the files in these datasets are included in the PSet file for the cmsRun processes in the jobs.

WorkQueue

For workloads with pileup, the role of the WorkQueue is to read from DBS the list of all blocks in the pileup dataset, together with their locations and files, and to store this in a JSON file that is used as a payload for the jobs.
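
A hypothetical illustration of the kind of information stored in this payload (the real WMCore schema may differ; the block name, LFNs, site name and counts are made up):

# Hypothetical payload layout: per-block file lists, locations and event
# counts, grouped by pileup type ("data" vs "mc")
pileupPayload = {
  "data": {
    "/MinBias/HypotheticalEra-v1/GEN-SIM-DIGI-RAW#block-1": {
      "FileList": ["/store/mc/.../premixed_file_1.root",
                   "/store/mc/.../premixed_file_2.root"],
      "PhEDExNodeNames": ["T1_US_FNAL_Disk"],
      "NumberOfEvents": 200000,
    },
  },
  "mc": {},
}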

WMRuntime

At runtime, each job reads the JSON payload and filters the blocks that are present at the site where it is currently running. The files expected to be at the site's SE are included in the fileNames attribute of the corresponding mixing module. For example, the following modifications are made to the PSet (in the process attribute) if both MCPileup and DataPileup are specified (this is a modified version of the code in the repository):

import FWCore.ParameterSet.Config as cms

def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE):
  # First we find the MixingModules and DataMixingModules
  mixModules, dataMixModules = [], []
  prodsAndFilters = {}
  prodsAndFilters.update(self.process.producers)
  prodsAndFilters.update(self.process.filters)
  for key, value in prodsAndFilters.items():
    if value.type_() == "MixingModule":
      mixModules.append(value)
    if value.type_() == "DataMixingModule":
      dataMixModules.append(value)
  # Then we add the files to the modules separately depending on the type
  for m in mixModules:
    # The pileup input may live in either the "input" or the "secsource" attribute
    inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
    if inputTypeAttrib is None:
      continue
    inputTypeAttrib.fileNames = cms.untracked.vstring()
    for lfn in mcFilesInSE:
      inputTypeAttrib.fileNames.append(lfn)
  for m in dataMixModules:
    inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
    if inputTypeAttrib is None:
      continue
    inputTypeAttrib.fileNames = cms.untracked.vstring()
    for lfn in dataFilesInSE:
      inputTypeAttrib.fileNames.append(lfn)
  return

Deterministic Pileup

There is an argument for ReDigi requests that tells the WMAgent that the DataPileup should be handled differently; the objective is to have reproducible pileup mixing. Below we describe the changes in the workflow when dealing with deterministic pileup.

WMAgent

On the WMAgent side, there is an addition in the JobSplitting for the LumiBased and EventAwareLumiBased algorithms. The splitting algorithm keeps track of the number of processing jobs that have been created and assigns each job a number of events to skip in the pileup, equal to:

# Assuming job N
eventsToSkipInPileup = ((N-1) * eventsPerLumi * lumisPerJob)
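
For illustration, assuming eventsPerLumi = 100 and lumisPerJob = 8 (made-up values), the first few jobs would skip:

# Example values only, to show how the skip count grows per job
eventsPerLumi, lumisPerJob = 100, 8
for N in range(1, 5):
  eventsToSkipInPileup = (N - 1) * eventsPerLumi * lumisPerJob
  print(N, eventsToSkipInPileup)   # 1 -> 0, 2 -> 800, 3 -> 1600, 4 -> 2400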

WMRuntime

The runtime payload includes two new elements for these workflows:

  • The number of events to skip in the data pileup
  • The number of events per block in the pileup dataset

Since we filter the blocks that are present at the site, we first calculate the total number of events present in the blocks we will use from the pileup (usually the total number of events in the dataset). Then we take the number of events to skip modulo this total, so that if needed we wrap around to the beginning of the pileup dataset.
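
A minimal sketch of this wrap-around logic, with hypothetical variable names:

def skipEventsAtSite(eventsToSkip, eventsPerBlockAtSite):
  # Sketch only: reduce the agent-supplied skip count modulo the number of
  # pileup events actually available at this site, so the sequential read
  # wraps around to the beginning of the pileup sample instead of running
  # past its end.
  totalEvents = sum(eventsPerBlockAtSite.values())
  return eventsToSkip % totalEvents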

The modifications to the PSet look like this:

def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE, eventsToSkip):
  # First we find the MixingModules and DataMixingModules
  ...
  # Then we add the files to the modules separately depending on the type;
  # only the data pileup handling changes
  ...
  for m in dataMixModules:
    inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
    if inputTypeAttrib is None:
      continue
    inputTypeAttrib.fileNames = cms.untracked.vstring()
    # We sort the files so they are always added in the same order in all jobs
    for lfn in sorted(dataFilesInSE):
      inputTypeAttrib.fileNames.append(lfn)
    # Then we make the modifications for deterministic pileup
    inputTypeAttrib.skipEvents = cms.untracked.uint32(eventsToSkip)
    inputTypeAttrib.sequential = cms.untracked.bool(True)
  return
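
For reference, the relevant part of a data mixing module could end up looking roughly like this after the modification (a hypothetical illustration; the LFNs and skip count are made up):

import FWCore.ParameterSet.Config as cms

# Hypothetical end result for one DataMixingModule's secondary source
secsource = cms.SecSource("EmbeddedRootSource",
  fileNames = cms.untracked.vstring(
    "/store/mc/.../premixed_file_1.root",   # made-up LFNs, in sorted order
    "/store/mc/.../premixed_file_2.root",
  ),
  skipEvents = cms.untracked.uint32(1600),  # events this job skips in the pileup
  sequential = cms.untracked.bool(True),    # read pileup events in order
)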

Possible pitfalls

  • In the WorkQueue, the input dataset is split by blocks and each block is acquired individually by the WMAgents. If two blocks from the input dataset land in different WMAgents, the job counts are not shared between the agents, so the jobs in each WMAgent will start using the pileup dataset from the beginning. This can be mitigated by assigning the workflow to a team that is configured on only one agent.
  • If the input dataset doesn't have a uniform number of events per lumi (e.g. due to filter efficiencies in MC datasets), the calculation of events to skip in the pileup dataset won't be accurate and there can be holes in the intervals of events used from the pileup dataset by the jobs.
  • If jobs are submitted to multiple sites and those sites have different lists of available pileup blocks, one can again get duplication because the file lists will be shuffled in different ways.