TaskChain vs StepChain - dmwm/WMCore GitHub Wiki

[Documentation still in progress]

This wiki page describes the main differences between a TaskChain and a StepChain request and when to use one or the other. The table below summarizes these differences:

| Request type | TaskChain | StepChain |
|---|---|---|
| Definition | A chain of an indefinite number of tasks where one task reads the output of the previous one, and so on. The output of one task can also be used by two subsequent tasks. | A chain of an indefinite number of steps where one step reads the output of the previous one, and so on. The output of one step can also be read in by different steps. |
| Grid behaviour | Each task has its own set of grid jobs. Each task writes its output to the unmerged storage namespace; those files are then merged with other files (going to the merged namespace), and finally the subsequent tasks read the merged file as input (except for KeepOutput=False tasks). | Each grid job executes all the steps. The output files are written to the unmerged storage namespace once all the steps are finished. This then triggers one merge job per output module, and the merged files go to the merged area. |
| Request structure | The TaskChain argument contains the number of tasks in the request. Each task (Task1, Task2, Task3) contains its own configuration. Arguments provided in the task dictionary take precedence over the top-level value. | The StepChain argument contains the number of Steps in the request. Each step can have its own definition as well, except for a few arguments that are only supported at the top level of the request (TimePerEvent, SizePerEvent). |
| Pros | Job splitting is more flexible, since each task creates its own set of jobs. Lumi section size can be bigger. Each task can have completely different resource requirements. The request can be adjusted such that only the output of the last task is kept/saved. Accepts a different CMSSW/ScramArch for each Task. | Best for diskless cloud resources, since there is no dependency on storage between the steps. Should have higher efficiency, since less time is spent on the job bootstrap. The request can be made such that only the output of the last step is kept/saved. Accepts a different CMSSW/ScramArch for each Step. |
| Cons | TaskChain recovery is painful. Task and merge dependencies add a significant overhead to the request lifetime. | Potential resource wastage if steps have totally different resource needs. Currently does not support the same output module in different steps (with KeepOutput=True). |
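
The precedence rule from the "Request structure" row can be sketched in Python. This is an illustration only, not WMCore code: the helper `effective_task_args` is hypothetical, the values are placeholders, and the dictionary is far from a complete, valid request. Only the argument names (TaskChain, Task1/Task2, KeepOutput, Multicore, ScramArch) are taken from this page.

```python
def effective_task_args(request, task_key):
    """Return the arguments seen by one task: top-level values,
    overridden by whatever the task dictionary sets itself."""
    # Keep only scalar top-level arguments; skip the Task1..TaskN dictionaries.
    merged = {k: v for k, v in request.items() if not isinstance(v, dict)}
    merged.update(request[task_key])  # task-level values take precedence
    return merged


# Hypothetical, heavily trimmed request; all values are placeholders.
request = {
    "TaskChain": 2,                    # number of tasks in the request
    "Multicore": 1,                    # top-level default
    "ScramArch": "slc6_amd64_gcc630",  # top-level default
    "Task1": {"KeepOutput": False, "Multicore": 4},
    "Task2": {"KeepOutput": True},
}

task1 = effective_task_args(request, "Task1")
task2 = effective_task_args(request, "Task2")
print(task1["Multicore"], task2["Multicore"])  # 4 1
```

Task1 overrides the top-level Multicore, while Task2 falls back to it; both inherit the top-level ScramArch.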

For more detailed documentation on how to create these requests, including which arguments are mandatory, please visit this WMCore wiki page. In addition, a few request examples can be found in DMWM and Integration.

Last but not least, StepChain requests currently have a limitation that prevents them from covering all of their possible use cases: one cannot save the output of every step when steps share the same output module definition. An issue was created and it should be solved in the near future.
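
A request could be screened for this limitation before submission, roughly as sketched below. This is not part of WMCore: the function name, the `name` and `output_modules` keys, and the module names are hypothetical (in practice the output modules are defined in each step's configuration, not in the request dictionary), and treating a missing KeepOutput as True is an assumption.

```python
from collections import defaultdict


def conflicting_output_modules(steps):
    """Map each output module name to the steps that keep it,
    restricted to names kept by more than one step."""
    kept_by = defaultdict(list)
    for step in steps:
        # Assumption: a step without KeepOutput keeps its output.
        if step.get("KeepOutput", True):
            for module in step["output_modules"]:
                kept_by[module].append(step["name"])
    return {m: names for m, names in kept_by.items() if len(names) > 1}


# Hypothetical three-step request; module names are placeholders.
steps = [
    {"name": "Step1", "KeepOutput": True, "output_modules": ["RAWSIMoutput"]},
    {"name": "Step2", "KeepOutput": False, "output_modules": ["RAWSIMoutput"]},
    {"name": "Step3", "KeepOutput": True,
     "output_modules": ["RAWSIMoutput", "AODSIMoutput"]},
]

conflicts = conflicting_output_modules(steps)
print(conflicts)  # {'RAWSIMoutput': ['Step1', 'Step3']}
```

Step2 uses the same module name but does not keep its output, so only Step1 and Step3 are reported as conflicting.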

Use cases and reasons for NOT using StepChain

Given the explanation and comparison above, there are clear use cases where TaskChain requests should be preferred over StepChain. These include (non-exhaustive list):

  • requests with Steps using a different number of cores (Multicore). This is NOT a limitation of StepChain per se, but it hurts the overall job efficiency, given that StepChain runs all the steps in the same job. On the other hand, if the single-core steps represent a very small portion of the total job length, you might still consider using StepChain.
  • requests with Steps using different OS versions (e.g., mixed steps using SLC6 and CC7). StepChain supports a different CMSSW and a different ScramArch per step, but if a ScramArch references a different OS, the job will likely hit problems at runtime.
  • requests that need TransientOutputModule, which is not supported in StepChain. So, if only a subset of the output modules from a cmsDriver command (Task/Step) needs to be saved, StepChain cannot be used. It does support the KeepOutput parameter, though, which drops ALL outputs of a specific Step and does not register them in DBS/PhEDEx.
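
The three criteria above can be turned into a rough pre-check. The sketch below is not WMCore code and rests on stated assumptions: the job is taken to hold the largest Multicore value for its whole duration, the `hours` key and the 0.8 efficiency threshold are invented for illustration, the OS is read from the first component of the ScramArch string, and the dictionary keys simply reuse the names on this page.

```python
def core_efficiency(steps):
    """Fraction of reserved core-time actually used, assuming the job
    holds max(Multicore) cores for its whole duration."""
    reserved = max(s["Multicore"] for s in steps) * sum(s["hours"] for s in steps)
    used = sum(s["Multicore"] * s["hours"] for s in steps)
    return used / reserved


def os_tag(scram_arch):
    """OS part of a ScramArch string, e.g. 'slc6' in 'slc6_amd64_gcc630'."""
    return scram_arch.split("_")[0]


def reasons_against_stepchain(steps, min_efficiency=0.8):
    """List the criteria from this page that speak against StepChain."""
    reasons = []
    if core_efficiency(steps) < min_efficiency:  # threshold is arbitrary
        reasons.append("mixed core counts waste reserved cores")
    if len({os_tag(s["ScramArch"]) for s in steps}) > 1:
        reasons.append("steps need different OS versions")
    if any(s.get("TransientOutputModule") for s in steps):
        reasons.append("transient output modules are not supported")
    return reasons


# Hypothetical two-step request: a short single-core step, then a long 8-core step.
steps = [
    {"Multicore": 1, "hours": 1.0, "ScramArch": "slc6_amd64_gcc630"},
    {"Multicore": 8, "hours": 3.0, "ScramArch": "slc7_amd64_gcc700"},
]

print(core_efficiency(steps))  # 0.78125
for reason in reasons_against_stepchain(steps):
    print("-", reason)
```

Here the single-core step leaves 7 of 8 cores idle for one hour out of four, giving 25/32 of the reserved core-time in use. If that step were much shorter, the efficiency would approach 1 and StepChain could still be a reasonable choice, as noted in the first bullet.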