Job Splitting - TC01/Treemaker GitHub Wiki

Job Splitting

Sometimes a dataset contains so many ntuples that running over it would take far too long, even with Treemaker-style multiprocessing. (Sometimes there are so many that attempting to use Treemaker multiprocessing causes strange behavior.) To that end, you can use job splitting.

There are two ways you can ask your jobs to be split. They are referred to as "split into" and "split by" respectively (I apologize in advance for these names).

As a rule, you should not split your job into pieces containing fewer ntuples than there are cores on the system(s) you are running Treemaker on. Because Treemaker already multiprocesses over the ntuples within each job, smaller jobs will not be any more efficient; they will just leave cores idle.

Split Into N

"Split Into N" means you want to split your job into X jobs of size N. That means, each job will run over at most N ntuples.

Use this option if, for example, you know how many cores your system has and want to cap how many ntuples each job handles relative to that number. Say you have a 12-core machine and don't want any one job to process more than 3x the number of cores: you could split into X jobs of size 36.
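The arithmetic behind this is just ceiling division. Here is a sketch in shell, where the total ntuple count of 100 is a made-up example:

```shell
# Hypothetical numbers: 100 ntuples, jobs of at most 36 ntuples each.
TOTAL_NTUPLES=100
JOB_SIZE=36

# Ceiling division gives the number of jobs the split would produce.
NUM_JOBS=$(( (TOTAL_NTUPLES + JOB_SIZE - 1) / JOB_SIZE ))
echo "Jobs needed: $NUM_JOBS"   # 100 ntuples / 36 per job rounds up to 3
```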

Split By N

"Split By N" means you want to split your job into N jobs of size X. This is the opposite of the above option.

Split By is useful if you know you want to run, say, 5 separate jobs and don't particularly care how many ntuples end up in each one. This works best when you have some sense of the size of the dataset you are running over: for a small dataset, 5 jobs might be sufficient, while a large one might call for 20 or 30.

(Or 50 or 100, for really large samples).
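In the same shell-arithmetic terms (again with a hypothetical ntuple count), Split By fixes the number of jobs and derives the size of each:

```shell
# Hypothetical numbers: 100 ntuples spread over 5 jobs.
TOTAL_NTUPLES=100
NUM_JOBS=5

# Each job gets at most the ceiling of total/jobs ntuples.
JOB_SIZE=$(( (TOTAL_NTUPLES + NUM_JOBS - 1) / NUM_JOBS ))
echo "Ntuples per job: $JOB_SIZE"   # 100 ntuples / 5 jobs = 20 per job
```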

How It Works

You can pass the --split-by N or --split-into N options to treemaker-config to set the splitting options in a configuration file. For more information about config files, see that section.
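For example, an invocation might look something like this. The --split-into and --split-by flags are the ones described above; the config-file argument is a placeholder, and the exact treemaker-config syntax will depend on your setup:

```shell
# Either cap the job size...
treemaker-config --split-into 36 my_analysis
# ...or fix the number of jobs (but never both; that is an error):
treemaker-config --split-by 5 my_analysis
```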

Note that the two options are mutually exclusive: if you try to specify both, an error will be thrown.

To run your jobs by hand with job splitting, you need to pass the --split-index I parameter to treemaker, asking for the Ith job to be executed. So --split-index 0 runs the first job, and so on.
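Running every job by hand then amounts to a loop over the split indices. A sketch, where the job count and config-file argument are hypothetical and echo is used as a dry run:

```shell
# Suppose the split produced 3 jobs (hypothetical count).
NUM_JOBS=3

# Dry run: print each command. Drop the "echo" to actually execute.
for i in $(seq 0 $(( NUM_JOBS - 1 ))); do
    echo treemaker --split-index "$i" my_analysis
done
```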

If you're using the prototype condor integration, though, all you need to do is run:

treemaker-condor -fr _name_of_config_

And all the jobs will automatically be created and submitted.