Using Splitter - TC01/Treemaker GitHub Wiki

Using Splitter

Automatic job splitting and the ability to run jobs over remote datasets (without using CRAB or other tools directly) are two of Treemaker's most useful features. The Splitter module was written to let users take advantage of these features for tasks that don't necessarily involve running over ntuples and producing trees. While Splitter was introduced as the main feature of Treemaker v1.3 so that a different treemaker could be run efficiently, any job that processes large numbers of ntuples or ROOT TTrees is potentially a candidate for Splitter.

Unlike Treemaker, Splitter is designed to be as generic as possible, and to not require a rewrite of your analysis code. When using Splitter, you simply write a Python configuration file that knows how to run your job over a given set of files. Then you can use two command-line tools, run-split-job and multisplit, to actually run the job.

Configuration Files

Splitter "configuration files" (really Python modules) are of the following form:

def run(files, name_append=""):
    # The function that actually runs your job over the "files" parameter.
    # ("name_root" here stands for whatever base name your job uses for its
    # output; appending name_append keeps output from split jobs distinct.)
    name = name_root + name_append

def getFiles():
    # The function to retrieve a list of files, in any format Treemaker can understand.

def getName():
    # The function to return the name of your job, for internal Treemaker usage.

The "name_append" parameter to the run function will usually be "Index#", where # is the job ID, but it may be the empty string if the user does not request any job splitting.
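Putting these pieces together, a complete configuration might look like the following sketch. The module name my_analysis, the base name "zprime_electron", and the xrootd paths are all illustrative placeholders, not part of Splitter itself:

```python
# Hypothetical Splitter configuration file. "my_analysis" is a stand-in
# for your own analysis module; the file list and name are examples only.

name_root = "zprime_electron"

def getFiles():
    # Return the list of input files, in any format Treemaker understands.
    return [
        "root://cmsxrootd.fnal.gov//store/user/example/ntuple_1.root",
        "root://cmsxrootd.fnal.gov//store/user/example/ntuple_2.root",
    ]

def getName():
    # The job name Treemaker uses internally (e.g. for output naming).
    return name_root

def run(files, name_append=""):
    # Import lazily so that merely loading this configuration stays cheap.
    import my_analysis
    name = name_root + name_append
    my_analysis.process(files, output_name=name)
```

Because all the real work lives in my_analysis, this file stays short and is easy to copy for new datasets.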

These "configuration files" are designed to be as simple as possible. The only function that should need to do any real work is run; the idea is that your configuration imports the script that actually runs your jobs, then calls a function in that script, passing it the list of files to run over. In my experience, depending on how you originally wrote that script, you will likely need to modify it in two ways for it to be usable as part of a Splitter job:

  1. Wrap all your top-level code in an if __name__ == '__main__': block. This ensures the code runs only when the script is executed directly, not when it's imported as a module by another script. (This is common practice in Python software, but is not always done in scientific scripts.)

  2. Define a function that takes at least one argument, a list of files to run over, and have that function be the thing that actually starts your jobs.

Once you've modified your script as above, you should be able to import it from your Splitter configuration file and invoke it from the run function.

(Note that you could instead pass the list of files as a command-line argument, or do something more esoteric. This is simply the recommended way to run your jobs.)
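The two modifications above can be sketched as follows. The function name process and its signature are illustrative; the real work would happen where the placeholder loop is:

```python
# Sketch of an analysis script restructured for use with Splitter.
# "process" is a hypothetical name; use whatever fits your analysis.

def process(files, output_name="output"):
    """Run the analysis over 'files'; called from a Splitter config's run()."""
    for f in files:
        # ... open the file and fill your histograms/trees here ...
        pass
    # Returning a count is just an example of reporting what was done.
    return len(files)

if __name__ == '__main__':
    # Only runs when the script is executed directly, not when it is
    # imported by a Splitter configuration file.
    import sys
    process(sys.argv[1:])
```

With this structure, a Splitter configuration can simply import the module and call process(files) from its run function.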

Example

An example Splitter configuration file is available in Splitter/data/electron_2015C.py. It was designed to work with a Treemaker used in a Z' to tT' analysis; the entire treemaker is not included, just the script that runs it, modified as discussed above.

Using run-split-job

Once you have a configuration file, you can use the run-split-job command to run jobs. It is invoked as follows:

run-split-job (--split-into N | --split-by N) [--split-index i] ./configuration.py

Where "./configuration.py" is the path to your Splitter config file, written as described above.

The job-splitting arguments work exactly as they do for Treemaker itself; please consult that documentation for the details.
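For instance (the configuration path and numbers are illustrative), you might run a single split job like this:

```shell
# Split the input files into 10 jobs and run only the job with index 3
run-split-job --split-into 10 --split-index 3 ./configuration.py

# Alternatively: give each job 5 files apiece, and run the first job
run-split-job --split-by 5 --split-index 0 ./configuration.py
```

Omitting --split-index runs the job without splitting, in which case run is called with an empty name_append.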

Multiprocessing split jobs

The main disadvantage of run-split-job compared to Treemaker proper is that it does no parallel processing by default. The multisplit tool was therefore written to allow parallel processing of the split jobs. It is invoked as:

multisplit (--split-into N | --split-by N) ./configuration.py

This lets you run all the split jobs in parallel. Note that not all jobs will run simultaneously, just as is the case with Treemaker: you will be bottlenecked by the number of cores on the machine you are using for Splitter. However, they will all eventually run without the user having to start them manually.

Advanced multisplit usage

Sometimes running the entire job with multisplit will simply take too long (especially when using the XRootD integration). To allow more fine-grained control over which jobs actually run, two options were added to multisplit that are not present in normal Treemaker job splitting:

  • --start-at X says "start running at the Xth split job and run until the end".

  • --stop-at Y says "run until the Yth split job".

These can, of course, be mixed to run from X until Y.

As an example of why you might want this functionality: suppose you want to run 2000 jobs, but know (perhaps from bitter experience) that it will take multisplit 24 hours to do so. You can run multisplit --start-at 0 --stop-at 500, multisplit --start-at 500 --stop-at 1000, and so on, on four different machines, to do the entire processing in six hours.
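Spelled out in full, that four-machine scheme might look like the following (assuming the job was split with --split-into 2000; the configuration path is illustrative):

```shell
# On machine 1:
multisplit --split-into 2000 --start-at 0    --stop-at 500  ./configuration.py
# On machine 2:
multisplit --split-into 2000 --start-at 500  --stop-at 1000 ./configuration.py
# On machine 3:
multisplit --split-into 2000 --start-at 1000 --stop-at 1500 ./configuration.py
# On machine 4:
multisplit --split-into 2000 --start-at 1500 --stop-at 2000 ./configuration.py
```

Each machine processes a disjoint quarter of the split jobs, so the four ranges together cover all 2000.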

Limitations

  • Unlike Treemaker, there is currently no Condor integration. (But you can submit jobs to Condor manually, if you want.)

  • The output files of multisplit jobs are not automatically hadd'd together, so this must be done manually.
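Merging the outputs by hand is a one-liner with ROOT's standard hadd tool. The output file names below are an assumption based on the "Index#" naming convention described above; adjust the glob to match what your run function actually writes:

```shell
# Merge all per-job outputs into a single file (pattern is illustrative)
hadd combined.root output_Index*.root
```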