Condor integration - TC01/Treemaker GitHub Wiki

Condor Integration

Treemaker v1.0 ships with prototype condor integration. As mentioned in the README, Treemaker was written for and tested on the cmslpc cluster at Fermilab, which uses the HTCondor batch system.

The advantage of the Condor integration is this: if you want to split your job into N pieces, Condor can run all N jobs at once, without you having to manually pass --split-index 0, --split-index 1, and so on to successive treemaker commands. In theory, this is great for automation.
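To make the manual alternative concrete, here is a rough sketch of what you would otherwise do by hand. The config file name and the split count of 4 are illustrative, and the echo makes this a dry run; only the --split-index flag comes from this wiki, so treemaker's exact invocation may differ.

```shell
# Without the Condor integration, each split index is launched by hand,
# e.g. in a loop. "my_config.cfg" and N=4 are illustrative; echo makes
# this a dry run, since treemaker's exact CLI is assumed here.
for i in 0 1 2 3; do
    echo treemaker --split-index "$i" my_config.cfg
done
```

The Condor integration generates and submits the equivalent N jobs for you.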

However, I have run into a number of drawbacks trying to do this that I have not yet been able to resolve (see "Drawbacks" below), and thus I can only half-heartedly recommend using this feature at the moment.

Using treemaker-condor

Using treemaker-condor is quite simple. Run the following command to create a condor job. It is assumed you are doing this on cmslpc, or on some other environment with HTCondor installed.

treemaker-condor -f _name_of_config_file

To submit it, you can then run the following (or pass the -r option to treemaker-condor to auto-submit).

cd _job_directory_/
condor_submit _job_name_

By default, the values of job_directory and job_name will both be the name of the Treemaker config file with the trailing ".cfg" removed.
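As a quick illustration of that default, the name is derived by stripping the ".cfg" suffix; "zprime.cfg" below is a hypothetical config file name, not one Treemaker ships with.

```shell
# Deriving the default job directory and job name from the config file name.
# "zprime.cfg" is a hypothetical example.
cfg="zprime.cfg"
job="${cfg%.cfg}"   # shell parameter expansion: remove a trailing ".cfg"
echo "$job"         # -> zprime (so: directory zprime/, job name zprime)
```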

Cleanup

A cleanup script, hadd_output.py, is automatically copied into every treemaker-condor working directory. After you have verified that all the jobs ran successfully, and resubmitted and reran any that did not (see "Drawbacks" below for more information), run it from inside the working directory: python hadd_output.py

Caution: hadd_output.py was written in a hurry and lacks graceful error handling. It simply determines whether there is one output file or many: if there is only one, it copies it into the working directory; if there are many, it hadds them together. After the copy or hadd, however, it does remove the output files and directories it worked from.
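The logic just described can be sketched as a few lines of shell. This is only an illustration of the behavior, not hadd_output.py itself (which is Python); the "output/" directory name and the .root glob are assumptions about the job layout, and the first two lines are demo setup.

```shell
# Illustrative equivalent of hadd_output.py's behavior (assumptions:
# outputs live in output/ and match *.root). Demo setup:
mkdir -p output && : > output/tree.root

set -- output/*.root              # collect the output files
if [ "$#" -eq 1 ]; then
    cp "$1" .                     # one file: just copy it up
else
    hadd merged.root "$@"         # many files: merge with ROOT's hadd
fi
rm -rf output                     # afterwards, remove what was used
```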

Improvements in this area will likely come in the next Treemaker release.

Drawbacks

Unfortunately, after using it quite extensively in my analysis, I've found it doesn't work nearly as well as I would like. Treemaker jobs running under Condor tend to take longer to finish, and occasionally they "get stuck": they enter a strange state and never return, or they return but produce no output files. This happens rarely (on a bad day, perhaps 1 in 20 jobs fails) and seems more common the larger the dataset.

Examining the logs from jobs that do return suggests that occasional EOS read timeouts are the cause. Frequently, resubmitting the failed jobs (by editing the condor configuration file) solves the issue. If it does not, run those jobs manually.
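One quick way to find which jobs hit such a timeout is to grep the per-job logs. The *.stderr naming and the "timeout" message text below are assumptions about your job directory layout, not something Treemaker documents; the first two lines are demo setup.

```shell
# Scan per-job stderr logs for EOS read timeouts to see which jobs need
# resubmitting (log naming and message text are assumptions). Demo setup:
: > job_0.stderr                                   # a clean job's log
printf 'error: EOS read timeout\n' > job_3.stderr  # a failed job's log

grep -li "timeout" ./*.stderr                      # prints ./job_3.stderr
```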

I speculate that it has something to do with the multiprocessing nature of Treemaker.

There is not yet an automated tool to resubmit the failed jobs.

Future

I would love to make this (or some other sort of batch system integration) much more polished in the future. It would be ironic if Treemaker ended up shipping CRAB integration for people with grid certificates, wouldn't it? :)