HTCondor - gipert/ECM GitHub Wiki

An HTCondor batch system is configured across the slave nodes. elastiq installs and configures HTCondor on each slave node as specified in the /etc/elastiq.conf file (for further details see here). The configuration files can be found under /etc/condor/.

Docker support

HTCondor provides a universe with Docker support, the docker universe, which allows jobs to run on the slave nodes inside a Docker container. General instructions on how to write a submit description file for the docker universe can be found in the official documentation.

One note in addition to the official manual pages: as specified there, a Docker container is 'read-only' and provides an isolated filesystem in which a job is executed, so any file written inside the container is lost on exit. The official docs suggest using the should_transfer_files and when_to_transfer_output directives. A more efficient alternative, especially when dealing with heavy output files (e.g. from Monte Carlo simulations), is to write them directly to an external volume mounted in the container. HTCondor allows the user to specify a volume to be mounted automatically in the containers on the slave nodes by setting a few more variables, as described here.

Currently the /common folder (shared over the network by the nfs-server instance nfs+docker-server and mounted on each slave during the boot sequence) is automatically mounted by HTCondor in the container started for each submitted job, thanks to these lines in /etc/condor/condor_config.local:

DOCKER_VOLUMES = COMMON
DOCKER_VOLUME_DIR_COMMON = /common:/common
DOCKER_MOUNT_VOLUMES = COMMON

To add another folder (already present on the slave, see how to add a shared folder) to be mounted on the containers you can change the lines above as follows:

DOCKER_VOLUMES = COMMON, <NEW_DIR>
DOCKER_VOLUME_DIR_COMMON = /common:/common
DOCKER_VOLUME_DIR_<NEW_DIR> = <src_path>:<dest_path>
DOCKER_MOUNT_VOLUMES = COMMON, <NEW_DIR>

Warning: the config file must be modified on each slave (including those that will be spawned in the future!). To accomplish this you must encode the modifications inside the elastiq config file.
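Since the same three kinds of lines have to be kept in sync for every volume, it can help to generate them rather than edit them by hand. The following is a minimal sketch, not part of elastiq or HTCondor: the function name `render_docker_volumes` and the `DATA` volume with its `/data` paths are hypothetical and only illustrate the pattern shown above.

```python
def render_docker_volumes(volumes):
    """Render the condor_config.local lines for a set of Docker volumes.

    `volumes` maps a volume name (e.g. "COMMON") to a (src_path, dest_path)
    tuple. Insertion order of the dict is preserved in the output.
    """
    names = ", ".join(volumes)
    lines = [f"DOCKER_VOLUMES = {names}"]
    for name, (src, dest) in volumes.items():
        # One DOCKER_VOLUME_DIR_<NAME> line per volume, as in the text above
        lines.append(f"DOCKER_VOLUME_DIR_{name} = {src}:{dest}")
    lines.append(f"DOCKER_MOUNT_VOLUMES = {names}")
    return "\n".join(lines)


# "DATA" and "/data" are hypothetical placeholders for <NEW_DIR> and its paths
print(render_docker_volumes({
    "COMMON": ("/common", "/common"),
    "DATA": ("/data", "/data"),
}))
```

The printed lines would still have to be placed in the elastiq config file so that newly spawned slaves pick them up.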

HTCondor submit file example to run MaGe jobs:

universe     = docker
docker_image = 10.64.28.50:5000/gerda-sw
executable   = MaGe
arguments    = $(filename)
log          = /common/log/$Fn(filename).log
output       = /common/log/$Fn(filename).out
error        = /common/log/$Fn(filename).err

Queue filename matching files /common/macros/*.mac

This simple submit file uses the gerda-sw image available in the local Docker hub to run the MaGe executable on the macro file specified with the arguments directive. HTCondor looks for all the files with the .mac extension in /common/macros/ (/common is the NFS-mounted folder, which is in turn mounted in the container), assigns each one to the filename variable and creates a job for each of them. It also saves the log files in /common/log/, named after the macro file without its directory and extension. Learn how to write more complex submit files in the official manual pages.
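To make the per-job expansion concrete, the sketch below mimics in Python what the submit file does for each matched macro. It is only an illustration, not how HTCondor is implemented: the function name `expand_jobs` is hypothetical, `Path.stem` is used as a rough stand-in for `$Fn(filename)` on simple names such as `run1.mac`, and the sorting is there only to make the output deterministic.

```python
from pathlib import PurePosixPath


def expand_jobs(macro_paths, log_dir="/common/log"):
    """Return one dict per .mac file with the values each job would get."""
    jobs = []
    for path in sorted(macro_paths):
        p = PurePosixPath(path)
        if p.suffix != ".mac":
            continue  # the queue statement only matches *.mac
        stem = p.stem  # name without directory and extension
        jobs.append({
            "arguments": str(p),               # $(filename)
            "log": f"{log_dir}/{stem}.log",    # $Fn(filename).log
            "output": f"{log_dir}/{stem}.out",
            "error": f"{log_dir}/{stem}.err",
        })
    return jobs


# Hypothetical macro names, for illustration only
for job in expand_jobs(["/common/macros/run1.mac", "/common/macros/run2.mac"]):
    print(job)
```

Each macro therefore gets its own .log, .out and .err file in /common/log/, so jobs do not overwrite each other's output.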

Known issues

The docker run command does not check the local hub for updates of <image>:<tag>, and currently there is no way to obtain this behaviour other than issuing a docker pull command beforehand (see this GitHub issue). This is a problem when you want to fetch an updated <image>:<tag> from the local Docker hub while an older version with the same tag is already cached on the slave. Unfortunately HTCondor does not provide such a feature either, so the only workaround is to wait for the slave to be re-spawned or to issue a docker pull command manually. Alternatively, one can specify the digest of the image in the submit file, as documented here.
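For the digest alternative, the value of docker_image takes the form `<image>@sha256:<hex>` instead of `<image>:<tag>`. The helper below is a small sketch of how such a reference could be built. The name `pin_image` is hypothetical and the digest in the example is a made-up placeholder; a real one is printed by docker pull in its "Digest:" line.

```python
def pin_image(image_ref, digest):
    """Return image_ref with any tag dropped and the digest appended."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a digest of the form 'sha256:<hex>'")
    # A tag can only appear in the last path component; a ':' earlier in the
    # reference belongs to the registry port (e.g. 10.64.28.50:5000).
    head, sep, last = image_ref.rpartition("/")
    # Drop an existing digest first, then an existing tag
    name = last.split("@", 1)[0].split(":", 1)[0]
    return f"{head}{sep}{name}@{digest}"


# Placeholder digest, for illustration only
placeholder = "sha256:" + "0" * 64
print("docker_image = " + pin_image("10.64.28.50:5000/gerda-sw:latest", placeholder))
```

Because a digest identifies one specific image, a slave with an older cached copy under the same tag cannot silently run it in place of the updated one.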
