HTCondor - gipert/ECM GitHub Wiki
An HTCondor batch system is configured across the slave nodes. HTCondor is installed and configured on each slave node by elastiq, as specified in the /etc/elastiq.conf file (for further details see here). The configuration files can be found under /etc/condor/.
HTCondor provides a universe with Docker support, the docker universe, which allows jobs to run on the slave nodes inside a Docker container. General instructions on how to write a submit description file for the docker universe can be found in the official documentation.
A note to complement the official manual pages: as specified there, a Docker container is 'read-only' and provides an isolated filesystem in which a job is executed, so any file written inside the container is lost on exit. The official docs suggest using the should_transfer_files and when_to_transfer_output directives. A more efficient alternative, especially for large output files (e.g. from Monte Carlo simulations), is to write them directly to an external volume mounted in the container. HTCondor allows the user to specify a volume to be mounted automatically in the containers on the slave nodes by setting a few more variables, as described here.
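For comparison, the file-transfer approach suggested by the official docs could look like the fragment below. It is only a sketch of the two directives named above, to be added to a docker universe submit file. The output file name in the commented line is hypothetical and not taken from this setup.

```
# Sketch only: copy files created in the job's scratch directory
# back to the submit machine when the job exits.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Optionally restrict the transfer to specific files
# (hypothetical file name, shown for illustration):
# transfer_output_files = output.root
```

With large outputs every file is copied over the network at job exit, which is why writing straight to a mounted volume is usually preferable here.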
Currently the /common folder (shared over the network by the NFS server instance nfs+docker-server and mounted on each slave during the boot sequence) is automatically mounted by HTCondor in the container (invoked by the submitted job) thanks to the following lines in /etc/condor/condor_config.local:
DOCKER_VOLUMES = COMMON
DOCKER_VOLUME_DIR_COMMON = /common:/common
DOCKER_MOUNT_VOLUMES = COMMON
To mount another folder in the containers (it must already be present on the slave, see how to add a shared folder), change the lines above as follows:
DOCKER_VOLUMES = COMMON, <NEW_DIR>
DOCKER_VOLUME_DIR_COMMON = /common:/common
DOCKER_VOLUME_DIR_<NEW_DIR> = <src_path>:<dest_path>
DOCKER_MOUNT_VOLUMES = COMMON, <NEW_DIR>
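As a concrete instance of the template above, the following sketch mounts a second folder. The volume name DATA and the path /data are hypothetical and stand in for <NEW_DIR>, <src_path> and <dest_path>.

```
# Sketch only: DATA and /data are illustrative names, not part of this setup.
DOCKER_VOLUMES = COMMON, DATA
DOCKER_VOLUME_DIR_COMMON = /common:/common
DOCKER_VOLUME_DIR_DATA = /data:/data
DOCKER_MOUNT_VOLUMES = COMMON, DATA
```

A running HTCondor daemon will most likely need to re-read its configuration (for example with condor_reconfig) before the new volume is picked up.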
Warning: the config file must be modified on each slave (including those that will be spawned in the future!). To accomplish this, you must encode the modifications inside the elastiq config file.
HTCondor submit file example to run MaGe jobs:
universe = docker
docker_image = 10.64.28.50:5000/gerda-sw
executable = MaGe
arguments = $(filename)
log = /common/log/$Fn(filename).log
output = /common/log/$Fn(filename).out
error = /common/log/$Fn(filename).err
Queue filename matching files /common/macros/*.mac
This simple script uses the gerda-sw image available in the local Docker hub to run the MaGe executable on the macro file passed through the arguments directive. HTCondor looks for all files with the .mac extension in /common/macros/ (/common is the NFS-mounted folder, which is in turn mounted in the container), assigns each name to the filename variable, and creates one job per file. It also saves the log files in /common/log/, named after the macro file without its extension. Learn how to write more complex submit files in the official manual pages.
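To make the substitution concrete, this is roughly what a single job looks like after expansion, assuming a hypothetical macro file /common/macros/run001.mac exists. The $Fn(filename) macro keeps only the file name, without directory and extension.

```
# Sketch only: effective values for one job, given the hypothetical
# match /common/macros/run001.mac
arguments = /common/macros/run001.mac
log       = /common/log/run001.log
output    = /common/log/run001.out
error     = /common/log/run001.err
```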
The docker run command does not check the local hub for updates of <image>:<tag>, and there is currently no way to obtain this behaviour other than issuing a docker pull command beforehand (see this github issue). This is a problem if you want to fetch an updated <image>:<tag> from the local Docker hub while an older version with the same tag is already cached on the slave. Unfortunately HTCondor does not provide such a feature either, so the only workaround is to wait for the slave to be re-spawned or to issue a docker pull command manually. Alternatively, one can specify the digest of an image in the submit file, as documented here.
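Pinning by digest might look like the fragment below. This is a sketch that assumes the standard Docker <image>@sha256:<digest> reference form is accepted by docker_image. The <digest> part is a placeholder to be replaced with the actual value, which can be listed on a machine that has pulled the image with docker images --digests.

```
universe     = docker
# <digest> is a placeholder for the sha256 digest of the desired image version
docker_image = 10.64.28.50:5000/gerda-sw@sha256:<digest>
```

Because a digest identifies exactly one image version, a slave holding an older cached copy under the same tag cannot satisfy the request and has to fetch the pinned version.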