Investigate feasibility of creating Cleanup jobs from a central microservice

The issue (https://github.com/dmwm/WMCore/issues/10345)

We are considering implementing a new microservice, to be hosted in CMSWEB, which will be responsible for fetching a dump of all the unmerged files in a given RSE (from the Consistency Checking tool), figuring out which files are no longer needed in the system, and providing those unneeded files as input to cleanup jobs submitted to one of our production schedds. Those cleanup jobs should be very similar (if not identical) to our standard production Cleanup jobs, thus running locally on the site CPU resources and deleting files "locally".

Proposed solution

This is meant to be a somewhat high-level investigation of how feasible it would be to reuse the WMCore/WMSpec/Steps package to create the Cleanup jobs. We also have to explore how we could use the WMCore/Storage/Backends package to pick the correct tool for the file deletions. A third point of investigation is the job post-processing, such that we can identify which jobs succeeded or failed, and whether files were actually deleted or not. One could argue that, as a first alpha product, we trust that a fraction of the files will be deleted and there is no need to know how many jobs succeed and how many files are successfully deleted.

The solution will likely be in the form of a document explaining how to plug all these pieces together, what is required, etc. There is no need for any implementation at this stage.

Investigation

In order to create a microservice such as this, creating cleanup jobs that run at the sites themselves, we need to be able to:

  • Submit jobs from the Micro Service node/pod to a remote production schedd
  • Create the cleanup jobs, with all dependencies needed
  • Read the site local config and figure out the right mechanism to delete the files "locally"
  • Check if files were indeed deleted or not (post job processing)

Submitting from a Micro Service to a remote production schedd

Short version: It can be done. Details to follow:

Authentication mechanism between service and production schedd:

  • This can be done at the host level, by including the proper subnet as part of the following configuration knobs (a sketch is shown below):
ALLOW_DAEMON
ALLOW_NEGOTIATOR

or possibly by using token authentication (see: https://agenda.hep.wisc.edu/event/1579/contributions/23053/attachments/7870/8965/BockelmanHTCondorWeek2021.mp4)
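
As an illustration, the host-based approach could look like the schedd configuration fragment below; the subnet 188.184.* is purely a placeholder for wherever the microservice pods would run:

# Hypothetical schedd configuration sketch; the subnet below is a placeholder
# for the network hosting the microservice node/pod
ALLOW_DAEMON = $(ALLOW_DAEMON), 188.184.*
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 188.184.*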

Python Submission example

The following example submits a job from an external machine to login-el7.uscms.org; the same approach can be used to submit to a production schedd:

  1. From the schedd, query the MyAddress classad:
$ condor_status -schedd -af MyAddress
<192.170.227.182:9618?addrs=192.170.227.182-9618&alias=login-el7.uscms.org&noUDP&sock=schedd_1934190_0a19>
  2. From the external client (the micro service), submit a job to the remote schedd:
#!/usr/bin/env python

# Requires: a valid VO CMS proxy certificate (in this example: /tmp/x509up_$(id -u))

import os
import htcondor
import classad

import logging


def submit(schedd, sub):
    """Submit condor job to local schedd.
    :param schedd: The local
    """
    try:
        with schedd.transaction() as txn:
            clusterid = sub.queue(txn)
    except Exception as e:
        logging.debug("Error submission: {0}".format(e))
        raise e

    return clusterid


schedd_ad = classad.ClassAd()
schedd_ad["MyAddress" ] = "<192.170.227.182:9618?addrs=192.170.227.182-9618&alias=login-el7.uscms.org&noUDP&sock=schedd_1934190_0a19>"
schedd =  htcondor.Schedd(schedd_ad)

sub = htcondor.Submit()
sub['executable'] = '/tmp/hello.sh'
sub['Output'] = '/tmp/result.$(Cluster)-$(Process).out'
sub['Error'] = '/tmp/result.$(Cluster)-$(Process).err'
sub['My.x509userproxy'] = classad.quote('/tmp/x509up_%s' % os.getuid())

clusterid = submit(schedd, sub)

Note that the remote submission assumes paths are available on the remote schedd (e.g. /tmp/hello.sh is expected to exist on the remote schedd).

Note: if we want these jobs to run at a particular site, we likely need to replicate some of the classads we use in the condor plugin, especially DESIRED_Sites, as in the sketch below.
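
A minimal, hypothetical sketch of how this could be attached to the Submit object above; the attribute name and the site are illustrative only, and the authoritative set of classads is whatever SimpleCondorPlugin writes:

# Hypothetical sketch: steer the cleanup job to a given site; the site name is
# illustrative and the full set of classads should be copied from SimpleCondorPlugin
sub['My.DESIRED_Sites'] = classad.quote('T2_CH_CERN')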

Create the cleanup jobs, with all dependencies needed

One goal while creating the cleanup job would be to reuse, as much as possible, the logic and code of the delete step executor.

Since this would be a microservice, we would need to be able to use the WMCore framework from it in order to create the sandbox and a job package.

Job packages and sandboxes can be created following examples 1 (https://github.com/dmwm/WMCore/blob/1bb206d606c4acc09c29a19c08a057a51de2c235/test/python/WMCore_t/WMRuntime_t/SandboxCreator_t.py) and 2 (https://github.com/dmwm/WMCore/blob/1bb206d606c4acc09c29a19c08a057a51de2c235/test/python/WMCore_t/DataStructs_t/JobPackage_t.py).
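
A rough sketch of what this could look like, based on the unit tests referenced above; the class and method names are taken from those tests and should be double-checked against the WMCore release in use, and the workload, job and paths are assumed inputs:

# Rough sketch based on SandboxCreator_t.py and JobPackage_t.py; signatures
# should be verified against the current WMCore code
from WMCore.WMRuntime.SandboxCreator import SandboxCreator
from WMCore.DataStructs.JobPackage import JobPackage

# workload: a WMWorkload helper describing the cleanup task (assumed to exist)
creator = SandboxCreator()
sandboxPath = creator.makeSandbox("/tmp/cleanup-work-area", workload)

# job: a WMCore job object for the cleanup work (assumed to exist)
package = JobPackage()
package[1] = job
package.save("/tmp/cleanup-work-area/JobPackage.pkl")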

WMCore could be added to CVMFS in order to resolve this dependency in the microservice.

How are executors executed?

Here is the general procedure followed by jobs at runtime:

  1. We start with a working area containing a sandbox, a job package and a job index (transferred via condor by the SimpleCondorPlugin)
  2. Condor executes a bash script that unpacks the job
  3. The startup script is invoked
  4. The startup script loads the job definition and runs the executor (see the sketch after this list)
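
Conceptually, the last step boils down to something like the sketch below. This is not the literal WMCore runtime code (which lives in the startup scripts and ExecuteMaster), so the calls shown are an approximation; step and job are assumed to come from the unpacked sandbox and job package:

# Conceptual sketch only: the real logic lives in WMCore's runtime code,
# names below are an approximation of what ExecuteMaster does
from WMCore.WMSpec.Steps import StepFactory

# step and job are assumed to be loaded from the sandbox and the job package
executor = StepFactory.getStepExecutor("DeleteFiles")
executor.initialise(step, job)
executor.execute()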

Read the site local config and figure out the right mechanism to delete the files "locally"

Reusing the delete files executor above would take care of the technical details of figuring out how to delete a file with the proper commands for the site (via the DeleteMgr). However, if we want to avoid creating a job package and building a sandbox from the microservice, an alternative would be to create a job that directly imports a few pieces of WMCore via CVMFS and uses them for this specific task (deleting files).

A job that gets the WMCore dependency from CVMFS could load the site local config to get the local stage-out command and use it for deleting the files (see the StageOutMgr for an example), using the Storage Registry:

# Loading a site config example

from WMCore.Storage.SiteLocalConfig import loadSiteLocalConfig
from WMCore.Storage.Registry import retrieveStageOutImpl

# Load site config (by default: "$CMS_PATH/SITECONF/local/JobConfig/site-local-config.xml");
# this can be overridden via an environment variable (see overVarName in SiteLocalConfig)
mySiteConfig = loadSiteLocalConfig()
print("Site name = %s" % mySiteConfig.siteName)

# Get local stageout command
cmd = mySiteConfig.localStageOut.get("command", None)
options = mySiteConfig.localStageOut.get("option", None)
print("cmd, options = %s, %s" % (cmd, options))

# Get stage-out implementation using storage backends
impl = retrieveStageOutImpl(cmd)
impl.numRetries = 3
impl.retryPause = 30

# Delete some file given its lfn
lfn = "/some/lfn/path"
tfc = mySiteConfig.trivialFileCatalog()
pfn = tfc.matchLFN(tfc.preferredProtocol, lfn)

try:
    impl.removeFile(pfn)
except Exception as ex:
    print("Failed to delete file: %s", pfn)
    raise ex

Check if files were indeed deleted or not (post job processing)

The DeleteMgr checks whether a file was successfully deleted or not.

We could check at runtime which files were successfully deleted and put that information in a report. However, if the "list of files to be deleted" does not depend on this report being updated, we could simply leave it as-is. A minimal sketch of such a report follows.
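
As an illustration, a minimal runtime report could be produced along these lines, reusing the SiteLocalConfig and Registry pieces from the previous snippet; the input list of LFNs and the report file name are hypothetical:

# Minimal sketch of a deletion report; lfnsToDelete and the report file name
# are hypothetical, the stage-out pieces are the same as in the snippet above
import json

from WMCore.Storage.SiteLocalConfig import loadSiteLocalConfig
from WMCore.Storage.Registry import retrieveStageOutImpl

siteConfig = loadSiteLocalConfig()
tfc = siteConfig.trivialFileCatalog()
impl = retrieveStageOutImpl(siteConfig.localStageOut.get("command", None))

lfnsToDelete = ["/store/unmerged/some/lfn/path.root"]  # provided by the microservice

report = {"deleted": [], "failed": []}
for lfn in lfnsToDelete:
    pfn = tfc.matchLFN(tfc.preferredProtocol, lfn)
    try:
        impl.removeFile(pfn)
        report["deleted"].append(lfn)
    except Exception:
        report["failed"].append(lfn)

# Persist the outcome so it can be inspected after the job finishes
with open("deletion_report.json", "w") as fd:
    json.dump(report, fd, indent=2)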

Conclusion

It is feasible to submit jobs from a microservice that delete files locally at the sites. The HTCondor Python bindings provide the functionality to submit to a remote schedd with the proper authentication mechanism.

For the cleanup job itself, we could borrow as much as possible from the WMCore DeleteFiles executor, but I would lean towards using the SiteLocalConfig and DeleteMgr directly, to avoid building sandboxes and creating a job package for a single job and task; this approach is more straightforward and easier to understand.