Logbook for production issues - dmwm/WMCore GitHub Wiki

This wiki page keeps track of WMCore-related problems affecting the central production infrastructure. It provides a history of such problems and makes them easier to identify after many weeks or months have passed.

Jobs executing where there is no pileup (exit code: 8029 - NoSecondaryFiles)

Problem: as reported in GH issue #9658, the HG2004 CMSWEB production deployment in April fully enabled MSTransferor and MSMonitor; at the same time, automatic input data placement was disabled on the Unified side. As a result, workflows are no longer assigned only to sites (SiteWhitelist) where the data is available. It is therefore quite common to have a large SiteWhitelist while only a couple of sites host the pileup dataset.

UPDATE (24/Jun/2020): In addition to that, we have also noticed that MSTransferor does NOT enforce primary block data placement at the same sites hosting the pileup dataset (eventually causing jobs to have an empty list of secondary files).

Solution: PR #9659 intersects the MCFakeFile location with the secondary locations. Thus, jobs without an input dataset but with a secondary at a later stage only get executed at sites that also host part of the secondary data. This fix is only available starting with the WMAgent 1.3.3 releases.
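The location intersection described above can be illustrated with a minimal sketch. This is not the code from PR #9659; the function name and the site lists are hypothetical and only show the idea of restricting job locations to sites that also host secondary (pileup) data.

```python
def restrict_to_pileup_sites(fake_file_locations, pileup_locations):
    """Return the sites present in both lists, sorted alphabetically.

    Hypothetical helper: it mimics intersecting the MCFakeFile locations
    with the locations of the secondary (pileup) dataset.
    """
    return sorted(set(fake_file_locations) & set(pileup_locations))


# Illustrative values only: a large SiteWhitelist and a small set of pileup hosts
site_whitelist = ["T1_US_FNAL", "T2_CH_CERN", "T2_DE_DESY", "T2_US_MIT"]
pileup_sites = ["T2_CH_CERN", "T1_US_FNAL"]

common_sites = restrict_to_pileup_sites(site_whitelist, pileup_sites)
print(common_sites)
```

With an empty intersection the job has no valid location at all, which is preferable to running at a site without pileup and failing with exit code 8029.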

Impacted dates: Unified was disabled on 8/April/2020, so any WQE acquired between that date and 24/April/2020 could have been affected.

WMAgent setting an unsupported configuration parameter: enforceGUIDInFileName (exit code: 8009 - Configuration)

Problem: the enforceGUIDInFileName parameter is a new feature requested in GH ticket #9468. However, our implementation did not cover all the use cases, and WMAgent was setting that parameter for source modules that did not support it, thus raising a Configuration exception.

Solution: a second PR, #9660, adds an explicit check for the source type. With this, we do not expect any more Configuration problems involving this parameter. This feature/fix is only available starting with the WMAgent 1.3.3 releases.
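The kind of guard described above might look like the following sketch. It is not the code from PR #9660: the function name is hypothetical, the source configuration is modelled as a plain dictionary, and the set of source types that accept the parameter is an assumption made for illustration.

```python
# Assumption for illustration: source types that accept enforceGUIDInFileName.
# The real list is whatever the framework and PR #9660 define.
SOURCES_SUPPORTING_GUID_CHECK = {"PoolSource", "EmbeddedRootSource"}


def apply_enforce_guid(source_type, source_params, enforce=True):
    """Return a copy of source_params, adding enforceGUIDInFileName
    only when the source type is known to support it."""
    params = dict(source_params)
    if enforce and source_type in SOURCES_SUPPORTING_GUID_CHECK:
        params["enforceGUIDInFileName"] = True
    return params


# A supported source gets the parameter; an unsupported one is left untouched
print(apply_enforce_guid("PoolSource", {"fileNames": []}))
print(apply_enforce_guid("EmptySource", {}))
```

Checking the type before setting the parameter avoids the Configuration exception (exit code 8009) for sources that do not declare it.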

Impacted dates: It affects any workflow whose sandbox was created between 22/April/2020 and 24/April/2020. If an agent keeps pulling WQEs for the same workflow over a long period, it will likely keep hitting the same problem, since existing sandboxes have not been patched.

CMSWEB central services running in the VM-based and Kubernetes infrastructures in parallel (workqueue duplication)

Problem: during the commissioning of the Kubernetes infrastructure for CMSWEB services, single-instance services like Global Workqueue, ReqMgr2MS and some cherrypy threads (ReqMgr2/WMStats) got deployed and enabled in the k8s production infrastructure, pointing to the production CouchDB database. This means that many workflows could potentially have been acquired by 2 instances of the Global Workqueue, which can result in duplicate workqueue elements and therefore higher statistics/events/lumis in the output datasets. We have also seen a higher value for the estimated number of lumis and events (from the ReqMgr2 document).

Solution: at this stage, there is nothing that can be done but to deal with those problematic workflows as we find them.

Impacted dates: Those services were running in the k8s-prod infrastructure between July 14-17, and from July 28 to Aug 1st, 2020. Here is a list of 1830 unique workflows that could possibly have been affected: workflows-dup-k8s-prod