ReqMgr2 MicroService Monitor - dmwm/WMCore GitHub Wiki

This MicroService monitors the input data placement made by the Transferor MicroService. Depending on the transfer status and its completion, it moves the request to the staged status, which then allows Global WorkQueue to fetch the request and proceed with creating the chunks of work and workqueue elements.

This module is currently set up as a thread of the Unified/Transferor MicroService. In the future we could move it to its own process to increase the overall performance of the service.

The tasks that this Monitor MS has to perform are, in order (this is not yet an exhaustive or detailed description):

  1. fetch all the workflows in the staging status in ReqMgr2
  2. fetch all the workflow transfer documents (in bulk) from ReqMgr AuxDB
     • filter the transfer documents according to the list of workflows in the staging status
     • filter the remaining documents once again according to the last time they were looked at (the timestamp/lastUpdate value). If a document was last updated less than X hours ago (let's call it 6h for now), skip it; otherwise add it to the list of transfers to be updated.
  3. fetch all the campaign documents (in bulk) from ReqMgr AuxDB
  4. using the transfer ids available in the transfer document (under the transfers key), make calls to PhEDEx/Rucio to get the status of those transfers; then calculate the transfer completion. There are a few possible cases here:
     • IF there are no transfers in the workflow transfer doc (thus no input data at all), move the workflow to staged
     • IF ALL transfer types (primary, secondary, etc.) are above the minimum completion thresholds defined in the campaign, then a) update the transfer doc and b) update the request status to staged
     • IF NOT ALL transfers are above the minimum completion thresholds, update the transfer doc with the new completion value and the lastUpdate value, then proceed to the next workflow.
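
The filtering in step 2 and the decision in step 4 can be sketched as below. This is a minimal illustration and not the actual MicroService code: the field names `workflowName`, `dataType` and `completion`, both function names, and the shape of `min_thresholds` are assumptions made for the example (only the `transfers` key and the `lastUpdate` value come from the description above). It also assumes the completion of each transfer has already been computed from the PhEDEx/Rucio replies.

```python
import time

# The "X hours" window from step 2; 6h is only the provisional value.
SKIP_WINDOW_SECS = 6 * 3600


def select_transfer_docs(staging_workflows, transfer_docs, now=None):
    """Keep docs of staging workflows that were not checked within the window.

    Assumes each doc is a dict with the hypothetical keys "workflowName"
    and "lastUpdate" (epoch seconds).
    """
    now = time.time() if now is None else now
    staging = set(staging_workflows)
    selected = []
    for doc in transfer_docs:
        if doc["workflowName"] not in staging:
            continue  # workflow is no longer in the staging status
        if now - doc["lastUpdate"] < SKIP_WINDOW_SECS:
            continue  # looked at too recently, skip in this cycle
        selected.append(doc)
    return selected


def is_fully_staged(transfer_doc, min_thresholds):
    """Return True if every transfer meets the threshold for its data type.

    min_thresholds is a hypothetical mapping such as
    {"primary": 90.0, "secondary": 100.0} taken from the campaign.
    A doc without transfers has no input data and counts as staged.
    """
    for transfer in transfer_doc.get("transfers", []):
        if transfer["completion"] < min_thresholds[transfer["dataType"]]:
            return False
    return True


docs = [
    {"workflowName": "wf_a", "lastUpdate": 0, "transfers": []},
    {"workflowName": "wf_b", "lastUpdate": 0, "transfers": [
        {"dataType": "primary", "completion": 95.0},
        {"dataType": "secondary", "completion": 80.0}]},
    {"workflowName": "wf_c", "lastUpdate": 99000, "transfers": []},
    {"workflowName": "wf_d", "lastUpdate": 0, "transfers": []},
]
thresholds = {"primary": 90.0, "secondary": 100.0}
to_check = select_transfer_docs(["wf_a", "wf_b", "wf_c"], docs, now=100000)
staged = [d["workflowName"] for d in to_check if is_fully_staged(d, thresholds)]
```

In this example `wf_c` is skipped because it was checked recently and `wf_d` because it is not in staging. Of the remaining two, only `wf_a` (no input data) would be moved to staged, while `wf_b` would get its transfer doc updated and wait for the next cycle.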

This algorithm is supposed to run every 15 minutes or so. We might also want to limit the number of workflows considered in every cycle.
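
One possible way to cap the work done per cycle is sketched below. The policy shown (least recently updated documents first) is an assumption for illustration and is not something decided above; `pick_for_cycle` and `max_per_cycle` are hypothetical names.

```python
def pick_for_cycle(transfer_docs, max_per_cycle):
    """Return at most max_per_cycle docs, least recently updated first."""
    oldest_first = sorted(transfer_docs, key=lambda doc: doc["lastUpdate"])
    return oldest_first[:max_per_cycle]


docs = [
    {"workflowName": "wf_a", "lastUpdate": 300},
    {"workflowName": "wf_b", "lastUpdate": 100},
    {"workflowName": "wf_c", "lastUpdate": 200},
]
batch = pick_for_cycle(docs, max_per_cycle=2)
```

Ordering by `lastUpdate` means documents left out of one cycle move to the front of the queue in later cycles, so no workflow is starved by the cap.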

Open questions

Do we want to monitor the subscriptions and act upon issues and/or stuck transfers? Or do we just assume that transfers will eventually succeed? Alerts also have to be created for bad input placement (bad transfers).