More than the expected number of Resque workers are running (too many resque workers / worker count high)

aka the "feature-worker-count: FAILED TOO MANY WORKERS" okcomputer check and Nagios alert.
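For reference, the general shape of such a check is sketched below. This is a hedged illustration, not the actual check defined in preservation_catalog: the class name, the hard-coded expected count, and the registration key are all assumptions.

```ruby
# Hedged sketch of an okcomputer check that flags an over-count of Resque workers.
# Not the actual preservation_catalog implementation: the class name, the hard-coded
# EXPECTED_WORKER_COUNT, and the registration key are illustrative assumptions.
class WorkerCountCheck < OkComputer::Check
  EXPECTED_WORKER_COUNT = 10 # in reality this would be derived from the resque-pool config

  def check
    actual = Resque.workers.size
    if actual > EXPECTED_WORKER_COUNT
      mark_failure
      mark_message "TOO MANY WORKERS: #{actual} registered, expected #{EXPECTED_WORKER_COUNT}"
    else
      mark_message "#{actual} of #{EXPECTED_WORKER_COUNT} expected workers registered"
    end
  end
end

# registered in an initializer, e.g.:
OkComputer::Registry.register "feature-worker-count", WorkerCountCheck.new
```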

a little background on resque-pool's worker management and restart behavior

When restarting (as happens when deploying Preservation Catalog), resque-pool hotswap attempts to wind down all jobs in progress gracefully by issuing a "gentle" kill to the current resque-pool master process (kill -s QUIT). This causes that resque-pool master process to

  • shut down all idle workers in the current pool
  • signal all of its busy workers to exit once they have finished their WIP
  • exit once all of the worker processes managed by that resque-pool instance have exited

Once the shutdown command has been issued to the current resque-pool master process, a new resque-pool instance is started. The expectation is that most workers are idle at any given moment, so few old workers should linger for long, and we should not run into under-resourcing on the VM.
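In other words, the hot-swap restart boils down to signalling the old master and starting a fresh pool alongside it. As a hedged illustration only (the deploy tooling does this for you; the pidfile path below is an assumption, not necessarily where resque-pool writes it in this app):

```ruby
# Illustrative sketch of the "gentle" kill described above; normally the deployment
# tooling does this, not a human. The pidfile path is an assumption for illustration.
old_master_pid = File.read("tmp/pids/resque-pool.pid").to_i

# QUIT asks the old resque-pool master to shut down idle workers, let busy workers
# finish their current jobs, and exit once all of its workers have exited.
Process.kill("QUIT", old_master_pid)

# ...after which the deploy starts a new resque-pool master alongside the old one.
```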

investigating, pruning stale workers, detecting bad pool shutdowns and zombie workers

As such, a long-running job or two from the prior resque-pool instance will frequently still be going for a bit after all of the new pool's workers are up (e.g. a few minutes for a large replication upload, a few hours for a large checksum validation job). So Nagios will often alert about "TOO MANY WORKERS" for a few minutes or hours after a deployment.

If the alert lingers for more than a couple of hours, you can investigate by running ps -ef | grep resque on each of the worker VMs. You should see two resque-pool master processes: only one or a few workers for the old master, and the full complement for the new one. If you see the full complement of workers for the old resque-pool master process, that may indicate that the pool wasn't stopped correctly, and you may have to manually kill the stale workers and pool master from the VM command line (as of 2021, this only tends to happen once or twice a year on prod). Likewise, if top and/or other resource usage diagnostics indicate that workers in an old pool are up but not doing anything (possibly in some zombie state?), you may have to kill those stale workers manually.
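In addition to ps, you can ask Resque itself what it thinks each registered worker is doing. A hedged Rails console sketch (worker registrations render as hostname:pid:queues, so the output cross-references easily against ps on each VM):

```ruby
# List every worker Resque has registered in Redis and what (if anything) it is
# working on. A registration prints as "hostname:pid:queues", which makes it easy to
# cross-check against `ps -ef | grep resque` output on the corresponding VM.
Resque.workers.each do |worker|
  job = worker.job # empty hash when idle, job payload hash when busy
  status = job.empty? ? "idle" : "working on #{job['queue']} since #{job['run_at']}"
  puts "#{worker} -- #{status}"
end
```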

Old pools will usually shut down gracefully; as of 2021, we'd expect a restart hiccup every few weeks or so.

if the above determines that the right number of workers is actually running, but the alert persists

Finally, there is a very unusual situation where the resque-pool instances are all running correctly, and ps across the worker VMs indicates that the worker and resque-pool master processes have the expected counts, with no stale worker processes... but the worker count is still too high according to the Resque web console and okcomputer/Nagios. In this case, it's possible that Resque has stale worker registrations cached in Redis. You can remedy this by pulling up a Rails console on any VM for the pres cat instance in question and running:

Resque.workers.map(&:prune_dead_workers)

This situation can arise when a pool is killed very abruptly, as when a VM is rebooted while work is in progress without first stopping the pool.
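If you want to confirm that the prune actually cleaned things up, a sketch of what the console session might look like follows; the before/after counts are just a sanity check, and the prune call is the same one shown above.

```ruby
# Sanity-check sketch around the prune: compare the registered worker count before
# and after removing stale registrations from Redis.
puts "registered before: #{Resque.workers.size}"

# prune_dead_workers unregisters workers whose recorded process is (roughly speaking)
# no longer alive; invoking it via the existing worker instances cleans the registry.
Resque.workers.map(&:prune_dead_workers)

puts "registered after:  #{Resque.workers.size}"
```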

For more context, see: