Job queues - sul-dlss/preservation_catalog GitHub Wiki

Diagnosis

Very deep queues

A lengthy backlog for "validate_moab" is a problem: it indicates that something is blocking accessioning, and it should be investigated.

A lengthy backlog for the other queues is to be expected, because a large number of jobs may be added to a queue at the same time (see schedule.rb). However, Sidekiq should make steady progress on a lengthy backlog, even if completing the backlog takes multiple days.
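
One rough way to judge "steady progress" is to compare two queue-size readings taken some hours apart. The sketch below is illustrative only: `hours_to_drain` is a hypothetical helper, not part of PresCat or Sidekiq, and it assumes the sizes were noted by hand (for example, from the Sidekiq web UI) and that no new jobs were enqueued between the readings.

```ruby
# Hypothetical helper: estimate the hours left on a backlog from two readings.
# size_before / size_after: queue sizes at the first and second reading.
# hours_between: hours elapsed between the two readings (must be positive).
# Returns nil when the queue did not shrink, i.e. no progress was observed.
def hours_to_drain(size_before, size_after, hours_between)
  raise ArgumentError, 'hours_between must be positive' unless hours_between.positive?

  drained_per_hour = (size_before - size_after) / hours_between.to_f
  return nil unless drained_per_hour.positive?

  size_after / drained_per_hour
end
```

For example, a queue that went from 100,000 to 90,000 jobs in 2 hours is draining at 5,000 jobs per hour, so `hours_to_drain(100_000, 90_000, 2)` returns 18.0. A `nil` result for a queue that should be draining is a cue to look more closely.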

Jobs are getting errors that seem impossible, given the current state of the code

Go to the worker VMs and run ps -ef | grep kiq. If you notice any stale worker management processes, kill them so that only current code is executing. On rare occasions, a deployment fails to replace the old worker processes with new ones, and the old processes, still running out-of-date code, continue to pick up jobs.
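
When a VM runs many processes, it can help to filter by age rather than eyeball the list. The following is a minimal sketch, not an existing PresCat tool: `stale_sidekiq_pids` is a hypothetical helper, it assumes Linux `ps -eo pid,etimes,args` output (PID, elapsed seconds, command line), and it assumes that Sidekiq processes have "sidekiq" in their command line. The age threshold is whatever you judge to be "older than the last deploy".

```ruby
# Hypothetical helper: return the PIDs of Sidekiq processes that have been
# running for more than max_age_seconds.
# ps_output is text shaped like the output of `ps -eo pid,etimes,args`.
def stale_sidekiq_pids(ps_output, max_age_seconds)
  ps_output.each_line.filter_map do |line|
    pid, elapsed, command = line.strip.split(/\s+/, 3)
    # Skip the header row, blank lines, and anything that is not Sidekiq.
    next unless command&.include?('sidekiq')
    next unless elapsed.to_i > max_age_seconds

    pid.to_i
  end
end
```

Called as `stale_sidekiq_pids(`ps -eo pid,etimes,args`, 3600)`, it would list Sidekiq PIDs that have been running for more than an hour. Confirm each one by hand before killing anything.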

Sidekiq configuration

PresCat uses an approach to Sidekiq configuration that provides fine control over workers, threads, and queues:

  • The number of worker processes for a server is specified in that server's puppet configuration.
  • Each worker process has its own Sidekiq configuration file, named with the number of the worker. For example, if there are 2 worker processes, then there will be a sidekiq1.yml and a sidekiq2.yml.
  • Sidekiq configuration files are stored in shared_configs.
  • Each Sidekiq configuration file specifies the number of threads and the queues that are serviced. For example:
---
:concurrency: 3
:queues:
  - validate_moab
  • Each Sidekiq worker process has multiple worker threads (concurrency) for working the designated queues.
    • A worker thread is what actually picks up a job from a queue to perform the work.
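
To see how the per-process files add up, the sketch below totals, for each queue, the threads across all worker processes that are able to service it. It is illustrative only: `threads_per_queue` is a hypothetical helper, the second file's contents and the `other_queue` name are invented for the example, and it assumes every entry under `:queues:` is a plain queue name with no weight.

```ruby
require 'yaml'

# Hypothetical helper: for each queue, total the threads (concurrency) of every
# worker process whose config lists that queue.
# configs maps a config file name to that file's YAML text.
def threads_per_queue(configs)
  totals = Hash.new(0)
  configs.each_value do |yaml_text|
    # Sidekiq config keys such as :concurrency: parse as Ruby symbols.
    config = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
    concurrency = config.fetch(:concurrency)
    config.fetch(:queues).each { |queue| totals[queue] += concurrency }
  end
  totals
end

configs = {
  'sidekiq1.yml' => "---\n:concurrency: 3\n:queues:\n  - validate_moab\n",
  # Invented second file, for illustration only.
  'sidekiq2.yml' => "---\n:concurrency: 5\n:queues:\n  - validate_moab\n  - other_queue\n"
}
summary = threads_per_queue(configs)
```

Here `summary` is `{"validate_moab" => 8, "other_queue" => 5}`. Threads in a process are shared among all the queues that process services, so these figures are upper bounds on the threads working a queue at any moment, not dedicated capacity.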

Expect to adjust the Sidekiq configuration over time, based on usage.

Aside: for Resque, a worker management process coordinates many separate worker processes, with each Resque worker process picking up jobs from a designated group of queues.