new metadata regeneration tasks do not start (bsc#961002) - uyuni-project/uyuni GitHub Wiki

Intro

Bug reference: https://bugzilla.suse.com/show_bug.cgi?id=961002

Description: new metadata generation tasks do not start in Taskomatic (other tasks will start just fine).

Reproducer

schedule an mgr-sync refresh on a test host: mgr-sync refresh --refresh-channels --schedule
take a look at top, when res6 or res7 is being reposynced try to schedule an unrelated job also from the WebUI, eg. auto-errata: https://$HOST/rhn/admin/BunchDetail.do?label=auto-errata-bunch
you get Single run of auto-errata-bunch bunch was scheduled for DATE but no run appears in the table below, and actually Taskomatic won't start any new task until the reposync is finished.

Digging: from Manager to Taskomatic to quartz

I wrote a tool to see what Taskomatic is doing and tried to schedule 3 auto-errata tasks to see what happened.

Normally, they get executed right away:

images/bsc961002-taskotop-normal.png

When reposync on RES 6 is happening they are simply not there:

images/bsc961002-taskotop-failure.png

So apparently, those runs are scheduled correctly but they are not picked up. I could further make sure that's the case by attaching a debugger to Taskomatic, and stepping into both Taskomatic and quartz sources (by attaching them in Eclipse).

In order to further make sure relevant tables got rows INSERTed correctly, I planted some custom triggers and a logging table (here shown in Oracle flavor):

-- table in which to log "interesting events"
CREATE TABLE moio_log (
  id NUMBER(10) NOT NULL PRIMARY KEY,
  label VARCHAR2(200 BYTE), -- Tasko job label
  teibo VARCHAR2(200 BYTE), -- table name
  ts TIMESTAMP(6)
);

-- implement autoincrement for moio_log.id (does not exist before Oracle 12c)
CREATE SEQUENCE moio_log_seq;
CREATE OR REPLACE TRIGGER moio_log_pk 
BEFORE INSERT ON moio_log FOR EACH ROW
BEGIN
  SELECT moio_log_seq.NEXTVAL
  INTO :new.id
  FROM DUAL;
END;

-- log insertions in qrtz_job_details (new single schedule => new quartz trigger => new row in this table)
CREATE OR REPLACE TRIGGER moio_qrtz_job_details_log
AFTER INSERT ON qrtz_job_details FOR EACH ROW
BEGIN
  INSERT INTO moio_log(label, teibo, ts)
    VALUES (:new.job_name, 'qrtz_job_details', SYSDATE);
END;

-- log insertions in rhntaskoschedule (new single schedule => new row in this table)
CREATE OR REPLACE TRIGGER moio_rhntaskoschedule
AFTER INSERT ON rhntaskoschedule FOR EACH ROW
BEGIN
  INSERT INTO moio_log(label, teibo, ts)
    VALUES (:new.job_label, 'rhntaskoschedule', SYSDATE);
END;

Result: jobs are scheduled correctly and tables actually get rows INSERTed as expected.

images/bsc961002-moio_log-sqldeveloper.png

So from the Manager POV all is fine, Taskomatic has scheduled the job.

From the Taskomatic POV all is fine, quartz has created the trigger.

From the quartz POV all is fine as far as the trigger creation is concerned, but the trigger is not picked up.

Digging more: quartz internals

What is the scheduler doing, instead of picking up jobs? Let's take a look at a thread dump:

3XMTHREADINFO      "DefaultQuartzScheduler_QuartzSchedulerThread"
[...]
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Object.wait(Native Method)
4XESTACKTRACE                at java/lang/Object.wait(Object.java:196(Compiled Code))
4XESTACKTRACE                at org/quartz/simpl/SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:424)

As the method name implies, blockForAvailableThreads will spinlock until a worker thread is free.

WAT-moment: we have 20 workers, only 3 doing something but according to the scheduler none is free!

What are threads actually doing? In most cases...

3XMTHREADINFO      "DefaultQuartzScheduler_Worker-10"
[...]
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Object.wait(Native Method)
4XESTACKTRACE                at java/lang/Object.wait(Object.java:167)
4XESTACKTRACE                at com/redhat/rhn/taskomatic/TaskoJob.execute(TaskoJob.java:136)

...waiting at the Taskomatic level for other tasks to complete.

Explanation of the bug

In general, multiple Taskomatic runs of the same task may happen in parallel. In principle, for example, one could allow two or three CobblerSyncTask to happen in parallel (how many is configurable in rhn.conf via the "taskomatic.<QUALIFIED_CLASS_NAME>.parallel_threads = <N> parameter).

Actually, the default is 1 for all tasks. If this "limit" is reached, new runs of the same task will for others to finish.

Problem is, this waiting is not implemented at the Quartz level, but at the Taskomatic level, inside of the Quartz execute() method - that means waiting happens in a Quartz trigger execution environment, that is a Quartz thread. In other words, waiting will keep one Quartz worker thread busy.

We only have 10 of them, and each RepoSyncTask consumes one.

manager.suse.de has 133 channels, most of which are SUSE channels refreshed nightly in one shot.

133 > 10.

Baboom - idle RepoSync tasks hog all our Taskomatic workers and no other tasks are scheduled until they are finished.

Proposed solution

Currently RepoSyncTask has a "parameter" (actually a Java object serialized and saved into the DB as BLOB in rhnTaskoSchedule) that specifies a channel ID.

Generalize RepoSyncTask to accept a list of channel IDs, and schedule them in bulk when that makes sense.

Reposync them one after another in one only Quartz worker thread.