IO against Ceph-backed preservation storage is hanging indefinitely (steps to address IO problems, and follow-on cleanup)

Restarting virtual machines to deal with indefinitely stuck Ceph IO; auditing for and remediating any resulting corruption

Starting in late June 2022, we've experienced an issue where most reads and/or writes to Ceph-backed preservation storage will hang indefinitely, neither throwing errors nor completing. See e.g. https://github.com/sul-dlss/preservation_catalog/issues/1896 and https://github.com/sul-dlss/operations-tasks/issues/3100

This page is meant to be a guide to help:

  1. determine whether that problem is occurring at any given moment
  2. alleviate the problem so that preservationIngestWF and thus accessionWF can resume (i.e. "unwedge accessioning")
  3. find and remediate any data corruption that may have been introduced

Determining whether this is necessary

The two most common ways that we become suspicious that this problem is occurring are:

  • Someone who does accessioning complains that it's taking an inordinately long time for a newly accessioned version to finish going through the pipeline.
  • An infrastructure developer or ops engineer who is watching the robot Resque console or preservation catalog's Resque console notices that preservation storage related jobs (e.g. UpdateMoab on preservation robots, or validate_moab on preservation catalog) are taking an inordinately long time (e.g. an UpdateMoab or validate_moab job for a 100 MB Moab has been running for 2 hours).

Additionally, ops will sometimes notice Ceph slow MDS request alerts, which should prompt a check for one of the above manifestations of the issue.
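
If ops has shared access to a Ceph admin or monitor node (an assumption; infra developers may not have this access themselves), a quick health check can confirm whether MDS requests are actually backing up. A minimal sketch:

    # Cluster-wide health summary; slow MDS requests usually surface as a
    # health warning (e.g. MDS_SLOW_REQUEST) in the detailed output.
    ceph -s
    ceph health detail

    # Show the CephFS MDS daemons and which ranks are active vs. standby,
    # which is useful context if an MDS failover is requested later.
    ceph fs status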

So how do I know if things are really stuck?

Tip: if you're babysitting Resque workers to determine whether things are stuck/unstuck, you may want to turn on the "Live Poll" mode. There's a link in the bottom right corner of the Resque web console for this. It'll auto-refresh the page every 2 seconds, so you don't have to constantly refresh manually. Note that it will eventually stop refreshing when your webauth session times out; when this happens, refresh the page manually, acknowledge 2FA if need be, and hit "Live Poll" again once the console page has reloaded.

Sometimes it does legitimately take a long time for a large object to be processed. Unfortunately, we don't have systematically collected stats or detailed expectations about overall throughput, or about processing time for a single object given its number of files and/or total file size. But, for example, most objects that are 100 MB or less should take no more than a minute or two to be processed by any given accessionWF or preservationIngestWF step (once picked up! it is of course possible that there's a long backlog of work during a period of heavy activity, leaving many objects sitting unworked in the queue at any given time). Similarly, it would not be surprising for an object that is tens of GB to take tens of minutes at any given step involving preservation storage IO, and it would be typical for a very large object (e.g. a 1 TB media object) to take many hours (even a day or two might be reasonable).

One thing you can do to help determine whether IO is truly hung is to work with ops or Andrew to check whether any file IO is actually happening, e.g. by using Linux CLI tools to examine the disk and CPU activity of the relevant worker processes; for operations like TransferObject and UpdateMoab, you might also use du to see whether bytes are still being written to the target storage.
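
As a concrete example, here is a minimal sketch of that kind of spot check from the shell of the relevant worker VM (the PID and the storage path are placeholders; the real path for a given druid depends on the storage root configuration):

    # Find the Resque child process actually doing the work (the parent
    # worker forks a child per job); the [r] trick keeps grep itself out
    # of the results.
    ps -ef | grep [r]esque

    # Watch per-process disk IO for the suspect child (pidstat is part of
    # sysstat). A truly hung process will sit at 0 kB/s read and write.
    pidstat -d -p <child PID> 5

    # For TransferObject / UpdateMoab, check whether bytes are still being
    # written to the target directory (placeholder path).
    watch -n 30 du -sh /path/to/storage_root/<druid tree>/<druid>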

An additional confounding factor: we've recently seen a situation where many objects, often sharing the first couple of characters of their respective druids (e.g. kj123bc4567 and kj456df7890), will get stuck on an IO-heavy step (e.g. validate_moab in pres cat or UpdateMoab in pres robots) for about 15 minutes, at which point the traffic jam will clear on its own and the workflows will continue moving.

First things to try if it seems things are actually stuck

If none of the above (large objects taking long time to process as expected, objects sitting in a deep queue, a self-clearing 15-ish minute jam) seem to apply, it is likely that things are actually stuck. If possible, consult with ops or Andrew to see if someone who has dealt with the issue before agrees that things are stuck based on prior first-hand experience.

Rebooting is the most disruptive action and the one most likely to introduce corruption, but it is also the one most likely to get things moving if they are truly stuck. Before resorting to it, you can first try:

  • MDS failover -- this is something that you must request from ops. They will failover one or more Ceph MDS servers that are being slow to respond. This may completely alleviate the issue, in which case it is unlikely that data corruption will be introduced. After the failover(s), watch the queues and determine per the above guidance whether the blockage has been resolved.
  • killing Resque worker processes for stuck jobs -- you can do this manually from the command line of the VM on which the worker process is running (see the sketch after this list). You'd want to use either kill -9 <PID of the child Resque process doing the work> or kill -s TERM <PID of the parent process that forked off the process that's actually doing the work>. Info about how Resque processes interpret signals here. You can find parent process IDs by looking at the number after the colon in the "Where" column of the Resque console for the job you're curious about. You can get the child process ID by running ps -ef | grep resque from the command line of the worker VM and finding the ID of the process that the aforementioned parent forked. NOTE: at present there is one pres robots worker VM, but there are 3 pres cat worker VMs -- the "Where" column on the Resque console should tell you which host the job you care about is running on.
    • Killing a worker may introduce data corruption no matter which worker process is killed, and will almost certainly introduce corruption if the process killed is a pres robots worker performing UpdateMoab or TransferObject. If the terminated worker was running a validate_moab job in pres cat, you can retry it from the Resque console. Because the robot system has a bit more indirection and management over robots Resque instances, any pres robots jobs that are killed should be retried via the workflow UI in Argo or by manually resetting workflow steps from the workflow-server-rails rails console. More detailed info on that and the rest of remediation is linked later in this document.
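
For reference, a sketch of that kill procedure from the worker VM's shell (PIDs are placeholders; the parent PID is the number after the colon in the Resque console's "Where" column, and the child is the process it forked):

    # Find the resque worker processes on this VM and identify the stuck
    # child (its PPID will be the parent PID from the "Where" column).
    ps -ef | grep [r]esque

    # Either kill the forked child that is actually doing the (stuck) work...
    kill -9 <child PID>

    # ...or send TERM to the parent worker that forked it.
    kill -s TERM <parent PID>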

Sometimes an MDS failover or the targeted termination of a worker process will clear up the jam entirely. Sometimes it will clear things for a few minutes, only for things to get stuck again. You'll have to watch the Resque consoles for a few minutes after either of the above measures and determine which is the case.

What to do if it's determined that rebooting pres robots and/or pres cat is necessary

  1. Warn #dlss-aaas that infra and ops are aware that things are stuck and are working to resolve the issue. Ask that people please refrain from further accessioning until the all-clear is given.
  2. Turn off Google books retrieval.
  3. Do a graceful shutdown of the resque-pool master (likely pres robots, maybe also pres cat). You can do this using bundle exec cap resque:pool:stop from the directory of the applicable project on your laptop (see the command sketch after this list).
  4. Terminate any stuck worker processes that remain after stopping resque-pool (maybe pres robots, maybe pres cat -- either way, note stuck druids and age of hung jobs, even if only in Slack discussion about the issue, as this may be useful for later follow up, both when auditing for data corruption and when looking at logs to try to home in on the underlying Ceph issue that we've not yet figured out).
  5. Give the all clear to ops to reboot (likely just pres robots VM, maybe also pres cat VMs, but not the pres cat redis VM).
  6. Wait for ops to indicate completion of VM reboot(s).
  7. Re-enable the pres cat workers, let them work off any backlog. (bundle exec cap resque:pool:hot_swap from the pres cat project directory on your laptop)
  8. Re-enable pres robots workers (same hot_swap command, but from pres robots' directory), let them work off any backlog (or at least a significant portion of the backlog, if it's very large, e.g. if preservationIngestWF was stuck for a whole day).
  9. Keep an eye on the Argo workflow grid for errors.
  10. If it appears that things are flowing normally again for the moment (e.g. if they've run without sticking for 30 minutes or so), give the all clear (for now) to #dlss-aaas to start accessioning again.
  11. Run checksum validation audits and replication audits for objects accessioned from a day before the first reboot through a day after the last reboot (the "first"/"last" distinction matters if there were multiple reboots within a few days before auditing was done). You can see an example of how to do this en masse from a text file generated from an appropriate Argo facet by looking at this comment and this comment. Note that you may have to lightly hand-edit the Argo facet URL to get exactly the date range you want for your results (note also that the Argo search uses UTC as its time zone, but you're probably thinking about this in terms of Pacific time, which is also what many other systems likely use for their logs).
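
For reference, the resque-pool shutdown and restart from steps 3, 7, and 8 look roughly like this from your laptop (a sketch; the local checkout paths are placeholders, and you may need to pass a capistrano stage name depending on the deploy configuration):

    # Step 3: gracefully stop the resque-pool master (pres robots, and pres
    # cat too if its workers are also stuck).
    cd ~/projects/preservation_robots    # placeholder path to local checkout
    bundle exec cap resque:pool:stop

    # Steps 7 and 8: once ops confirms the VM reboots are done, bring the
    # workers back with a hot swap, pres cat first, then pres robots.
    cd ~/projects/preservation_catalog
    bundle exec cap resque:pool:hot_swap

    cd ~/projects/preservation_robots
    bundle exec cap resque:pool:hot_swap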

Remediating data corruption introduced by killing stuck Resque worker processes and/or reboots of preservation related VMs

Remediation of single Moabs and bulk remediation of many Moabs simultaneously are covered in this wiki page: validate moab step fails during preservationIngestWF

The corrupted Moabs will fail checksum validation in pres cat, and will likely also have an error at one of the preservationIngestWF steps. The workflow grid (especially failures at transfer-object, update-moab, or validate-moab in preservationIngestWF, but also errors in preservationAuditWF) is a good indicator of which objects need attention and possibly manual remediation. There may also be complaints about a mismatch between expected and actual versions.