Ceph Errors - sul-dlss/preservation_catalog GitHub Wiki


Symptoms

  • HoneyBadger "Errno::EACCES: Permission denied @ rb_sysopen - /pres-0n/sdr2objects/ ..." (which could happen in DorServicesApp as part of "ShelveJob" step for accessioning, or in PreservationCatalog as part of update-moab step.)
  • "Preservation::Client.http_response got 500" in other parts of SDR.
  • Someone who does accessioning complains that it's taking an inordinately long time for a newly accessioned version to finish going through the pipeline. Most objects of 100 MB or less should take no more than a minute or two to be processed by any given accessionWF or preservationIngestWF step. An object of 10+ GB may take tens of minutes at any step involving preservation storage IO, and it would be typical for a very large object (e.g. a 1 TB media object) to take many hours to a day or more.
  • Preservation storage related jobs (e.g. UpdateMoab on preservation robots, or validate_moab on preservation catalog) are taking an inordinately long time (e.g. an UpdateMoab or validate_moab job for a 100 MB Moab has been running for 2 hours).
  • Ops notices Ceph slow metadata service (MDS) request alerts, urging us to look for one of the above manifestations of the issue.
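If you have access to Ceph status output from ops, a quick filter can surface the slow-MDS-request warnings mentioned above. This is a hypothetical helper, not part of our tooling; normally you'd pipe in real `ceph -s` or `ceph health detail` output from a Ceph admin host.

```shell
# Hypothetical helper: filter Ceph status/health output for the
# MDS slow-request warnings that ops alert on.
mds_slow_requests() {
  # grep reads the function's stdin; `|| true` keeps a no-match
  # result from failing the pipeline under `set -e`
  grep -Ei 'slow (requests|ops)|mds' || true
}

# Example (run wherever the ceph CLI is available):
#   ceph health detail | mds_slow_requests
```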

Diagnosing

On the preservation storage mount, affected Moabs show "Permission denied" errors and unreadable ("?????????") entries when listed with ls -l, e.g.:
```
./bq499mh5981/v0010/data/metadata:
ls: cannot access ./bq499mh5981/v0010/data/metadata/workflows.xml: Permission denied
ls: cannot access ./bq499mh5981/v0010/data/metadata/events.xml: Permission denied
ls: cannot access ./bq499mh5981/v0010/data/metadata/versionMetadata.xml: Permission denied
total 3
-rw-r--r-- 1 pres pres 2138 May 28 09:37 descMetadata.xml
-????????? ? ?    ?       ?            ? events.xml
-rw-r--r-- 1 pres pres  263 May 28 09:37 provenanceMetadata.xml
-????????? ? ?    ?       ?            ? versionMetadata.xml
-????????? ? ?    ?       ?            ? workflows.xml
```
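To look for more Moabs in this state without eyeballing ls output by hand, something like the following sketch can walk a storage root and print files the current user can't read. The path in the example is an assumption; point it at the affected mount.

```shell
# Hypothetical sketch: print files the current user cannot read --
# these are the ones that show up as "?????????" in ls -l when the
# Ceph MDS is wedged.
scan_unreadable() {
  local root="$1"
  # GNU find: ! -readable matches files we lack permission to read
  find "$root" -type f ! -readable -print 2>/dev/null
}

# Example (path is illustrative):
#   scan_unreadable /pres-01/sdr2objects
```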

In addition, you can work with ops or Andrew to determine whether any file IO is actually happening, e.g. by using Linux CLI tools to examine the disk and CPU activity of the relevant worker processes. For operations like TransferObject and UpdateMoab, you can also use `du` to see whether bytes are still being written to the target storage.
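The `du` check above can be sketched as a helper that samples a directory's size twice and reports whether it grew. The interval and paths are assumptions; adjust as needed.

```shell
# Hypothetical helper: sample a directory's size twice to see whether a
# transfer (e.g. a TransferObject or UpdateMoab target) is still
# writing bytes. Interval defaults to 30s.
still_growing() {
  local dir="$1" interval="${2:-30}"
  local before after
  before=$(du -s "$dir" | awk '{print $1}')
  sleep "$interval"
  after=$(du -s "$dir" | awk '{print $1}')
  if [ "$after" -gt "$before" ]; then
    echo "still writing: $before -> $after blocks"
  else
    echo "no growth: $dir holding at $before blocks"
  fi
}

# Example: still_growing /pres-01/sdr2objects/bq/499/mh/5981 60
```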

What To Do

  • Wait for Ceph to recover and/or a jam to resolve (at least 15 minutes).
  • Ask ops to check whether the cluster looks healthy overall and/or to restart the Ceph metadata service (MDS). After the restart, watch the queues and determine, per the guidance above, whether the blockage has been resolved.
  • Kill worker processes for stuck jobs. (For the robots, these will be Resque workers; for PresCat, these will be Sidekiq workers.)
    • After killing workers, check for corruption if the killed process was a pres robots worker performing UpdateMoab or TransferObject. Corrupted Moabs will fail checksum validation in pres cat, and will likely also have an error at one of the preservationIngestWF steps. The workflow grid (especially failures at transfer-object, update-moab, or validate-moab in preservationIngestWF, but also errors in preservationAuditWF) is a good indicator of which objects need attention and possibly manual remediation. There may also be complaints about a mismatch between expected and actual versions.
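Finding the stuck worker processes can be sketched roughly as below. This is a hypothetical helper: the process pattern and age threshold are assumptions, and you should always eyeball the output before sending any signals.

```shell
# Hypothetical sketch: list worker processes matching a pattern that
# have been running longer than a threshold, as candidates for
# termination.
list_old_workers() {
  local pattern="$1" min_minutes="$2"
  # etimes = elapsed seconds since the process started (Linux procps)
  ps -eo pid,etimes,args |
    awk -v pat="$pattern" -v min=$((min_minutes * 60)) \
      'index($0, pat) && $2 > min'
}

# Examples (patterns/thresholds are illustrative):
#   list_old_workers resque 120    # pres robots Resque workers older than 2h
#   list_old_workers sidekiq 120   # pres cat Sidekiq workers older than 2h
# Then: kill -TERM <pid>   (escalate to kill -KILL only if TERM is ignored)
```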

Sometimes an MDS failover or the targeted termination of a worker process will clear up the jam entirely. Sometimes it will clear things for a few minutes, only for things to get stuck again. You'll have to watch the Resque consoles for a few minutes after either of the above measures and determine which is the case.

What to do if it's determined that rebooting pres robots and/or pres cat is necessary

  1. Warn #dlss-aaas that infra and ops are aware that things are stuck and are working to resolve the issue. Ask that people refrain from further accessioning until the all-clear is given.
  2. Turn off Google books retrieval.
  3. Do a graceful shutdown of resque-pool master (likely pres robots, maybe also pres cat). You can do this with `bundle exec cap resque:pool:stop` from the directory of the applicable project on your laptop.
  4. Terminate any stuck worker processes that remain after stopping resque-pool (maybe pres robots, maybe pres cat). Either way, note the stuck druids and the age of the hung jobs, even if only in the Slack discussion about the issue; this can be useful later, both when auditing for data corruption and when combing logs to home in on the underlying Ceph issue that we've not yet figured out.
  5. Give the all clear to ops to reboot (likely just pres robots VM, maybe also pres cat VMs, but not the pres cat redis VM).
  6. Wait for ops to indicate completion of VM reboot(s).
  7. Re-enable the pres cat workers and let them work off any backlog (`bundle exec cap resque:pool:hot_swap` from the pres cat project directory on your laptop).
  8. Re-enable pres robots workers (same `hot_swap` command, but from pres robots' directory), and let them work off any backlog (or at least a significant portion of it, if it's very large, e.g. if preservationIngestWF was stuck for a whole day).
  9. Keep an eye on the Argo workflow grid for errors.
  10. If it appears that things are flowing normally again for the moment (e.g. if they've run without sticking for 30 minutes or so), give the all clear (for now) to #dlss-aaas to start accessioning again.
  11. Run checksum validation audits and replication audits for objects accessioned from one day before the first reboot through one day after the last reboot (if there were multiple reboots within a few days, use one window covering them all). You can see an example of how to do this en masse from a text file generated from an appropriate Argo facet by looking at this comment and this comment. Note that you may have to lightly hand-edit the Argo facet URL to get exactly the date range you want (note also that the Argo search uses UTC as its time zone, whereas you're probably thinking about this in Pacific time, which is also what many other systems probably use for their logs).
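The en-masse audit step above can be sketched as a loop over a druid list exported from an Argo facet. Everything here is an assumption for illustration (the file name, the function, and the commented-out invocation); use the real one-liners from the comments linked above.

```shell
# Hypothetical sketch: iterate over a file of druids (one per line)
# and queue an audit for each.
queue_audits() {
  local druid_file="$1"
  while read -r druid; do
    [ -n "$druid" ] || continue
    echo "would queue checksum validation + replication audit for $druid"
    # From the pres cat app directory you'd run the appropriate
    # rails runner / rake invocation for this druid here (see the
    # linked comments for the actual commands).
  done < "$druid_file"
}

# Example: queue_audits druids_from_argo_facet.txt
```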