Disaster History Log - internetarchive/openlibrary GitHub Wiki
This is a legacy document. In 2022, we migrated our process to use Postmortem Issues on github.
When someone goes wrong/unplanned on openlibrary.org, it’s recommended to open a new issue of Type: Post-Mortem (using the issue template -- which should automatically set Type: Post-Mortem and Priority: 0 and to make sure yourself or someone else is the assignee as the designated notetaker and/or person responsible leading effort. Please also specify an Affects: label to help others search and/or get up to speed.
We used to track brief notes about outages and ops issues on https://github.com/internetarchive/openlibrary/wiki/Disaster-History-Log. The goal was a searchable log of past events (in case of a repeat emergency). On 2019-12-27 @cdrini + @mekarpeles decided to switch to using github issues with Type: Post-Mortem labels to document issues as they are happening (and to post retro reflections).
Outage scenarios in descending order of occurrence:
2019-12-27 15:15 PT: ol-web3
and ol-web4
failing fetch betterworldbooks prices
BWB switched their product API (e.g. 'https://products.betterworldbooks.com/service.aspx?ItemId=9780142410752') to only use https (for PCI compliance reasons) and this was breaking Open Library and urllib2
. The issue was documented in https://github.com/internetarchive/openlibrary/pull/2779 and we solved by switching to requests
. When deployed, this still didn't fix on production. The solution was to upgrade pip
on production machines and installing the latest pyopenssl
and cryprography
.
2019-03-28 (and the week thereof): Overloaded ElasticSearch https://docs.google.com/document/d/1toBEk01cJ0uUyfOFIxrASlFzX9aPIML3vdj4WAXrxyY/edit# Open Library was experiencing unresponsive web heads (ol-web3, ol-web4) via haproxy. The reason for the outage was that recent changes have been made with Archive.org Elastic Search so that clients who make many concurrent requests get demoted and queued, so that these requests do not take priority over users making individual requests. It's hypothesized that Open Library (as a whole) is somehow being treated as a single client (instead of Open Library's clients being passed through and treated as individual end-users) and presumably this is resulting in Open Library being penalized for its high volume of searches. The end result is ol-web3 and ol-web4 search requests queue are being queued to the point of the Open Library service becoming unresponsive. Sam was able to discover the issue by attaching strace
to a single thread of ol-home
and notice the process was hanging on search dispatch to Archive.org Elastic Search. The solution required unblocking Elastic Search on Archive.org's side. On the ol-web3, debug went like:
# Update `/olsystem/bin/upstart-service` so “function openlibrary-server” has NUM_WORKERS=1
$ sudo supervisorctl stop openlibrary # stop gunicorn
$ sudo netstat -ntlp # verify nothing listening on 7071 (otherwise kill listening pid)
$ sudo supervisorctl start openlibrary # start gunicorn in single thread mode
$ sudo strace -tt -s 500 -p $pid_of_gunicorn
# Use curl to see what's happening and then audit the strace
# revert/checkout `/olsystem/bin/upstart-service` back to normal when finished
2019-09-12 19:59 - 20:52 ET: ol-web3:/var/log/openlibrary/upstart.log
exploding in size possibly due to archive.org serving 503s ; some user disruption due to archive.org unavailability
* 19:59: abazella notified on slack about ol-web3:/var/log/openlibrary
filling up
* 20:17: Drini investigated and found upstart.log
was 188GB, and was spitting out:
2019-09-13 00:18:43 [28300] [openlibrary.core.lending] [ERROR] POST failed ... HTTPError: HTTP Error 503: Service Temporarily Unavailable
* 20:26: Linked to archive.org outage
* 20:44: Drini (with buy-in from Mek) truncated the log file: sudo truncate upstart.log --size 0
* Conclusion: It looks like ol-web3
is not rolling the log file correctly; that should be investigated.
-
- Situation:
ol-www1
was getting clobbered shortly after syncing it in prep for migration from Funston to Richmond (but this was a red herring). Spikes of traffic were forcing haproxy to 503 w/ haproxy status sQ - Tried:
- Restarting nginx
- Restarting haproxy (helped for a short period)
- Updated haproxy.conf to only log errors (this was helpful; helped us identify slow API calls)
- Updated haproxy.conf to increase process count for
ol-web3
,ol-web4
(endpoints quickly became saturated again)
- Resolution:
- Identify 503 sQ using tail -f /var/log/haproxy.log on ol-www1
- Run:
sudo tail -n 5000 /var/log/nginx/access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail -n 10 | /home/samuel/work/reveal-abuse/reveal | /home/samuel/work/reveal-abuse/shownames
- See also: https://git.archive.org/mek/detect-abuse
- Blocking bad actors on
ol-www1
nginx
in/olsystem/etc/nginx/deny.conf
- Moving forward @jonah will help us implement haproxy sticktables to ban IPs for periods as use becomes abusive/disruptive
- Situation:
-
- Situation: Packet-flapping on the baremetal (running nearly all of OL's services) took down Open Library services -- see: https://internetarchive.slack.com/archives/C06RP0F6E/p1492465473543927
- Resolution: jonah reset ports switch side, host came back
-
- Situation: ACS4 went down, Nagios alerts triggered, Open Library borrowing failed
- Resolution:
- Brenton restarted ACS4 via Ganeti (mek didn't have access)
- It took ~10 minutes for nagios to catch up on alerts
- Mek restarted
ol-web3
andol-web4
(as otherwise loans were failing) - Outage recovery was added to status.archivelab.org
- Brenton restarted ACS4 via Ganeti (mek didn't have access)
- Learnings:
- We need a better guide for ACS outages:
- More people need access to
ganeti
andol-acs4
- Resolution(s):
- Mek now has access to ACS + ganeti
-
- Situation: Cover service was down
- Tried:
- Restarting cover store
- Restarting just
ol-web3
(a la blue green deploy) via supervisord - Looked at logs, noticed problem starting gunicorn on
ol-web3
- Resolution: https://internetarchive.slack.com/archives/openlibrary/p1487790396002096
- Thanks: Sam
-
- Situation: Scheduled Maintenance w/ Trevor
- Problems: Servers which were restarted came back up w/ permission problems (/var/run/openlibrary missing)
- Resolution:
- we added server restart guide to deployment in Deployment-Scratchpad#handling-server-reboot
- we added docs for checking other services (e.g. solr-updater) which may go down in Deployment-Scratchpad#verify-deployment
- Questions:
- Why is this happening? Is some startup process nuking /var/run/openlibrary? https://internetarchive.slack.com/archives/openlibrary/p1487113203001313
- Thanks: Brenton
-
- Situation: Deploying IA/OL XAuth bridge, deployment restart failed (
ol-web3
,ol-web4
) - Thanks: Sam, Mark, Jim, Brenton
- Problem: Deployment process was not robust against new pip dependencies (in this case LEPL for email validation)
- Learnings:
- Changing symlinks must be done on all services (
ol-web3
,ol-web4
, etc) - Do not reset history in git as our deploy process looks only at the most recent hash it knows about!
- Changing symlinks must be done on all services (
- Resolution: we added blue-green deployment to our docs + guide on adding dependencies in Deployment-Scratchpad#satisfying-dependency-changes
- Situation: Deploying IA/OL XAuth bridge, deployment restart failed (
-
- Situation: DNS problems reported here, debugged here and
- Resolution: solved here -- we started documenting our deployment process
-
- Situation: DNS problem, resolved itself, recounted here
-
- Situation: DNS problems, resolved itself, recounted here