Disaster History Log

Post-Mortems

This is a legacy document. In 2022, we migrated our process to use Post-Mortem issues on GitHub.

How-To

When something goes wrong or unplanned happens on openlibrary.org, it's recommended to open a new issue of Type: Post-Mortem using the issue template (which should automatically set the Type: Post-Mortem and Priority: 0 labels). Make sure you or someone else is the assignee, as the designated notetaker and/or person responsible for leading the effort. Please also specify an Affects: label to help others search and/or get up to speed.
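For reference, a minimal sketch of opening such an issue with the GitHub CLI (the label names come from this page; the title, Affects: value, and assignee are placeholders):

# Hedged sketch: open a post-mortem issue from the command line (placeholders in <angle brackets>)
$ gh issue create \
    --repo internetarchive/openlibrary \
    --title "Post-Mortem: <short description of the incident>" \
    --label "Type: Post-Mortem" \
    --label "Priority: 0" \
    --label "Affects: <area>" \
    --assignee "@me" \
    --body "Timeline, impact, and retro notes go here."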

History

We used to track brief notes about outages and ops issues on https://github.com/internetarchive/openlibrary/wiki/Disaster-History-Log. The goal was a searchable log of past events (in case of a repeat emergency). On 2019-12-27 @cdrini + @mekarpeles decided to switch to using GitHub issues with the Type: Post-Mortem label to document issues as they happen (and to post retro reflections).

Outage scenarios in descending order of occurrence:

Failing BWB Price Lookups

2019-12-27 15:15 PT: ol-web3 and ol-web4 failing to fetch BetterWorldBooks prices

BWB switched their product API (e.g. 'https://products.betterworldbooks.com/service.aspx?ItemId=9780142410752') to HTTPS only (for PCI compliance reasons), which broke Open Library's urllib2-based price lookups. The issue was documented in https://github.com/internetarchive/openlibrary/pull/2779 and we solved it by switching to requests. When deployed, this still didn't fix the problem on production. The solution was to upgrade pip on the production machines and install the latest pyopenssl and cryptography.
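For reference, a rough sketch of the production-side fix described above (the exact commands and use of sudo pip are assumptions; versions are simply whatever was latest at the time):

# Hedged sketch of the production fix (commands are illustrative, not the exact ones run):
$ sudo pip install --upgrade pip
$ sudo pip install --upgrade pyopenssl cryptography requests
# Sanity check that HTTPS calls to the BWB product API now succeed:
$ python -c "import requests; print(requests.get('https://products.betterworldbooks.com/service.aspx?ItemId=9780142410752', timeout=10).status_code)"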

OpenLibrary Blacklisted by ElasticSearch

2019-03-28 (and the week thereof): Overloaded Elastic Search. https://docs.google.com/document/d/1toBEk01cJ0uUyfOFIxrASlFzX9aPIML3vdj4WAXrxyY/edit#

Open Library was experiencing unresponsive web heads (ol-web3, ol-web4) via haproxy. The cause was that recent changes had been made to Archive.org Elastic Search so that clients making many concurrent requests get demoted and queued, so their requests do not take priority over users making individual requests. It's hypothesized that Open Library (as a whole) was somehow being treated as a single client (instead of Open Library's clients being passed through and treated as individual end-users), and presumably this resulted in Open Library being penalized for its high volume of searches. The end result was that ol-web3 and ol-web4 search requests were queued to the point of the Open Library service becoming unresponsive. Sam was able to discover the issue by attaching strace to a single thread of ol-home and noticing the process was hanging on search dispatch to Archive.org Elastic Search. The solution required unblocking Elastic Search on Archive.org's side. On ol-web3, debugging went like:

# Update `/olsystem/bin/upstart-service` so "function openlibrary-server" has NUM_WORKERS=1
$ sudo supervisorctl stop openlibrary  # stop gunicorn
$ sudo netstat -ntlp  # verify nothing listening on 7071 (otherwise kill listening pid)
$ sudo supervisorctl start openlibrary  # start gunicorn in single thread mode
$ sudo strace -tt -s 500 -p $pid_of_gunicorn
# Use curl to see what's happening and then audit the strace
# revert/checkout `/olsystem/bin/upstart-service` back to normal when finished

Upstart Log Massively Large

2019-09-12 19:59 - 20:52 ET: ol-web3:/var/log/openlibrary/upstart.log exploding in size, possibly due to archive.org serving 503s; some user disruption due to archive.org unavailability.

  • 19:59: abazella notified on slack about ol-web3:/var/log/openlibrary filling up
  • 20:17: Drini investigated and found upstart.log was 188GB and spitting out: 2019-09-13 00:18:43 [28300] [openlibrary.core.lending] [ERROR] POST failed ... HTTPError: HTTP Error 503: Service Temporarily Unavailable
  • 20:26: Linked to archive.org outage
  • 20:44: Drini (with buy-in from Mek) truncated the log file: sudo truncate upstart.log --size 0
  • Conclusion: It looks like ol-web3 is not rolling the log file correctly; that should be investigated.
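One possible follow-up, sketched here as a logrotate policy for that log file (the log path comes from this page; the drop-in location and rotation values are assumptions, not the actual fix):

# Hedged sketch: rotate the runaway log via logrotate (values below are assumptions):
$ sudo tee /etc/logrotate.d/openlibrary <<'EOF'
/var/log/openlibrary/upstart.log {
    daily
    maxsize 1G
    rotate 7
    compress
    copytruncate
    missingok
}
EOF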

Older History

  • 2019-03-29 5:00pm PST OL requests for archive.org ES getting queued until stopped

  • 2017-11-09 10:00pm PST

    • Situation: ol-www1 was getting clobbered shortly after syncing it in prep for migration from Funston to Richmond (but this was a red herring). Spikes of traffic were forcing haproxy to return 503s with haproxy status sQ.
    • Tried:
      • Restarting nginx
      • Restarting haproxy (helped for a short period)
      • Updated haproxy.conf to only log errors (this was helpful; helped us identify slow API calls)
      • Updated haproxy.conf to increase process count for ol-web3, ol-web4 (endpoints quickly became saturated again)
    • Resolution:
      • Identify 503 sQ using tail -f /var/log/haproxy.log on ol-www1
      • Run: sudo tail -n 5000 /var/log/nginx/access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail -n 10 | /home/samuel/work/reveal-abuse/reveal | /home/samuel/work/reveal-abuse/shownames
      • Blocking bad actors on ol-www1 nginx in /olsystem/etc/nginx/deny.conf (see the sketch after this list)
      • Moving forward, @jonah will help us implement haproxy stick-tables to temporarily ban IPs when usage becomes abusive/disruptive
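For reference, a hedged sketch of what blocking a bad actor via that deny file might look like (the IP addresses below are documentation placeholders, not actual offenders; the reload step is an assumption about how nginx config changes are applied on ol-www1):

# Hedged sketch: append deny rules for abusive clients (placeholder IPs):
$ echo 'deny 203.0.113.42;' | sudo tee -a /olsystem/etc/nginx/deny.conf
$ echo 'deny 198.51.100.0/24;' | sudo tee -a /olsystem/etc/nginx/deny.conf
# Validate the config and reload nginx so the new rules take effect:
$ sudo nginx -t && sudo nginx -s reload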
  • 2017-03-17 2:30pm PDT

  • 2017-04-01 2:00pm PDT

  • 2017-02-22 5:00am PST, 9:45am - 11:00am PST

  • 2017-02-13 6:00pm PST - 6:30pm PST

  • 2017-02-10 ~5:00pm PST - 5:30pm PST

    • Situation: Deploying IA/OL XAuth bridge, deployment restart failed (ol-web3, ol-web4)
    • Thanks: Sam, Mark, Jim, Brenton
    • Problem: Deployment process was not robust against new pip dependencies (in this case LEPL for email validation)
    • Learnings:
      • Changing symlinks must be done on all services (ol-web3, ol-web4, etc); see the sketch after this entry
      • Do not reset history in git as our deploy process looks only at the most recent hash it knows about!
    • Resolution: we added blue-green deployment to our docs + guide on adding dependencies
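A hedged sketch of checking that every web head points at the same deploy (the host names come from this entry; the /opt/openlibrary symlink path is purely illustrative):

# Hedged sketch: confirm each web head's deploy symlink and git hash match (paths are assumptions):
$ for host in ol-web3 ol-web4; do echo "== $host =="; ssh "$host" 'readlink -f /opt/openlibrary; git -C /opt/openlibrary rev-parse --short HEAD'; done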
  • 2016-12-23 11:50pm PST - 2016-12-24 6:30am PST

  • 2016-12-22 12:30pm PST - 2016-12-22 13:20pm PST

  • 2016-12-02 6:35pm PST - 2016-12-02 7:20pm PST