Adding Archiving to Scraper checklist - everypolitician/everypolitician GitHub Wiki
Checklist to include when adding the scraped_page_archive to a scraper:
(NB this assumes that the Scraper Change checklist is already included).
Adding Archiving to Scraper checklist
- 1. are we using at least version 0.5 of the scraped_page_archive gem?
- 2. does the scraper use the scraped_page_archive gem, either directly or via a suitable strategy?
- 3. is MORPH_SCRAPER_CACHE_GITHUB_REPO_URL configured?
- 4. are pages being archived in a new branch of the correct scraper repo?
Notes:
1: Versions before this were significantly slower, as we were recloning the repo on every request, so make sure we're using at least this version. (If the gem is required directly, then the Gemfile.lock should list its version number.)
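One way to check note 1 without reading the lockfile by eye is sketched below. It assumes the usual Gemfile.lock layout, where resolved gems appear indented by four spaces as `name (version)`; the helper names `locked_archive_version` and `archive_version_ok?` are hypothetical and not part of the gem.

```ruby
require 'rubygems'

# Minimum version named in the checklist.
MINIMUM_ARCHIVE_VERSION = Gem::Version.new('0.5')

# Returns the locked scraped_page_archive version as a Gem::Version,
# or nil if the gem has no resolved entry in the given Gemfile.lock text.
# Assumption: resolved gems are listed as "    name (version)".
def locked_archive_version(lockfile_text)
  match = lockfile_text.match(/^ {4}scraped_page_archive \(([^)]+)\)$/)
  match && Gem::Version.new(match[1])
end

# True only when the gem is locked at or above the minimum version.
def archive_version_ok?(lockfile_text)
  version = locked_archive_version(lockfile_text)
  !version.nil? && version >= MINIMUM_ARCHIVE_VERSION
end
```

From the scraper's directory this could be called as `archive_version_ok?(File.read('Gemfile.lock'))`. A `false` result means either an older version or no resolved entry at all, so it is worth looking at the lockfile in that case.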
2: If this is an "old-style" simple scraper where everything happens in scraper.rb itself, this will often simply be a matter of adding a require 'scraped_page_archive/open-uri' (and making sure it's not clashing with anything else, like open-uri-cached). If it's a "new-style" scraper using ScrapedPage, then this will require configuring a suitable response strategy. The checklist should note which is happening.
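For the "old-style" case in note 2, a rough sanity check on scraper.rb might look like the sketch below. It only inspects plain `require '...'` lines, and it assumes open-uri-cached would show up as a require containing `open-uri/cached` or `open-uri-cached`. The helper names are hypothetical, and this does nothing for "new-style" scrapers, where the response strategy still needs checking by hand.

```ruby
# The require named in the checklist notes.
ARCHIVE_REQUIRE = 'scraped_page_archive/open-uri'

# Assumption: a clashing open-uri-cached require matches this pattern.
CLASHING_PATTERN = %r{open-uri[/-]cached}

# Collects the library names from simple `require '...'` lines.
def required_libraries(source)
  source.scan(/^\s*require\s+['"]([^'"]+)['"]/).flatten
end

# Returns a list of human-readable problems; empty means nothing was spotted.
def archive_require_problems(source)
  libs = required_libraries(source)
  problems = []
  problems << "missing require '#{ARCHIVE_REQUIRE}'" unless libs.include?(ARCHIVE_REQUIRE)
  clashes = libs.grep(CLASHING_PATTERN)
  problems << "possible clash with: #{clashes.join(', ')}" unless clashes.empty?
  problems
end
```

Something like `puts archive_require_problems(File.read('scraper.rb'))` would then print any problems found.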
3: This will need to be set in the "secret environment variables" section of the Morph configuration. If you don't have permission to do that (e.g. the scraper is in someone's personal Morph account), you should ping the owner here to let them know to do this. If you have moved the scraper to a different Morph account as part of this change, you should also let the owner of the old version know; otherwise the version running in their Morph account will suddenly break once this is merged, as it won't have the archiving configuration.
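To make a missing setting from note 3 obvious rather than silent, a scraper could carry a small guard along these lines. This is only a sketch: the helper name `archive_repo_configured?` is hypothetical, and it checks only that the variable is present and non-blank, not that its value is valid.

```ruby
# The variable named in the checklist.
ARCHIVE_REPO_VARIABLE = 'MORPH_SCRAPER_CACHE_GITHUB_REPO_URL'

# True when the variable is present and not blank in the given environment.
# Defaults to the real process environment; a plain Hash can be passed instead.
def archive_repo_configured?(env = ENV)
  !env[ARCHIVE_REPO_VARIABLE].to_s.strip.empty?
end

# Print a warning to stderr early in the run if the setting is absent.
warn "#{ARCHIVE_REPO_VARIABLE} is not set" unless archive_repo_configured?
```

Whether to warn or abort at this point is a judgment call for each scraper; the sketch only warns.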
4: When adding archiving you should do a dummy run locally to make sure everything is working fine, and you see data actually appearing in the archive. You don't need to run that to completion, especially for a scraper that will take a long time, but you should verify that at least one of each "type" of page you expect to see there gets archived (e.g. list page + individual member page).
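The "at least one of each type" check in note 4 can be sketched as below, assuming you have already collected the URLs that ended up in the archive after the dummy run (how you list them depends on the archive layout, which is not covered here). The helper name `missing_page_types` and the example patterns are hypothetical.

```ruby
# Returns the names of expected page types for which no archived URL matched.
# archived_urls:  Array of URL strings seen in the archive.
# expected_types: Hash of type name => Regexp that URLs of that type match.
def missing_page_types(archived_urls, expected_types)
  unmatched = expected_types.reject do |_name, pattern|
    archived_urls.any? { |url| url.match?(pattern) }
  end
  unmatched.keys
end

# Illustrative patterns only; real ones depend on the site being scraped.
EXAMPLE_TYPES = {
  'list page'   => %r{/members/?$},
  'member page' => %r{/members/\d+$}
}.freeze
```

An empty result from `missing_page_types(urls, EXAMPLE_TYPES)` would suggest every expected type was archived at least once; any names returned are the types still worth checking before stopping the run.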