Web archive quality assurance workflow

After a test crawl has completed:

  • Open Web Archiving Trello Board
  • Identify a website URL to review in the “Ready for QA” list
  • Log in to Archive-It administrative mode
  • Click on the appropriate Collection (College Archives or Sophia Smith Collection)
  • Click on “Seeds”
  • At the top of the seeds list, type all or part of the URL to filter the list and locate the website you want to review (a scripted way to do this lookup is sketched after this list)
  • Click on the number in the “Captures” column to view all captures
  • Select the crawl you wish to review and click “View Report”
  • As you review the captured website, add QA notes in the Trello card Comments section
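
If you do this lookup often, the Archive-It Partner API can list a collection's seeds from a script instead of the admin UI. This is a minimal sketch only: the /api/seed endpoint follows Archive-It's public API documentation, but the auth scheme, collection ID, and field names here are assumptions to verify against that documentation before relying on them.

```python
import requests

# Archive-It Partner API seed listing (requires partner credentials).
# Endpoint, auth scheme, and field names are assumptions drawn from
# Archive-It's public API documentation -- verify before relying on them.
API = "https://partner.archive-it.org/api/seed"
COLLECTION_ID = 1234        # placeholder: your collection's numeric ID
URL_FILTER = "example.edu"  # all or part of the seed URL, as in the admin UI

resp = requests.get(
    API,
    params={"collection": COLLECTION_ID},
    auth=("your-username", "your-password"),  # placeholder credentials
    timeout=60,
)
resp.raise_for_status()

for seed in resp.json():
    if URL_FILTER in seed.get("url", ""):
        print(seed.get("id"), seed.get("url"))
```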

Analyze the crawl report

  • Check the overall crawl status
    • Note in the Trello card comments section if the crawl completed, or if it hit a time or data limit
    • Note the total amount of “New” data captured in the spreadsheet, including its unit (MB or GB)
  • Check seed status
    • Click on Seeds tab
    • If there is more than one seed (URL) listed, identify the correct one to review
      • Click on individual Seeds to view the host report
      • Note the amount of “New” data captured for just that seed
  • In the Hosts report, check to see if any documents were queued, blocked, or “Out of scope” from the original seed URL (note: you may see out-of-scope content from other hosts, such as YouTube, which is OK; you can ignore it). A script that tallies these counts from a downloaded report is sketched after this list.
  • In the Trello card comment section, note:
    • if the capture finished or hit time/data limit
    • the total amount of “New” data captured
    • approximately how many documents are queued or blocked for the main host URL
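
If you export the host report from Archive-It as a CSV, a short script can tally the queued, blocked, and out-of-scope counts needed for the Trello note. This is a sketch only: the filename and column names below are assumptions and should be checked against the actual export.

```python
import csv

REPORT = "host-report.csv"  # placeholder: a host report exported from Archive-It
COLUMNS = ("Queued", "Blocked", "Out of Scope")  # assumed column names

def count(row, column):
    """Read a numeric cell, tolerating blanks and thousands separators."""
    return int((row.get(column) or "0").replace(",", ""))

totals = {column: 0 for column in COLUMNS}

with open(REPORT, newline="") as f:
    for row in csv.DictReader(f):
        for column in COLUMNS:
            totals[column] += count(row, column)
        # Flag hosts worth mentioning in the QA note.
        if count(row, "Queued") or count(row, "Blocked"):
            print(f"{row.get('Host', '?')}: queued={count(row, 'Queued')}, "
                  f"blocked={count(row, 'Blocked')}")

print("Totals:", totals)
```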

View the captured site in Wayback

  • In the Seeds list click on “View in Wayback” (note: pages can take a long time to load in the Wayback Machine, so be patient!)
  • Verify that the archive of the website is complete and functional.
  • Click through all the sections of the website and note especially:
    • were all top-level pages captured?
    • are any styles (fonts, layout) or images missing?
    • if there is embedded media, was it captured, and does it play?
  • Open the live website in a separate window and compare it to the captured site (a rough scripted comparison is sketched after this list)
  • Note any issues re: missing content in your QA notes in the Trello card (example: “Home page slideshow images missing; main menu not displaying.”)
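
The visual click-through above is the real test, but a script can give a quick first pass at “are any styles or images missing?” by comparing the resource URLs referenced by the live page against the archived copy. This is a rough sketch using the standard library plus requests; the collection ID, timestamp, and URLs are placeholders, and the Archive-It Wayback URL pattern is /<collection ID>/<14-digit timestamp>/<URL>.

```python
import requests
from html.parser import HTMLParser

LIVE_URL = "https://www.example.edu/"  # placeholder: the live site
# Archive-It Wayback pattern: /<collection ID>/<timestamp>/<URL> (placeholders)
ARCHIVED_URL = ("https://wayback.archive-it.org/1234/20240101000000/"
                "https://www.example.edu/")

class ResourceCollector(HTMLParser):
    """Collect src/href values of images, scripts, and stylesheets."""
    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.add(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.add(attrs["href"])

def resource_names(url):
    parser = ResourceCollector()
    parser.feed(requests.get(url, timeout=120).text)
    # Wayback rewrites URLs, so compare trailing filenames, not full URLs.
    return {r.rstrip("/").rsplit("/", 1)[-1] for r in parser.resources}

missing = resource_names(LIVE_URL) - resource_names(ARCHIVED_URL)
for name in sorted(missing):
    print("possibly missing from the capture:", name)
```

This only flags candidates (the live site may have changed since the crawl), so treat the output as a list of things to check in Wayback, not as QA results.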

When done, move the Trello card to the “QA completed” column.

For web captures that have already been saved (not test crawls)

  • In the Seeds list click on “View in Wayback”
  • Click the most recent capture date (if that doesn’t work, click the next most recent); a TimeMap lookup that finds this date is sketched after this list
  • Click “Enable QA”
  • Click through all the sections of the website and note any issues as above.
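
If the calendar page is slow or confusing, the most recent capture date can also be read from a Memento TimeMap. This is a sketch only: the /timemap/link/ endpoint is an assumption based on Memento support in Archive-It's Wayback, and the collection ID and URL are placeholders.

```python
import re
import requests

# Memento TimeMap (link format) for one seed; collection ID and URL are
# placeholders -- the endpoint path is an assumption based on Memento
# support in Archive-It's Wayback.
TIMEMAP = ("https://wayback.archive-it.org/1234/timemap/link/"
           "https://www.example.edu/")

text = requests.get(TIMEMAP, timeout=60).text

# Each memento line carries a datetime="..." attribute; the last is newest.
dates = re.findall(r'datetime="([^"]+)"', text)
print("most recent capture:", dates[-1] if dates else "none found")
```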

Notes on QA for Facebook pages

  • You must be logged in to Facebook to view Facebook captures. You can use your own account or the SCL generic login; see https://trello.com/c/rQzlGbNI
  • If you see multiple capture links in the calendar page, review all of the links
  • For more information see “What to expect from archived Facebook seeds”

Troubleshooting tips

Wayback QA and patch crawling should be used:

  • If it appears that most of the website's content was captured but some elements of a page are missing (stylesheet, embedded images, etc.), or if you notice blocked content in the host report, you can try running a patch crawl.
  • Note: the test crawl must first be saved -- you cannot run a patch crawl on a test. Once saved, it takes up to 24 hours for the saved test to appear as a permanent archived seed in Wayback.
  • Do not save test crawls if there are other issues, such as possible crawler traps or other unwanted content. Check with your supervisor if you are uncertain whether a test crawl should be saved.

Recrawl to capture queued content

  • If a test crawl looks good and there are no obvious crawler traps, but there are queued documents, it means the site was too large to be captured in one session.
  • In this case the archivist will save the test crawl and run another to capture queued content. For first-time captures of very large websites, sometimes this requires multiple test crawls.
  • If there are suspected crawler traps (such as calendar pages), the test crawl is deleted, the crawl is scoped to omit specific types of content, and a new test crawl is run (an example trap pattern is sketched below).

Note: for regular (non-test) crawls, you can resume the crawl if you hit the time (or data) limit.
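
The scoping rules themselves live in Archive-It, but it can help to prototype a trap pattern before changing the scope. The example below is illustrative only: a regex of the general shape used to exclude calendar-style traps, tested against hypothetical sample URLs; it is not a rule taken from Archive-It.

```python
import re

# Illustrative calendar-trap pattern: date-parameter URLs under /calendar/.
# Not a rule taken from Archive-It -- adapt it to the trap you actually see.
CALENDAR_TRAP = re.compile(r"/calendar/.*[?&](date|month|year)=\d+")

urls = [
    "https://www.example.edu/calendar/?month=202401",        # trap page
    "https://www.example.edu/calendar/view?date=20391231",   # far-future trap
    "https://www.example.edu/news/calendar-of-events.html",  # legitimate page
]

for url in urls:
    print("BLOCK" if CALENDAR_TRAP.search(url) else "keep ", url)
```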

Brozzler (experimental crawling technology)

Use Brozzler only when you’ve tried standard crawls and a lot of content is still missing. It is specifically recommended for:

  • sites with dynamic content
  • specific platforms, such as Wix sites and Instagram

You can also try using Webrecorder (aka Conifer) to capture dynamic sites that can’t be captured by Archive-It.