Web archive quality assurance workflow

After a test crawl has completed:

  • Open Web Archiving Trello Board
  • Identify a website URL to review in the “Ready for QA” list
  • Log in to Archive-It administrative mode
  • Click on the appropriate Collection (College Archives or Sophia Smith Collection)
  • Click on “Seeds”
  • At the top of the seeds list, type all or part of the URL to filter the list and locate the website you want to review (a scripted way to do this lookup is sketched after this list)
  • Click on the number in the “Captures” column to view all captures
  • Select the crawl you wish to review and click “View Report”
  • As you review the captured website, add QA notes in the Trello card Comments section
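
If you do this lookup often, the Archive-It Partner API can list a collection's seeds from a script instead of the admin UI. This is a minimal sketch only: the /api/seed endpoint follows Archive-It's public API documentation, but the auth scheme, collection ID, and field names here are assumptions to verify against that documentation before relying on them.

```python
import requests

# Archive-It Partner API seed listing (requires partner credentials).
# Endpoint, auth scheme, and field names are assumptions drawn from
# Archive-It's public API documentation -- verify before relying on them.
API = "https://partner.archive-it.org/api/seed"
COLLECTION_ID = 1234        # placeholder: your collection's numeric ID
URL_FILTER = "example.edu"  # all or part of the seed URL, as in the admin UI

resp = requests.get(
    API,
    params={"collection": COLLECTION_ID},
    auth=("your-username", "your-password"),  # placeholder credentials
    timeout=60,
)
resp.raise_for_status()

for seed in resp.json():
    if URL_FILTER in seed.get("url", ""):
        print(seed.get("id"), seed.get("url"))
```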

Analyze the crawl report

  • Check the overall crawl status
    • Note in the Trello card comments section if the crawl completed, or if it hit a time or data limit
    • Note the total amount of “New” data captured in the spreadsheet, including its unit (MB or GB)
  • Check seed status
    • Click on Seeds tab
    • If there is more than one seed (URL) listed, identify the correct one to review
      • Click on individual Seeds to view the host report
      • Note the amount of “New” data captured for just that seed
  • In the Hosts report, check to see if any documents were queued, blocked, or “Out of scope” from the original seed URL (note: you may see out-of-scope content from other hosts, such as YouTube, which is OK; you can ignore it). A script that tallies these counts from a downloaded report is sketched after this list.
  • In the Trello card comment section, note:
    • if the capture finished or hit time/data limit
    • the total amount of “New” data captured
    • approximately how many documents are queued or blocked for the main host URL
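
If you export the host report from Archive-It as a CSV, a short script can tally the queued, blocked, and out-of-scope counts needed for the Trello note. This is a sketch only: the filename and column names below are assumptions and should be checked against the actual export.

```python
import csv

REPORT = "host-report.csv"  # placeholder: a host report exported from Archive-It
COLUMNS = ("Queued", "Blocked", "Out of Scope")  # assumed column names

def count(row, column):
    """Read a numeric cell, tolerating blanks and thousands separators."""
    return int((row.get(column) or "0").replace(",", ""))

totals = {column: 0 for column in COLUMNS}

with open(REPORT, newline="") as f:
    for row in csv.DictReader(f):
        for column in COLUMNS:
            totals[column] += count(row, column)
        # Flag hosts worth mentioning in the QA note.
        if count(row, "Queued") or count(row, "Blocked"):
            print(f"{row.get('Host', '?')}: queued={count(row, 'Queued')}, "
                  f"blocked={count(row, 'Blocked')}")

print("Totals:", totals)
```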

View the captured site in Wayback

  • In the Seeds list click on “View in Wayback” (note: pages can take a long time to load in the Wayback Machine, so be patient!)
  • Verify that the archive of the website is complete and functional.
  • Click through all the sections of the website and note especially:
    • were all top-level pages captured?
    • are any styles (fonts, layout) or images missing?
    • if there is embedded media, was it captured, and does it play?
  • Open the live website in a separate window and compare it to the captured site (a rough scripted comparison is sketched after this list)
  • Note any issues re: missing content in your QA notes in the Trello card (example: “Home page slideshow images missing; main menu not displaying.”)
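
The visual click-through above is the real test, but a script can give a quick first pass at “are any styles or images missing?” by comparing the resource URLs referenced by the live page against the archived copy. This is a rough sketch using the standard library plus requests; the collection ID, timestamp, and URLs are placeholders, and the Archive-It Wayback URL pattern is /<collection ID>/<14-digit timestamp>/<URL>.

```python
import requests
from html.parser import HTMLParser

LIVE_URL = "https://www.example.edu/"  # placeholder: the live site
# Archive-It Wayback pattern: /<collection ID>/<timestamp>/<URL> (placeholders)
ARCHIVED_URL = ("https://wayback.archive-it.org/1234/20240101000000/"
                "https://www.example.edu/")

class ResourceCollector(HTMLParser):
    """Collect src/href values of images, scripts, and stylesheets."""
    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.add(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.add(attrs["href"])

def resource_names(url):
    parser = ResourceCollector()
    parser.feed(requests.get(url, timeout=120).text)
    # Wayback rewrites URLs, so compare trailing filenames, not full URLs.
    return {r.rstrip("/").rsplit("/", 1)[-1] for r in parser.resources}

missing = resource_names(LIVE_URL) - resource_names(ARCHIVED_URL)
for name in sorted(missing):
    print("possibly missing from the capture:", name)
```

This only flags candidates (the live site may have changed since the crawl), so treat the output as a list of things to check in Wayback, not as QA results.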

When done, move the Trello card to the “QA completed” column.

For web captures that have already been saved (not test crawls)

  • In the Seeds list click on “View in Wayback”
  • Click the most recent capture date (if that doesn’t work, click the next most recent); a TimeMap lookup that finds this date is sketched after this list
  • Click “Enable QA”
  • Click through all the sections of the website and note any issues as above.
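
If the calendar page is slow or confusing, the most recent capture date can also be read from a Memento TimeMap. This is a sketch only: the /timemap/link/ endpoint is an assumption based on Memento support in Archive-It's Wayback, and the collection ID and URL are placeholders.

```python
import re
import requests

# Memento TimeMap (link format) for one seed; collection ID and URL are
# placeholders -- the endpoint path is an assumption based on Memento
# support in Archive-It's Wayback.
TIMEMAP = ("https://wayback.archive-it.org/1234/timemap/link/"
           "https://www.example.edu/")

text = requests.get(TIMEMAP, timeout=60).text

# Each memento line carries a datetime="..." attribute; the last is newest.
dates = re.findall(r'datetime="([^"]+)"', text)
print("most recent capture:", dates[-1] if dates else "none found")
```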

Notes on QA for Facebook pages

  • You must be logged in to Facebook to view Facebook captures. You can use your own account or the SCL generic login; see https://trello.com/c/rQzlGbNI
  • If you see multiple capture links in the calendar page, review all of the links
  • For more information see “What to expect from archived Facebook seeds”

Troubleshooting tips

Wayback QA and patch crawling should be used:

  • If it appears that most of the website's content was captured but some elements of a page are missing (stylesheet, embedded images, etc.), or if you notice blocked content in the host report, you can try running a patch crawl.
  • Note: the test crawl must first be saved -- you cannot run a patch crawl on a test. Once saved, it takes up to 24 hours for the saved test to appear as a permanent archived seed in Wayback.
  • Do not save test crawls if there are other issues, such as possible crawler traps or other unwanted content. Check with your supervisor if you are uncertain whether a test crawl should be saved.

Recrawl to capture queued content

  • If a test crawl looks good and there are no obvious crawler traps, but there are queued documents, it means the site was too large to be captured in one session.
  • In this case the archivist will save the test crawl and run another to capture queued content. For first-time captures of very large websites, sometimes this requires multiple test crawls.
  • If there are suspected crawler traps (such as calendar pages), the test crawl is deleted, the crawl is scoped to omit specific types of content, and a new test crawl is run (an example trap pattern is sketched below).

Note: for regular (non-test) crawls, you can resume the crawl if you hit the time (or data) limit.
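
The scoping rules themselves live in Archive-It, but it can help to prototype a trap pattern before changing the scope. The example below is illustrative only: a regex of the general shape used to exclude calendar-style traps, tested against hypothetical sample URLs; it is not a rule taken from Archive-It.

```python
import re

# Illustrative calendar-trap pattern: date-parameter URLs under /calendar/.
# Not a rule taken from Archive-It -- adapt it to the trap you actually see.
CALENDAR_TRAP = re.compile(r"/calendar/.*[?&](date|month|year)=\d+")

urls = [
    "https://www.example.edu/calendar/?month=202401",        # trap page
    "https://www.example.edu/calendar/view?date=20391231",   # far-future trap
    "https://www.example.edu/news/calendar-of-events.html",  # legitimate page
]

for url in urls:
    print("BLOCK" if CALENDAR_TRAP.search(url) else "keep ", url)
```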

Brozzler (experimental crawling technology)

Use Brozzler only when you’ve tried standard crawls and a lot of content is still missing. It is specifically recommended for:

  • sites with dynamic content
  • specific platforms, such as Wix sites and Instagram

You can also try using Webrecorder (aka Conifer) to capture dynamic sites that can’t be captured by Archive-It.