Web archive quality assurance workflow
After a test crawl has completed:
- Open the Web Archiving Trello board
- Identify a website URL to review in the “Ready for QA” list
- Log in to Archive-It administrative mode
- Click on the appropriate Collection (College Archives or Sophia Smith Collection)
- Click on “Seeds”
- At the top of the seeds list, type all or part of the URL to filter the list and locate the website you want to review
- Click on the number in the “Captures” column to view all captures
- Select the crawl you wish to review and click “View Report”
- As you review the captured website, add QA notes in the Trello card Comments section
Analyze the crawl report
- Check the overall crawl status
- Note in the Trello card comments section if the crawl completed, or if it hit a time or data limit
- Note the total amount of “New” data captured in the spreadsheet, including the unit (MB or GB)
- Check seed status
- Click on Seeds tab
- If there is more than one seed (URL) listed, identify the correct one to review
- Click on individual Seeds to view the host report
- Note the amount of “New” data captured for just that seed
- In the Hosts report, check whether any documents were queued, blocked, or “Out of scope” for the original seed URL (note: you may see out-of-scope content from different hosts, such as YouTube, which is OK; you can ignore them)
- In the Trello card comment section, note:
  - if the capture finished or hit the time/data limit
  - the total amount of “New” data captured
  - approximately how many documents are queued or blocked for the main host URL
View the captured site in Wayback
- In the Seeds list click on “View in Wayback” (note: pages can take a long time to load in the Wayback machine, be patient!)
- Verify that you have a complete and functional archive of the website.
- Click through all the sections of the website and note especially:
  - were all top-level pages captured?
  - are any styles (fonts, layout) or images missing?
  - if there is embedded media, was it captured and does it play?
- Open the live website in a separate window and compare to the captured site
- Note any issues re: missing content in your QA notes in the Trello card (example: “Home page slideshow images missing; main menu not displaying.”)
When done, move the Trello card to the “QA completed” column.
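The live-vs-capture comparison above is done by eye, but it can also be spot-checked programmatically. The sketch below is an illustrative aid, not part of the official workflow: it extracts image and stylesheet URLs from two HTML documents (one saved from the live site, one from the capture) and reports what the capture is missing. Note that Wayback rewrites resource URLs in archived pages, so in practice you would normalize paths before comparing.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects image and stylesheet URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.resources.add(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.add(attrs["href"])


def missing_resources(live_html, archived_html):
    """Return resource URLs found on the live page but absent from the capture."""
    live, archived = ResourceCollector(), ResourceCollector()
    live.feed(live_html)
    archived.feed(archived_html)
    return sorted(live.resources - archived.resources)
```

For example, if the live home page references two slideshow images and a stylesheet but the capture only references one image, the other image and the stylesheet would be reported, which is the kind of detail to record in the Trello QA notes.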
For web captures that have already been saved (not test crawls)
- In the Seeds list click on “View in Wayback”
- Click the most recent capture date (if that doesn’t work click the next most recent)
- Click “Enable QA”
- Click through all the sections of the website and note any issues as above.
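When it helps to jump straight to a capture instead of clicking through the Seeds list, Archive-It Wayback URLs generally follow the pattern `https://wayback.archive-it.org/<collection ID>/<timestamp>/<seed URL>`, where the timestamp is `YYYYMMDDhhmmss` for a specific capture or `*` for the calendar page of all captures. A minimal sketch (the collection ID `1234` is hypothetical):

```python
WAYBACK_BASE = "https://wayback.archive-it.org"


def wayback_url(collection_id, seed_url, timestamp="*"):
    """Build an Archive-It Wayback URL for a seed.

    timestamp is YYYYMMDDhhmmss for a specific capture, or "*" to open
    the calendar page listing every capture of the seed.
    """
    return f"{WAYBACK_BASE}/{collection_id}/{timestamp}/{seed_url}"
```

For example, `wayback_url(1234, "https://example.org/")` yields the calendar-page URL for that seed in collection 1234.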
Notes on QA for Facebook pages
- You must be logged in to Facebook to view Facebook captures. You can use your own account or the SCL generic login; see https://trello.com/c/rQzlGbNI
- If you see multiple capture links on the calendar page, review all of the links
- For more information see “What to expect from archived Facebook seeds”
More QA help and tips:
Troubleshooting tips
Wayback QA and patch crawling should be used:
- If it appears most of the content of the website was captured but there are missing elements on a page (stylesheet, embedded images, etc.), or if you notice blocked content in the host report, you can try running a patch crawl.
- Note: the test crawl must first be saved -- you cannot run a patch crawl on a test. Once saved, it takes up to 24 hours for the saved test to appear as a permanent archived seed in Wayback.
- Do not save test crawls if there are other issues, such as possible crawler traps or other unwanted content. Check with your supervisor if you are uncertain whether a test crawl should be saved.
Recrawl to capture queued content
- If a test crawl looks good and there are no obvious crawler traps, but there are queued documents, the site was too large to be captured in one session.
- In this case the archivist will save the test crawl and run another to capture queued content. For first-time captures of very large websites, sometimes this requires multiple test crawls.
- If there are suspected crawler traps (such as calendar pages) then the test crawl is deleted, the crawl is scoped to omit specific types of content, and a new test crawl is run.
Note: for regular (non-test) crawls, you can resume the crawl if you hit the time (or data) limit.
Brozzler (experimental crawling technology)
Use Brozzler only when you’ve tried standard crawls and a lot of content is missing. Specifically recommended for:
- sites with dynamic content
- specific types of sites: Wix sites; Instagram
You can also try using WebRecorder (aka Conifer) to capture dynamic sites that can’t be captured by Archive-It.