Web archiving workflow

Workflow overview

  • Add a new seed
  • Edit seed settings and add basic metadata
  • Run a test crawl
  • Review the test crawl
  • Save the test crawl, or delete it
  • Scope for future crawls, if necessary
  • Run a new test crawl, if necessary
  • Schedule future crawls
  • Describe archived website in finding aid

The workflow is tracked in the Web Archiving Trello Board and a list of all seeds is recorded in the Archive-It seeds spreadsheet.


Detailed steps

Add a new seed (website) to crawl

By default, new seeds will appear at the top of the seed list in Collection--Seeds view.

Edit seed settings and add basic metadata

  • Click on the seed URL to add or edit settings, metadata, notes, and scope
  • Click Metadata -- Edit
  • Enter required fields, clicking Add after entering each:
    • Title: official title of the website as it appears at the top of the page. Qualify if more than one site. Use “Grab Title” and edit if necessary. Example: “Arise for Social Justice (blog)”
    • Creator: full name of the individual or organization that owns the site
    • Description: see the abstract from Inmagic or the online finding aid. Keep it brief. Example: “Grassroots advocacy, low-income rights and social justice organization based in Springfield, Massachusetts.”
    • Relation: full title of the collection, e.g., Arise for Social Justice Records
    • Collector: “Sophia Smith Collection, Smith College” or “Smith College Archives”, etc.
  • Other fields are optional:
    • For Subjects use LCSH authority.
    • Notes for the public may be added as a Custom Field (field name: “Note”). Private notes should be entered under the Notes tab. Examples: “Captured by request of donor”; “Live site is inactive, last updated 2008”
  • Click Done when finished entering metadata
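
The same required fields also go into the Archive-It seeds spreadsheet. As a minimal sketch, assuming the tracking spreadsheet is a local CSV with one column per field (the column names and file name are assumptions, not Archive-It's own format):

```python
import csv

# Assumed columns for the local seeds tracking spreadsheet; adjust to match
# the actual Archive-It seeds spreadsheet. These are not Archive-It API fields.
SEED_FIELDS = ["url", "title", "creator", "description", "relation", "collector"]

def record_seed(path, seed):
    """Append one seed's basic metadata as a row in the tracking CSV."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=SEED_FIELDS)
        if f.tell() == 0:  # brand-new file: write the header row first
            writer.writeheader()
        writer.writerow(seed)

record_seed("archive-it-seeds.csv", {
    "url": "https://example.org/",  # placeholder URL
    "title": "Arise for Social Justice (blog)",
    "creator": "Arise for Social Justice",
    "description": "Grassroots advocacy, low-income rights and social justice "
                   "organization based in Springfield, Massachusetts.",
    "relation": "Arise for Social Justice Records",
    "collector": "Sophia Smith Collection, Smith College",
})
```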

Adding seeds to Groups

Bulk edit: This feature allows you to edit in bulk any attribute of a seed (such as activating/deactivating a seed, changing the crawl frequency, editing metadata, assigning groups, etc.). To use it, select the check-box next to each seed you would like to edit, then click the 'Bulk Edit' button in the upper right corner. You can also add seed metadata in bulk; see the instructions.

Active vs. inactive seeds

A seed or collection is considered "active" when it is scheduled for crawling. If you no longer want a seed or collection to be scheduled for crawling, you can make it inactive. Inactive collections can still be accessed by the public.

A collection is considered inactive if it is not currently scheduled for crawling but may be in the future. Seeds or collections can be changed from active to inactive at any time.
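
If the seeds spreadsheet records a status for each seed, a quick filter can list what is currently scheduled for crawling. A minimal sketch, assuming the tracking CSV from above with an added "status" column (an assumption about the local spreadsheet, not an Archive-It field):

```python
import csv

def active_seeds(path):
    """Return rows for seeds marked active in the tracking spreadsheet."""
    with open(path, encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if row.get("status", "").strip().lower() == "active"]

for seed in active_seeds("archive-it-seeds.csv"):
    print(seed["url"])
```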

Run a test crawl

  • Select the seed in the seed list and click Run Crawl

  • Crawl type: Test Crawl
  • Adjust time limit, depending on size of site
  • For new test crawls, do not apply a data limit, and set the time limit to the maximum (1 week)

Review the test crawl

  • Follow the Quality assurance workflow
  • If there is a lot of queued content, and it looks legitimate (i.e., not a crawler trap), save the test and run another test crawl to capture the queued content.
  • If it appears there are missing elements on a page (stylesheet, embedded images, etc.), or if you notice blocked content in the host report, save the test; you can then go back and run a patch crawl later.
  • If it looks like you hit a crawler trap and have captured unwanted content, delete the test crawl, adjust the scoping of the seed, and run a new test crawl (see the sketch below for one way to triage a suspicious URL list).
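
One way to triage a suspicious crawl is to scan an exported list of queued or crawled URLs for patterns that commonly indicate traps. A rough sketch; the file format (one URL per line), the patterns, and the threshold are all assumptions, not Archive-It features:

```python
import re
from collections import Counter

# Heuristic patterns that often signal crawler traps; tune these for your sites.
TRAP_PATTERNS = [
    re.compile(r"/calendar/"),            # endless calendar pagination
    re.compile(r"[?&](sessionid|sid)="),  # session IDs spawn infinite URL variants
    re.compile(r"(/[^/]+)\1{3,}"),        # the same path segment repeated 4+ times
]

def flag_possible_traps(url_list_path, min_hits=50):
    """Count URLs matching each trap pattern; report patterns over the threshold."""
    hits = Counter()
    with open(url_list_path, encoding="utf-8") as f:
        for url in f:
            for pattern in TRAP_PATTERNS:
                if pattern.search(url):
                    hits[pattern.pattern] += 1
    return {pat: n for pat, n in hits.items() if n >= min_hits}

# Assumes you have exported the crawl's URL list to a local text file.
print(flag_possible_traps("queued-urls.txt"))
```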

Schedule future crawls for active websites

  • New seeds should always be set to "One-time capture" and a test crawl should be run and checked in order to assure there are no surprises (such as unwanted content) before scheduling automatic crawls.
  • Future crawls can be manual (test) or automatic, depending on how successful the first capture was. For problematic, large, or complex websites, always do a manual (test) crawl and review it before saving.
  • The type of crawl (manual or automatic) is recorded in the spreadsheet
  • Follow the protocols below to schedule recurring captures
  • Record schedule in the spreadsheet
  • Continue to conduct periodic quality assurance on recurring crawls.

Scheduling automatic crawls

  • For one seed, you can set the frequency of crawls in the seed settings (see Protocols for crawl frequency below)
  • At the collection level:
    • Under collection, click on Crawls - Crawl Schedule
    • Edit limits (7 days max)
    • Schedule crawl - schedule a date and time in the future (note: if you click “Crawl Now”, you still have to schedule future crawls)
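
When recording schedules in the spreadsheet, a due date can be derived from the frequency label. A minimal sketch; the day counts approximate the frequency labels in the protocols below and are not Archive-It's internal scheduler:

```python
from datetime import date, timedelta

# Approximate intervals for the frequency labels used on this page.
FREQUENCY_DAYS = {"weekly": 7, "quarterly": 91, "semi-annual": 182, "annual": 365}

def next_crawl_date(last_crawl: date, frequency: str) -> date:
    """Estimate when the next scheduled crawl is due."""
    return last_crawl + timedelta(days=FREQUENCY_DAYS[frequency])

print(next_crawl_date(date(2024, 1, 15), "quarterly"))  # 2024-04-15
```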

Making changes to seeds

  • If the URL changes for a website:
    • If the URL change is just from HTTP to HTTPS, edit the existing seed.
    • If the domain has changed, add a new seed to the collection; do not edit the existing seed URL. Use seed-level metadata in both seeds to clarify the relationship between the old and current URL for end users. You can also create a Group to link them. The old seed should be made inactive, and future crawls should point to the new URL.

Describe archived website in finding aid

See guidelines for Describing archived websites


Access protocols

By default, all archived content is publicly available on your Archive-It.org homepage to be browsed and searched by patrons. Archived content is available to view within 24 hours after a crawl has completed. Full-text search processing of archived content can take up to 7 days.

Ways to restrict content:

  • Private vs. public: in seed settings, uncheck the “Publicly Visible” box. The site will no longer appear on your Archive-It.org homepage. (Note: you can also change public/private settings for a whole collection.)
  • Restrict access in Wayback based on IP range (i.e., to users in the reading room only): patrons who try to access content outside of the IP range (for example, from a home computer) will see an Access Denied message, which can be customized for your organization. Notify the Archive-It team to implement this feature; the sketch after this list illustrates the underlying check.
  • Remove access to content and full-text search in Wayback.
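
The enforcement happens on Archive-It's side, but the underlying IP-range check is simple to illustrate. A sketch using Python's standard ipaddress module; the network below is a documentation placeholder (RFC 5737), not Smith's actual reading-room range:

```python
import ipaddress

# Placeholder reading-room network; substitute the real IP range on file
# with the Archive-It team.
READING_ROOM_NET = ipaddress.ip_network("192.0.2.0/24")

def in_reading_room(client_ip: str) -> bool:
    """True if the request originates inside the permitted IP range."""
    return ipaddress.ip_address(client_ip) in READING_ROOM_NET

print(in_reading_room("192.0.2.42"))   # True: content is served
print(in_reading_room("203.0.113.7"))  # False: patron sees Access Denied
```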

Protocols for crawl frequency

Smith College sites

  • Weekly: sites or pages that are updated regularly and content is high priority to preserve
    • example: official campus news sites and smith.edu home page
  • Quarterly: Sites where content changes each semester and summer, and previous content is likely to be deleted
    • example: LibGuides
  • Semi-annual: content changes significantly each semester and previous content may be deleted
    • example: Sophian; active social media sites
  • Annual: sites that tend to remain stable and content is not likely to be deleted
  • One-time capture:
    • New seeds or troublesome websites that require attention each time they are crawled
    • sites where content changes very infrequently and capture can be done manually, as needed (example: SC Omeka site). You can add a reminder to your calendar to check once a year for new content.
    • Inactive sites (not being updated at all)
    • Temporary sites (example: a one-time conference blog site)

SSC donor sites

  • Annual: most active sites are crawled annually, except as noted below
  • Semi-annual:
    • very large and active sites (examples: PPFA, YWCA, AAUW)
    • sites where lots of video is added regularly
    • active social media sites
    • Also, if a donor tells us that content is periodically removed, we may schedule more frequent captures.
  • One-time capture:
    • New seeds and troublesome websites that require attention each time they are crawled
    • Inactive sites
    • Temporary sites (example: conference site)

Additional resources: