Web archiving workflow

Workflow overview

  • Add a new seed
  • Edit seed settings and add basic metadata
  • Run a test crawl
  • Review the test crawl
  • Save the test crawl, or delete it
  • Scope for future crawls, if necessary
  • Run a new test crawl, if necessary
  • Schedule future crawls
  • Describe archived website in finding aid

The workflow is tracked in the Web Archiving Trello Board and a list of all seeds is recorded in the Archive-It seeds spreadsheet.


Detailed steps

Add a new seed (website) to crawl

By default, new seeds will appear at the top of the seed list in Collection--Seeds view.

Edit seed settings and add basic metadata

  • Click on the seed URL to add or edit settings, metadata, notes, and scope
  • Click Metadata -- Edit
  • Enter required fields, clicking Add after entering each:
    • Title: official title of the website as it appears at the top of the page. Qualify if more than one site. Use “Grab Title” and edit if necessary. Example: “Arise for Social Justice (blog)”
    • Creator: full name of the individual or organization that owns the site
    • Description: see the abstract from Inmagic or the online finding aid. Keep it brief. Example: “Grassroots advocacy, low-income rights and social justice organization based in Springfield, Massachusetts.”
    • Relation: full title of the collection, e.g., Arise for Social Justice Records
    • Collector: “Sophia Smith Collection, Smith College” or “Smith College Archives”, etc.
  • Other fields are optional:
    • For Subjects use LCSH authority.
    • Notes for the public may be added as a Custom Field (field name: “Note”). Private notes should be entered under the Notes tab. Examples: “Captured by request of donor”; “Live site is inactive, last updated 2008”
  • Click Done when finished entering metadata
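
The same required fields also go into the Archive-It seeds spreadsheet. As a minimal sketch, assuming the tracking spreadsheet is a local CSV with one column per field (the column names and file name are assumptions, not Archive-It's own format):

```python
import csv

# Assumed columns for the local seeds tracking spreadsheet; adjust to match
# the actual Archive-It seeds spreadsheet. These are not Archive-It API fields.
SEED_FIELDS = ["url", "title", "creator", "description", "relation", "collector"]

def record_seed(path, seed):
    """Append one seed's basic metadata as a row in the tracking CSV."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=SEED_FIELDS)
        if f.tell() == 0:  # brand-new file: write the header row first
            writer.writeheader()
        writer.writerow(seed)

record_seed("archive-it-seeds.csv", {
    "url": "https://example.org/",  # placeholder URL
    "title": "Arise for Social Justice (blog)",
    "creator": "Arise for Social Justice",
    "description": "Grassroots advocacy, low-income rights and social justice "
                   "organization based in Springfield, Massachusetts.",
    "relation": "Arise for Social Justice Records",
    "collector": "Sophia Smith Collection, Smith College",
})
```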

Adding seeds to Groups

Bulk edit: This feature allows you to edit in bulk any attribute of a seed (such as activating/deactivating a seed, changing the crawl frequency, editing metadata, assigning groups, etc.). To use it, select the check-box next to each seed you would like to edit, then click the 'Bulk Edit' button in the upper right corner. You can also add seed metadata in bulk; see the instructions.

Active vs. inactive seeds

A seed or collection is considered "active" when it is scheduled for crawling. If you no longer want a seed or collection to be scheduled for crawling, you can make it inactive. Inactive collections can still be accessed by the public.

A collection is considered inactive if it is not currently scheduled for crawling but may be in the future. Seeds or collections can be changed from active to inactive at any time.
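
If the seeds spreadsheet records a status for each seed, a quick filter can list what is currently scheduled for crawling. A minimal sketch, assuming the tracking CSV from above with an added "status" column (an assumption about the local spreadsheet, not an Archive-It field):

```python
import csv

def active_seeds(path):
    """Return rows for seeds marked active in the tracking spreadsheet."""
    with open(path, encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if row.get("status", "").strip().lower() == "active"]

for seed in active_seeds("archive-it-seeds.csv"):
    print(seed["url"])
```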

Run a test crawl

  • Select the seed in the seed list and click Run Crawl

  • Crawl type: Test Crawl
  • Adjust time limit, depending on size of site
  • For new test crawls, do not apply a data limit, and set the time limit to the maximum (1 week)

Review the test crawl

  • Follow the Quality assurance workflow
  • If there is a lot of queued content, and it looks legitimate (i.e., not a crawler trap), save the test and run another test crawl to capture the queued content.
  • If it appears there are missing elements on a page (stylesheet, embedded images, etc.), or if you notice blocked content in the host report, save the test; you can then go back and run a patch crawl later.
  • If it looks like you hit a crawler trap and have captured unwanted content, delete the test crawl, adjust the scoping of the seed, and run a new test crawl (see the sketch below for one way to triage a suspicious URL list).
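
One way to triage a suspicious crawl is to scan an exported list of queued or crawled URLs for patterns that commonly indicate traps. A rough sketch; the file format (one URL per line), the patterns, and the threshold are all assumptions, not Archive-It features:

```python
import re
from collections import Counter

# Heuristic patterns that often signal crawler traps; tune these for your sites.
TRAP_PATTERNS = [
    re.compile(r"/calendar/"),            # endless calendar pagination
    re.compile(r"[?&](sessionid|sid)="),  # session IDs spawn infinite URL variants
    re.compile(r"(/[^/]+)\1{3,}"),        # the same path segment repeated 4+ times
]

def flag_possible_traps(url_list_path, min_hits=50):
    """Count URLs matching each trap pattern; report patterns over the threshold."""
    hits = Counter()
    with open(url_list_path, encoding="utf-8") as f:
        for url in f:
            for pattern in TRAP_PATTERNS:
                if pattern.search(url):
                    hits[pattern.pattern] += 1
    return {pat: n for pat, n in hits.items() if n >= min_hits}

# Assumes you have exported the crawl's URL list to a local text file.
print(flag_possible_traps("queued-urls.txt"))
```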

Schedule future crawls for active websites

  • New seeds should always be set to "One-time capture" and a test crawl should be run and checked in order to assure there are no surprises (such as unwanted content) before scheduling automatic crawls.
  • Future crawls can be manual (test) or automatic, depending on how successful the first capture was. For problematic, large, or complex websites, always do a manual (test) crawl and review it before saving.
  • The type of crawl (manual or automatic) is recorded in the spreadsheet
  • Follow the protocols below to schedule recurring captures
  • Record schedule in the spreadsheet
  • Continue to conduct periodic quality assurance on recurring crawls.

Scheduling automatic crawls

  • For one seed, you can set the frequency of crawls in the seed settings (see Protocols for crawl frequency below)
  • At the collection level:
    • Under collection, click on Crawls - Crawl Schedule
    • Edit limits (7 days max)
    • Schedule crawl - schedule a date and time in the future (note: if you click “Crawl Now”, you still have to schedule future crawls)
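
When recording schedules in the spreadsheet, a due date can be derived from the frequency label. A minimal sketch; the day counts approximate the frequency labels in the protocols below and are not Archive-It's internal scheduler:

```python
from datetime import date, timedelta

# Approximate intervals for the frequency labels used on this page.
FREQUENCY_DAYS = {"weekly": 7, "quarterly": 91, "semi-annual": 182, "annual": 365}

def next_crawl_date(last_crawl: date, frequency: str) -> date:
    """Estimate when the next scheduled crawl is due."""
    return last_crawl + timedelta(days=FREQUENCY_DAYS[frequency])

print(next_crawl_date(date(2024, 1, 15), "quarterly"))  # 2024-04-15
```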

Making changes to seeds

  • If the URL changes for a website:
    • If the URL change is just from HTTP to HTTPS, edit the existing seed.
    • If the domain has changed, add a new seed to the collection; do not edit the existing seed URL. Use seed-level metadata in both seeds to clarify the relationship between the old and current URL for end users. You can also create a Group to link them. The old seed should be made inactive, and future crawls should point to the new URL.

Describe archived website in finding aid

See guidelines for Describing archived websites


Access protocols

By default, all archived content is publicly available on your Archive-It.org homepage to be browsed and searched by patrons. Archived content is available to view within 24 hours after a crawl has completed. Full-text search processing of archived content can take up to 7 days.

Ways to restrict content:

  • Private vs. public: in seed settings, uncheck the “Publicly Visible” box. The site will no longer appear on your Archive-It.org homepage. (Note: you can also change public/private settings for a whole collection.)
  • Restrict access in Wayback based on IP range (i.e., to users in the reading room only): patrons who try to access content outside of the IP range (for example, from a home computer) will see an Access Denied message, which can be customized for your organization. Notify the Archive-It team to implement this feature; the sketch after this list illustrates the underlying check.
  • Remove access to content and full-text search in Wayback.
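
The enforcement happens on Archive-It's side, but the underlying IP-range check is simple to illustrate. A sketch using Python's standard ipaddress module; the network below is a documentation placeholder (RFC 5737), not Smith's actual reading-room range:

```python
import ipaddress

# Placeholder reading-room network; substitute the real IP range on file
# with the Archive-It team.
READING_ROOM_NET = ipaddress.ip_network("192.0.2.0/24")

def in_reading_room(client_ip: str) -> bool:
    """True if the request originates inside the permitted IP range."""
    return ipaddress.ip_address(client_ip) in READING_ROOM_NET

print(in_reading_room("192.0.2.42"))   # True: content is served
print(in_reading_room("203.0.113.7"))  # False: patron sees Access Denied
```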

Protocols for crawl frequency

Smith College sites

  • Weekly: sites or pages that are updated regularly and content is high priority to preserve
    • example: official campus news sites and smith.edu home page
  • Quarterly: Sites where content changes each semester and summer, and previous content is likely to be deleted
    • example: LibGuides
  • Semi-annual: content changes significantly each semester and previous content may be deleted
    • example: Sophian; active social media sites
  • Annual: sites that tend to remain stable and content is not likely to be deleted
  • One-time capture:
    • New seeds or troublesome websites that require attention each time they are crawled
    • sites where content changes very infrequently and capture can be done manually, as needed (example: SC Omeka site). You can add a reminder to your calendar to check once a year for new content.
    • Inactive sites (not being updated at all)
    • Temporary sites (example: a one-time conference blog site)

SSC donor sites

  • Annual: most active sites are crawled annually, except as noted below
  • Semi-annual:
    • very large and active sites (examples: PPFA, YWCA, AAUW)
    • sites where lots of video is added regularly
    • active social media sites
    • Also, if a donor tells us that content is periodically removed, we may schedule more frequent captures.
  • One-time capture:
    • New seeds and troublesome websites that require attention each time they are crawled
    • Inactive sites
    • Temporary sites (example: conference site)

Additional resources: