Internet Archive Documentation - mgenuardi/TAEP_documentation GitHub Wiki
I. Introduction to Thomas A. Edison Papers Internet Archive Collections
A. Why Internet Archive?
The Thomas A. Edison Papers (TAEP) began using Internet Archive (IA) in Fall 2019 to make its Microfilm Edition permanently accessible and to provide a long-term repository for documents. Two further goals are:
- To provide an additional point of access for users searching for this material.
- To preserve and present the microfilm and other documents in a way that allows for engagement and discovery.
B. Logistics - Account and Login Info
Account Email: [email protected]
Password: taeptaep2026
updated 20250912 (outdated: taeptaep2024)
Internet Archive screenname: TAEP
- This is the name that displays on the document and collection pages attributing the uploader. Easily changeable in account settings and does not affect email.
Internet Archive provided information about usernames, email, and account settings: https://help.archive.org/help/accounts-a-basic-guide/
C. Notes and Terminology
- "Internet Archive," "IA," and "archive.org" may be used interchangeably when referring to the digital repository.
- Because we are discussing historical documentation that has been microfilmed, digitized, and turned into digital objects, note that whatever is uploaded to Internet Archive may be called a "digital object," "document," or "item" interchangeably. A document or digital object can be one single document (such as a Notebook or Motion Picture Catalog) or a PDF representing an entire collection (as with the OREP material).
- Note that this document explains the manual process of uploading documents to Internet Archive. We have successfully done some bulk uploading and metadata editing through automated scripts and processes. The fundamentals of preparing the collection and assigning metadata still apply in both cases, but the processes will differ.
D. Top Level Takeaways
- Prepare before creating the collections:
- Decision-making in PDF creation (if relevant)
- File-naming conventions
- Metadata Profile and Description - use what has already been created? Create new? What aligns with the requirements of Internet Archive digital collections?
- URL can only be used once, as it is the main unique identifier
- While the URL cannot be changed or re-used, much of the descriptive metadata and images/document PDFs can be edited or updated after upload
- Sensitive Content Note
- Some of our collections contain documents with outdated views and/or otherwise harmful content. When overtly relevant, we have added notes on individual documents to guide users back to the collection description where we have a statement and hope to contextualize the documents within the historical importance and research value of the collection.
- However, this work is ongoing and evolving. It's important to be aware that some of the documents we add to Internet Archive contain sensitive and/or problematic material that we have not yet identified, and we strive to deal with these in the most responsible way.
II. Instructions and Guidelines for Managing Collections on Internet Archive
A. Preparing Documents/Digital Objects
Preparing the documentation for upload to Internet Archive is dependent upon the state of the collection. It is helpful to have an idea of the context and extent of the collection. Early process discussions include:
- State of the digital object(s): In some cases, we have to create the digital object(s) before uploading, while in other cases the digital objects could be generally ready for upload.
- Notebook Collection example: The notebooks are a part of the microfilm collection, but we wanted to isolate the individual notebooks as opposed to the documents being part of the entire reel. Therefore, we extracted each individual notebook as a PDF from the PDF of the whole reel.
- Motion Picture Catalog example: The PDFs had been previously scanned and prepared on RUCore, but we realized that most were scanned incorrectly. Therefore, before upload, the documents still required heavy quality checks against the legacy edition and PDF editing.
- Outside Repository (OREP) example: These PDFs did not yet exist, so we created them by identifying the filename(s) of the corresponding JPEG images and combining those images. We also wanted to add a descriptive introduction to each document composed of JPEG images. After multiple tests and examples, our computer science specialist created a script that combines the introductory text (Word documents) with the JPEGs while keeping the frame size and quality of both adequate.
- State of the metadata: Are we creating a new metadata template or taking from what has already been written and expressed? Some considerations:
- Document filename: archive.org automatically fills in the filename of the uploaded document as the archive.org URL (this can be edited during upload if desired). For organizational purposes, it also helps to have the document filename match the link.
- Descriptive text: For the document description, options include creating a new text to introduce each document (Microfilm Collection), copy and pasting straight from the legacy description (Notebook Collection), or creating a new description based on a spreadsheet export of existing metadata (Patents and Motion Picture Catalogs).
- Internet Archive stipulations: Knowing that we want to create collections and not just upload individual documents, look ahead to the collection requirements put forth by Internet Archive:
Internet Archive provided guidelines: “Creating Collections - A Basic Guide” https://help.archive.org/help/collections-a-basic-guide/.
Internal document - the requirements and our TAEP decisions are tracked in this google sheet: https://docs.google.com/spreadsheets/d/10cST73Q2t-3lsphsLLMMBc2XP05jI4cE3wjz7_N4HDk/edit#gid=0
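The "description based on a spreadsheet export" option mentioned above can be sketched roughly as follows. This is a minimal illustration, not the formula TAEP actually used: the column names (`gloc`, `title`, `date`, `rights`) and the sample row are hypothetical placeholders, and the real export will have its own columns.

```python
import csv
import io

# Made-up sample export; column names and values are placeholders only.
SAMPLE_EXPORT = """gloc,title,date,rights
X001,Example Catalog X001,1900,Example rights statement
"""


def build_description(row: dict) -> str:
    """Assemble one description string from a single spreadsheet row."""
    parts = [f"{row['title']}."]
    if row.get("date"):
        parts.append(f"Date: {row['date']}.")
    if row.get("rights"):
        parts.append(f"Rights: {row['rights']}.")
    return " ".join(parts)


def descriptions_from_csv(text: str) -> dict:
    """Map each row's gloc value to its generated description."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["gloc"]: build_description(row) for row in reader}


if __name__ == "__main__":
    for gloc, description in descriptions_from_csv(SAMPLE_EXPORT).items():
        print(gloc, "->", description)
```

Generating the text from one function keeps the wording consistent across every document in a collection, which matters because the description is searchable on Internet Archive.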
B. Uploading Documents
Internet Archive requires 50 items to be uploaded in order to create a collection. Therefore, digital objects must be uploaded individually first. The step-by-step instructions provided here assume a manual upload of one document at a time as opposed to bulk upload using an automated process. The instructions for that process are different, but the metadata profile (below) is still relevant.
- Click UPLOAD in the top right corner of the screen. On the next screen, click the green “Upload Files” button.
- Next, choose the file to be uploaded. Files can be dragged and dropped into the gray square, or you can click the blue button to search for files.
- Adding the file will automatically bring you to the upload screen to fill metadata and set preferences.
Right hand of the screen: File Management
The gray half of the screen shows the file(s) being added. For TAEP purposes, we assume only one file is being added, and the file format has traditionally been a pre-prepared PDF.
Left hand of the screen: Metadata and Options
The white box to the left of the uploading screen is where you populate metadata and edit upload preferences. Those fields starred in red are required (title, URL, description, subject tags, collection), and the others are optional.
C. Metadata Profile for Individual Digital Objects
A breakdown of the main metadata categories given during upload is below, first listed and then in tabular form. A screenshot of how the field will appear on the digital object page is also provided.
Internet Archive provided information about metadata fields: https://help.archive.org/help/uploading-a-basic-guide/
Page Title: Required: Yes Editable after upload: Yes
- The title of the document.
Page URL: Required: Yes Editable after upload: No
- This is the URL that will be used to access the individual document/digital object. Note that it is automatically filled with the name of the uploaded file. It is NOT editable after upload and can only be used once. If edits need to be made to the document, they must be made within this page by using the “edit” option (explained more below). If this URL is deleted, it cannot be used again.
Example: When uploading our first collection to Internet Archive, the Microfilm Collection, we decided that the filename (and therefore the URL identifier) for each PDF reel would be edisonmicrofilmREELno, e.g. edisonmicrofilm4 (https://archive.org/details/edisonmicrofilm4). In trying to edit the PDF, MG deleted the already uploaded edisonmicrofilm3 URL. As it cannot be retrieved, our current Reel 3 identifier is edisonmicrofilm003 (https://archive.org/details/edisonmicrofilm003). While this may not affect a user browsing or arriving at the material via other search methods, it would affect a user who is familiar with the identifier pattern and types in the URL for this reel, only to find that “edisonmicrofilm3” does not exist, and who would not necessarily know to try “edisonmicrofilm003.”
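The Reel 3 situation can be expressed as a small sketch. It assumes we keep our own list of retired (deleted) identifiers; Internet Archive does not supply such a list in this form, and the function names here are hypothetical.

```python
def reel_identifier(reel_number: int, width: int = 0) -> str:
    """Build an item identifier such as edisonmicrofilm4.

    A width greater than zero pads the number with leading zeros,
    so width=3 gives edisonmicrofilm003.
    """
    return "edisonmicrofilm" + str(reel_number).zfill(width)


def choose_identifier(reel_number: int, retired: set) -> str:
    """Use the plain pattern unless that identifier was deleted earlier."""
    candidate = reel_identifier(reel_number)
    if candidate in retired:
        # A deleted identifier cannot be reused, so fall back to padding.
        candidate = reel_identifier(reel_number, width=3)
    return candidate


if __name__ == "__main__":
    retired_identifiers = {"edisonmicrofilm3"}  # deleted and unrecoverable
    print(choose_identifier(3, retired_identifiers))
    print(choose_identifier(4, retired_identifiers))
```

Checking a planned identifier against a list like this before upload avoids discovering a retired URL halfway through a collection.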
Description: Required: Yes Editable after upload: Yes
Text description of the object. This is editable after upload. The text is searchable.
Subject Tags: Required: Yes Editable after upload: Yes
The metadata tags that make the object searchable across Internet Archive. Editable after upload. Note that “Subject Tags” in the upload page are expressed under the “Topics” field on the page of the digital object (see screenshot below).
Creator: Required: No Editable after upload: Yes
Creator of the document. We haven't traditionally used this field, but it shows up on the page below the Title, expressed as “By: CREATOR VALUE.”
Date: Required: No Editable after upload: Yes
Date of document. We haven’t traditionally used this field, although we have added dates to our text descriptions (example: Patents).
Collection: Required: Yes Editable after upload: Yes, but may require communication with archive.org.
- This field will automatically be filled as “Community Texts,” which uploads the document into the general pool of archive.org items.
- Archive.org will create the collection after upload, but the document will still exist in Community Texts or “additional_texts” unless altered.
- If you are uploading to a collection that has already been created, you should be able to find it through the dropdown/search here.
- The ability to change and edit the collection after upload has changed over the years that TAEP has been using Internet Archive (since 2019). As a general rule for the current state of this field, the collection can be changed by editing its value in the metadata profile, but some collection work may require communication with archive.org.
Test Item: Required: No Editable after upload: No
If this is marked as “Yes,” the uploaded item will only exist for 30 days and then be removed. This is helpful when we try different methods for new collections.
For example, when testing for OREP documents, we tried out various options such as uploading the document as individual jpegs and early iterations of our PDFs. I uploaded them with random URLs and marked them as “test items” in the title with the intention of deleting them later. However, I should have selected “Test Item” for this purpose, so that they would have been automatically removed from the community pool eventually.
Language: Required: No Editable after upload: Yes
We generally mark English.
License: Required: No Editable after upload: Yes
We generally add rights to our text descriptions, and do not use this field.
Metadata Profile - Table
_A table of the metadata profile - the information above is expanded, this is a quick visual in tabular form:_
Field Name | Required? | Editable after upload? | Description/Note |
---|---|---|---|
Page Title | Y | Y | The Title of the Document |
Page URL | Y | N | URL link that will be used to access the individual document/digital object. If this URL is deleted, it can not be used again. |
Description | Y | Y | Text description of the object. This is editable after upload. The text is searchable. |
Subject Tags | Y | Y | The metadata tags that make the object searchable across Internet Archive. |
Creator | N | Y | Creator of the document. |
Date | N | Y | Date of document. |
Collection | Y | Y | This will automatically be filled as “Community Texts,” which uploads the document into the general pool of archive.org items. Archive.org will create the collection after upload, but document will still exist in Community Texts or “additional_texts” unless altered. If you are uploading to a collection that has already been created, you should be able to find it through the dropdown/search here. |
Test Item | N | N | If this is marked as “Yes,” the uploaded item will only exist for 30 days and then be removed. This is helpful when we try different methods for new collections. |
Language | N | Y | We generally mark English. |
License | N | Y | We generally add rights to our text descriptions, and do not use this field. |
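For scripted uploads, the metadata profile above could be checked before anything is sent. The sketch below is an assumption-heavy illustration, not the script TAEP used: it maps the upload-form fields to the metadata keys `title`, `description`, `subject`, and `collection`, and builds (but does not run) a command for the `ia` command-line tool from the `internetarchive` Python package, using its `--metadata=key:value` option. The item identifier `edisonmpc-V171` is inferred from the identifier pattern and is illustrative only.

```python
# Fields starred as required on the upload form, using assumed metadata keys.
REQUIRED_FIELDS = ("title", "identifier", "description", "subject", "collection")


def missing_required(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


def ia_upload_command(pdf_path: str, metadata: dict) -> list:
    """Build an argument list for `ia upload` without executing it."""
    missing = missing_required(metadata)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    argv = ["ia", "upload", metadata["identifier"], pdf_path]
    for key, value in metadata.items():
        if key == "identifier":
            continue  # the identifier is a positional argument, not metadata
        values = value if isinstance(value, list) else [value]
        for item in values:
            # Repeating the flag for list values is an assumption to verify.
            argv.append(f"--metadata={key}:{item}")
    return argv


if __name__ == "__main__":
    example = {
        "identifier": "edisonmpc-V171",  # illustrative identifier
        "title": "Motion Picture Catalog V171",
        "description": "Placeholder description text.",
        "subject": ["EdisonMPC", "Thomas A. Edison Papers", "Motion Picture"],
        "collection": "edison-mpc",
        "language": "English",
    }
    print(" ".join(ia_upload_command("edisonmpc-V171.pdf", example)))
```

Validating before building the command matters here because the identifier cannot be changed or reused once an upload has gone through.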
Screenshot identifying how and where the metadata is expressed on the webpage of an individual digital object:
D. Creating a Collection and Collection Metadata Profile
While the user (TAEP) has autonomy in uploading individual digital objects, Internet Archive as an entity controls the creation of collections. As mentioned above, Internet Archive requires a minimum of 50 items/documents/digital objects in a collection. After 50 items are uploaded, you may email [email protected] with the required information (see link below) and they will create the collection.
Internet Archive provided information,“Creating Collections - A Basic Guide” https://help.archive.org/help/collections-a-basic-guide/
Points to include in email:
Subject: Collection Request
Hello, I'd like to request a collection:
- Title: Title here
- Description: Collection Description here
- Identifier: ID for collection here: This is the ID that will appear in the URL. For example, identifier "edison-mpc" = https://archive.org/details/edison-mpc
- Documents included: All of the uploaded documents that are to be within the collection. This can be a list of individual URL links, or it can be a query. For example, when we uploaded the motion picture catalogs, each digital object had a unique subject tag, "EdisonMPC." The collection request could then be one link, a query for subject tag EdisonMPC.
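The request points above can be assembled programmatically. This is a rough sketch: the `https://archive.org/search?query=...` link format is an assumption about the current archive.org search page and should be checked in a browser before sending, and the function names and placeholder description are hypothetical.

```python
from urllib.parse import urlencode


def subject_query_url(tag: str) -> str:
    """Build a search link for every item carrying the given subject tag."""
    return "https://archive.org/search?" + urlencode({"query": f'subject:"{tag}"'})


def collection_request_email(title: str, description: str,
                             identifier: str, tag: str) -> str:
    """Draft the body of a collection request email."""
    lines = [
        "Subject: Collection Request",
        "",
        "Hello, I'd like to request a collection:",
        f"- Title: {title}",
        f"- Description: {description}",
        f"- Identifier: {identifier}",
        f"- Documents included: {subject_query_url(tag)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(collection_request_email(
        title="Motion Picture Catalogs",           # placeholder title
        description="Placeholder collection description.",
        identifier="edison-mpc",
        tag="EdisonMPC",
    ))
```

Using one shared subject tag per collection is what makes a single query link sufficient in place of a long list of item URLs.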
After the collection is created, the user (TAEP) can continue adding and managing the documents. Note that while documents can be added after a collection is created, we have generally uploaded all desired documents and then requested the collection be made at once.
As with the individual documents, metadata can be edited or added to the collection.
E. Managing the Collection
After digital objects are uploaded and a collection is created, the content should remain stable and accessible.
- Note that while there is generally autonomy to upload and edit items and collections, there are instances in which communication with representatives at Internet Archive is necessary or may be helpful. The email is: [email protected]
- Once again, the process for managing the collection outlined here is for manually editing the items, not for bulk metadata editing or document upload. This piece of information can be updated, edited, and/or removed according to the future workflow of IA management.
The Uploaded Object: Below are screenshots that show an example of one completed and uploaded item. This uploaded digital object, Motion Picture Catalog V171, is part of the Motion Pictures Catalog Collection:
Screenshot of URL and Image Viewer:
Screenshot of metadata profile and information at bottom of page (below image viewer).
- Box 1: Actions to take as editor. Here, you will find the important “Edit” button if you wish to change or update the metadata or image of an individual item/digital object.
- Box 2: The metadata required/added upon initial upload, provided by TAEP (Title, Subject, Description, etc.).
- Box 3: Automatically populated metadata from archive.org (generally technical and administrative).
- Box 4: Automatically populated download options from archive.org.
- Box 5: The collections that the document is assigned to. This is the same as the “Collections” field in the metadata in Box 2.
- Box 6: Uploaded by TAEP (our screenname), with a link to all of the documents uploaded by TAEP.
Editing the metadata or the image: To edit or update either the metadata fields/values or the image, go to the page of the item you would like to edit and click “Edit” (in Box 1 in the screenshot above, zoomed in below).
It will bring you to this page:
To update the metadata, choose the box to the left, “I want to change the information (metadata) about my item.”
This will bring you to a page where you can manually change basic descriptive metadata, like the title or description. I have had success changing the collection on this page as well. Editable fields accept typed input, while automatically populated fields cannot be changed. The screenshot shows the basic fields at the top of the page, but the page continues further, and you can explore and add additional fields within this editor.
To update the files (images), choose the box to the right, “I want to change the files in my item.”
The file editor below will appear. I have used this when we noticed a mistake in our PDF and needed to edit and upload a new PDF. As we do not want to delete the whole URL and reupload with a new URL and re-do the metadata profile, you can delete the .pdf file from the editor here. Then, click the blue “Add a file” button, to add the updated PDF. It will take the usual time to process the image, but the document will be updated to the newer version.
There may be more uses for the file editor, but the above mentioned is the only instance I have had to use it for.
III. Edison Papers Internet Archive Collection Summary
A. Structure of Current Internet Archive Collections
The Thomas A. Edison Papers Internet Archive Collection
Collection identifier: edison-papers
Collection link: https://archive.org/details/edison-papers
The following table lists the sub collections of the Internet Archive Thomas A. Edison Papers Collection.
Subcollection Identifier | edison-microfilm | edison-notebook | edison-patents | edison-mpc | edison-OREP |
---|---|---|---|---|---|
status | complete | complete | complete | complete | in process |
URL | https://archive.org/details/edison-microfilm | https://archive.org/details/edison-notebook | https://archive.org/details/edison-patent | https://archive.org/details/edison-mpc | not yet created |
Item identifier | edisonmicrofilm# | taepnotebook-GLOC | edisonpatent# | edisonmpc-GLOC | edisonOREP-GLOC subject to change |
Document Title | Edison Microfilm Reel [reel number] | Edison Notebook [TAEP gloc] | Edison Patent [patent number] | Motion Picture Catalog [TAEP gloc] | to be decided |
Item description, general format | Standard introduction paragraph across all 288 reels and table of contents for individual reel | Legacy description (already created for each notebook) updated to edison digital | Formula created from metadata in excel spreadsheet. | Formula created from metadata in excel spreadsheet, plus note to see collection description for a note about sensitive content | to be decided |
Subject/Topic | EdisonMicrofilm; Thomas A. Edison Papers | EdisonNotebook; Thomas A. Edison Papers | EdisonPatent; Thomas A. Edison Papers | EdisonMPC; Thomas A. Edison Papers; Motion Picture | EdisonOREP; Thomas A. Edison Papers |
Document Count in Subcollection | 288 | 1,105 | 1,093 | 536 | not yet uploaded (roughly 100) |
Additional Notes or Comments | Note that there are 288 reels, but 289 documents appear in this collection - 1 is a letter uploaded by BB in 2017. | Note that a different document count between Omeka and IA has been observed and discussed. Omeka has 1,052 items, while IA has 1,105. This appears to be due to the 57 notebooks that are “additional” to the Notebook collections, i.e. those documents that exist as notebooks but within other subcollections in the series notes, and 4 other discrepancies that are on Omeka and not archive.org. | | Note that 6 of the 536 documents are the PDFs of the full microfilm reels that the MPCs come from. Therefore, there are actually 530 individual MPCs. | |
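The count differences noted in the table can be checked with a simple set comparison. The sketch below assumes two lists of identifiers are available, one exported from Omeka and one from Internet Archive; the sample identifiers are made up. The arithmetic at the end only confirms that the figures quoted in the notes are consistent with each other.

```python
def reconcile(ia_ids, omeka_ids) -> dict:
    """List the identifiers that appear on only one of the two platforms."""
    ia_set, omeka_set = set(ia_ids), set(omeka_ids)
    return {
        "only_on_ia": sorted(ia_set - omeka_set),
        "only_on_omeka": sorted(omeka_set - ia_set),
    }


if __name__ == "__main__":
    # Made-up identifiers for illustration only.
    ia_items = ["nb-001", "nb-002", "nb-003"]
    omeka_items = ["nb-002", "nb-003", "nb-004"]
    print(reconcile(ia_items, omeka_items))

    # Notebook counts: 1,105 on IA, minus 57 additional notebooks,
    # plus 4 items found only on Omeka, gives the 1,052 on Omeka.
    assert 1105 - 57 + 4 == 1052
    # Motion picture catalogs: 536 documents minus 6 full-reel PDFs.
    assert 536 - 6 == 530
```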
B. Mirroring Collections in Omeka-S
Collection | Omeka-S link | IA Collection link |
---|---|---|
Microfilm/Digital Edition | https://edisondigital.rutgers.edu/series-notes (the link directs to the Digital Edition Series Notes) | https://archive.org/details/edison-microfilm |
Notebook Collection Subsite | https://edisondigital.rutgers.edu/notebooks/home | https://archive.org/details/edison-notebook |
Motion Picture Catalog Subsite | https://edisondigital.rutgers.edu/motion-picture-catalogs/home | https://archive.org/details/edison-mpc |
Patent Subsite | https://edisondigital.rutgers.edu/patents/welcome | https://archive.org/details/edison-patent |
- Use of Internet Archive for IIIF viewer: The use of the IIIF manifest is one of the crucial connections between TAEP on Internet Archive and Omeka-S. When a document is uploaded to Internet Archive, a IIIF JSON manifest is automatically generated. We utilize this JSON link in the image viewer (Mirador) on the Omeka-S subsite.
IA/IIIF Links:
- IA Blog post: https://blog.archive.org/2023/09/18/making-iiif-official-at-the-internet-archive/
- IIIF Guide Page: https://iiif.io/guides/guides/archive.org/
- Metadata content should be the same, but may manifest differently
- While the values of the descriptive and administrative metadata are the same between Omeka-S and Internet Archive, they are not always expressed in the same way.
- Structure: Internet Archive dictates the fields they require and offer, and we utilize the fields in Omeka-S that best suit the needs of the documents specific to TAEP. Therefore, while the values are the same, there are instances in which the information is expressed differently.
- Internal Decisions: Example in the Patent Collection - The date used in the title for the descriptive text on Internet Archive is the date executed, whereas the date used in the title on Omeka-S is the Date Issued (Edison Patent 273493, https://archive.org/details/edisonpatent273493 vs. https://edisondigital.rutgers.edu/patents/document/PAT273493)
- Document counts: As noted in the table above, there are instances in which the document count differs between archive.org and Omeka-S, and those differences should be explainable. Otherwise, the document counts should match.
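For the IIIF connection described earlier in this section, the manifest link for an item can be derived from its identifier. The sketch below assumes the URL pattern `https://iiif.archive.org/iiif/<identifier>/manifest.json`; verify it against the IIIF guide page linked above before relying on it. The page-count helper is a hypothetical convenience that reads an already downloaded manifest and handles both the version 2 and version 3 layouts of the IIIF Presentation API.

```python
def iiif_manifest_url(identifier: str) -> str:
    """Build the assumed IIIF manifest link for an Internet Archive item."""
    return f"https://iiif.archive.org/iiif/{identifier}/manifest.json"


def canvas_count(manifest: dict) -> int:
    """Count the canvases (page images) in a parsed IIIF manifest."""
    if "items" in manifest:
        # Presentation API 3 lists canvases directly under "items".
        return len(manifest["items"])
    # Presentation API 2 nests canvases under the first sequence.
    sequences = manifest.get("sequences", [])
    return len(sequences[0].get("canvases", [])) if sequences else 0


if __name__ == "__main__":
    print(iiif_manifest_url("edisonpatent273493"))
    # Toy manifest standing in for a downloaded JSON document.
    print(canvas_count({"items": [{"type": "Canvas"}, {"type": "Canvas"}]}))
```

The resulting link is the value that would be pasted into the Mirador viewer configuration on the Omeka-S subsite.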