archive.org ↔ Open Library synchronisation - internetarchive/openlibrary GitHub Wiki

In order for Open Library users to access readable and borrowable books from archive.org, the respective data records need to be correctly synchronised.

This page documents the specific and technical requirements, and lists potential challenges to keeping the records synchronised.

Significance

As of 2019-06-24, 60,000 books were in our inlibrary lending program but not available through OpenLibrary.org. That means about 6% of our entire catalog is not borrowable through Open Library. Another 281k books (excluding inlibrary) in printdisabled have ISBNs and MARC records but are not on OpenLibrary.org. Without linking archive.org and Open Library, we also lose the ability to reliably determine the availability of our works. We have chosen to work on this project now because, over the past months, we have eliminated hundreds of thousands of orphaned editions, which was a prerequisite. A successful IA ↔ OL sync also means a revitalized import process that is more effective at importing Internet Archive works going forward.

Existing fields

archive.org items

  • openlibrary_edition, format example: OL12345M, creates a link from the item's details page to the exact Open Library edition represented by this scan.
  • openlibrary_work, format example: OL12345W, creates a link from the item's details page to the Open Library work that groups other editions of this scan.
  • openlibrary, format example: OL5189756M, a now DEPRECATED reference to an Open Library edition. Potentially used in the archive.org scanning process to locate MARC records, and in Open Library import code as a shortcut for matching existing records. Both uses need to be investigated and updated to use the newer fields above.
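
The field formats above can be checked mechanically. The following is a minimal sketch, assuming item metadata is available as a plain dict keyed by field name; the regular expressions are inferred only from the format examples on this page (OL12345M, OL12345W), and the function name `check_olid_fields` is hypothetical.

```python
import re

# Patterns inferred from the format examples on this page (an assumption,
# not an official specification of Open Library identifiers).
EDITION_RE = re.compile(r"^OL\d+M$")
WORK_RE = re.compile(r"^OL\d+W$")


def check_olid_fields(metadata: dict) -> list:
    """Return a list of problems found in an item's Open Library link fields."""
    problems = []
    edition = metadata.get("openlibrary_edition")
    work = metadata.get("openlibrary_work")
    legacy = metadata.get("openlibrary")

    if edition is not None and not EDITION_RE.match(edition):
        problems.append(f"malformed openlibrary_edition: {edition!r}")
    if work is not None and not WORK_RE.match(work):
        problems.append(f"malformed openlibrary_work: {work!r}")
    # The deprecated field alone is not enough; the newer field should exist too.
    if legacy is not None and edition is None:
        problems.append("deprecated openlibrary field set without openlibrary_edition")
    return problems
```

An item carrying only the deprecated `openlibrary` field is reported, which mirrors the deprecation requirement described later on this page.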

Open Library Edition level metadata

  • ocaid, format example: callofdistantmam00ward. This is the primary link back to an archive.org item. It stores only one value, which is a problem when multiple scans of an edition exist on archive.org: only one is linked from OL to IA, even though multiple IA items may refer to the same edition. The current OL sync process only automatically updates the archive.org item present in this ocaid field.
  • source_records, format example: ["ia:callofdistantmam00ward", ...]
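
To illustrate the single-value ocaid limitation, here is a small sketch that finds scans pointing at an edition whose ocaid names a different archive.org item. It assumes IA items are dicts with `identifier` and `openlibrary_edition` keys and that edition ocaid values are supplied as a mapping; both data shapes and both function names are hypothetical.

```python
from collections import defaultdict


def scans_per_edition(ia_items):
    """Map each Open Library edition ID to the archive.org identifiers pointing at it."""
    by_edition = defaultdict(list)
    for item in ia_items:
        olid = item.get("openlibrary_edition")
        if olid:
            by_edition[olid].append(item["identifier"])
    return dict(by_edition)


def unlinked_scans(ia_items, ocaid_by_edition):
    """List (edition, identifier) pairs where the edition's ocaid names another scan."""
    result = []
    for olid, identifiers in scans_per_edition(ia_items).items():
        linked = ocaid_by_edition.get(olid)
        result.extend((olid, ident) for ident in identifiers if ident != linked)
    return result


# Illustrative data: the second identifier is made up for this example.
items = [
    {"identifier": "callofdistantmam00ward", "openlibrary_edition": "OL12345M"},
    {"identifier": "hypothetical_second_scan", "openlibrary_edition": "OL12345M"},
]
extra = unlinked_scans(items, {"OL12345M": "callofdistantmam00ward"})
```

Scans reported this way are the ones the current sync process would not update automatically, since they are absent from the ocaid field.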

Other less common IA-related fields, possibly to be deprecated:

  • "ia_box_id": ["IA113601"]
  • "ia_loaded_id": ["callofdistantmam00ward"]

Note: All fake-subject references to archive.org categories that may once have been used to classify borrowable status are now deprecated. Examples: In Library, Protected DAISY, Accessible_book, Internet Archive Wishlist, Lending library, and possibly others. Issue #2107 tracks this cleanup.

See Open Library Client JSON schemata for the currently recognised and useful metadata fields for Open Library records.

Technical requirements

PRIORITY: Borrowable books should be synchronised properly to enable discovery and utilisation

  • All borrowable collection:inlibrary books should have openlibrary_edition:
    • CRITERION NOT MET: collection:inlibrary AND NOT openlibrary_edition:*
    • As of June 2019 there are 60K archive.org items that do not meet this condition.
    • ‼️ SYNCH TASK RESULTS:
      • Matched: 42016 (these are the duplicate archive.org items)
      • Modified: 659 (these were resolved and synchronised)
      • bad-repub-state: 2
      • invalid-ia-identifier: 7
      • invalid-marc-record: 18
      • no-marc-record: 6728
      • item-is-serial: 4480
      • item-not-book: 1776
      • no-imagecount: 25
      • noindex-true: 0
      • not-texts-item: 0
      • Of the 42016 matched ids, 39776 had OLIDs written back, completing the synchronisation.
      • Of those, 11695 Open Library items were updated to reference the lendable copy of that item where previously a printdisabled only copy was linked.
      • This has added 11695 lendable items to OL.
      • Only 24 items could not be synched due to orphaned editions. (see below for more details on this category)
      • @ 3 July, there are now 19,823 unsynched items in this category (improvement: ~40k)
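
As a quick sanity check on the SYNCH TASK RESULTS above, the outcome counts can be totalled in a few lines. This sketch only re-adds the numbers reported on this page; the variable names are illustrative.

```python
# Outcome counts copied from the SYNCH TASK RESULTS list on this page.
results = {
    "matched": 42016,
    "modified": 659,
    "bad-repub-state": 2,
    "invalid-ia-identifier": 7,
    "invalid-marc-record": 18,
    "no-marc-record": 6728,
    "item-is-serial": 4480,
    "item-not-book": 1776,
    "no-imagecount": 25,
    "noindex-true": 0,
    "not-texts-item": 0,
}

processed = sum(results.values())
# Everything that was neither matched nor modified was skipped for a stated reason.
skipped = processed - results["matched"] - results["modified"]

written_back = 39776  # matched ids that had OLIDs written back
not_written_back = results["matched"] - written_back

print(f"processed={processed} skipped={skipped} not_written_back={not_written_back}")
print(f"share of matched items written back: {written_back / results['matched']:.1%}")
```

The processed total is lower than the rounded 60K figure quoted for this category; the page does not say how the difference is accounted for.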

PRIORITY: Items with print-disabled digital copies should be correctly synchronised to enable discovery for those who need them

  • archive.org print disabled collection items representing books, which are not necessarily borrowable by users without print disabilities, should have entries on Open Library to capture the existence of a book we know about, and aid discovery by print disabled users. The following query uses the presence of an ISBN as an indicator that an item is a book with sufficient metadata to count as good for importing.

    • CRITERIA NOT MET: collection:printdisabled AND NOT collection:inlibrary AND NOT openlibrary_edition:* AND isbn:*
      • Note: the number of items returned by this query depends on user account privileges; not all users will see all print-disabled-only items by default on archive.org. @ June 2019, there are 330K items in the maximal list that are not linked to Open Library.
      • Existing issue #1047
      • ‼️ SYNCH TASK RESULTS: @ 17 July (after running an import/re-import task for the MARC records) there are now 184,445 print-disabled only items without corresponding Open Library links (improvement: ~150K)
  • The following query attempts to locate items that are printdisabled only, do NOT have ISBNs in metadata, but are good scanned books:
    • QUERY: collection:printdisabled AND NOT collection:inlibrary AND NOT openlibrary_edition:* AND NOT isbn:* AND collection:internetarchivebooks
    • There are 13,708 results, but most appear to have incomplete titles, strangely with ISBNs in the title field. It looks like these have stalled in the scanning process somehow?
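
The three "not met" queries in the two PRIORITY sections above can be restated as a single predicate over item metadata. This is a sketch under assumptions: items are plain dicts, `collection` is a list of collection names, and the returned labels are invented for illustration.

```python
def unmet_criterion(item: dict):
    """Return a label for the sync criterion an item fails, or None if none applies."""
    collections = set(item.get("collection", []))

    if item.get("openlibrary_edition"):
        return None  # already linked to an Open Library edition

    # collection:inlibrary AND NOT openlibrary_edition:*
    if "inlibrary" in collections:
        return "inlibrary-without-edition"

    if "printdisabled" in collections:
        # ... AND NOT collection:inlibrary AND isbn:*
        if item.get("isbn"):
            return "printdisabled-with-isbn"
        # ... AND NOT isbn:* AND collection:internetarchivebooks
        if "internetarchivebooks" in collections:
            return "printdisabled-no-isbn-iabooks"

    return None
```

Checking inlibrary first means the print-disabled branches only see items that are not lendable, matching the `AND NOT collection:inlibrary` clause in those queries.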

Deprecate openlibrary in favour of openlibrary_edition + _work

  • All archive.org book items with a populated openlibrary metadata field should also have openlibrary_edition.

    • CRITERION NOT MET QUERY: mediatype:texts AND openlibrary:* AND NOT openlibrary_edition:*

    • Possible issue: an item having an old openlibrary field does not necessarily mean it should be on OL if it doesn't meet the other criteria listed on this page.

    • Total @ June 2019: 25,190

    • Breaking this category down further into other categories listed further down:

      • In Library: mediatype:texts AND openlibrary:* AND NOT openlibrary_edition:* AND collection:inlibrary

      • Open collection: mediatype:texts AND openlibrary:* AND NOT openlibrary_edition:* AND NOT collection:printdisabled AND NOT collection:inlibrary

        • 6,801
        • Following an example, we have two IA scans, https://archive.org/details/annalenderphysi249unkngoog and https://archive.org/details/bub_gb_xAQ4AAAAMAAJ, an 1833 and an 1803 edition, which both point (using different fields) to a 1900 Open Library edition. This is further complicated because the item is a serial.
        • This category needs to be filtered down to include only open items in known good collections. Filtering these down to items in the americana and internetarchivebooks collections shows that most of those remaining are early serials that get excluded by the latest import criteria.
      • Printdisabled only with ISBN: mediatype:texts AND openlibrary:* AND NOT openlibrary_edition:* AND NOT collection:inlibrary AND collection:printdisabled AND isbn:*

        • 3,124
      • Printdisabled only without ISBN: mediatype:texts AND openlibrary:* AND NOT openlibrary_edition:* AND NOT collection:inlibrary AND collection:printdisabled AND NOT isbn:*

        • 387 items
        • The collections that seem to signify that these items without ISBNs are good books are internetarchivebooks and americana. The criteria for non-lendable books below should be updated to include these pre-ISBN books, which are likely to have good metadata.
  • All archive.org book items with openlibrary_edition MUST have openlibrary_work, and vice versa.

    • CRITERION NOT MET: mediatype:texts AND openlibrary_edition:* AND NOT openlibrary_work:*
    • CRITERION NOT MET: mediatype:texts AND openlibrary_work:* AND NOT openlibrary_edition:*
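
The edition/work pairing rule can be expressed as a small check that names the missing field. This is a sketch assuming item metadata is a plain dict; the function name `missing_counterpart` is hypothetical.

```python
def missing_counterpart(item: dict):
    """For a texts item, return the name of the missing link field, or None."""
    if item.get("mediatype") != "texts":
        return None  # the pairing rule on this page is stated for mediatype:texts

    has_edition = bool(item.get("openlibrary_edition"))
    has_work = bool(item.get("openlibrary_work"))

    if has_edition and not has_work:
        return "openlibrary_work"
    if has_work and not has_edition:
        return "openlibrary_edition"
    return None  # both present, or neither present
```

Items with neither field return None here because they fall under the other criteria on this page rather than the pairing rule.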

Only import and synchronise books (IA mediatype:texts)

  • Open Library identifiers on archive.org should only be on mediatype:texts items, as only books should be represented on Open Library.
    • CRITERION NOT MET: (openlibrary:* OR openlibrary_edition:* OR openlibrary_work:*) AND NOT mediatype:texts
    • Some items not meeting this technical criterion may be legitimate. Some items appear to be gallery catalogs (i.e. books) that are linked to mediatype:image, and other archive.org items could be legitimate books that are mis-categorised. @ June 2019 there are 55 items matched above. Each item needs to be examined to find the fix, or at least to come up with a set of fix categories. Simply deleting the linking metadata would be incorrect in many of these situations, as the links are probably a sign of further data issues on OL or IA.

Orphaned items with ocaid

This problem prevents items from being synchronised using the existing mechanisms, and it affects the requirement to include both a work ID and an edition ID. The solution is to resolve and add works for all editions which don't have them. This overall effort is being tracked on another wiki page, but the following notes relate to orphans with OCAIDs.

Remaining total @ 26 June 2019: 38347. NONE are duplicated.

After re-running the re-import process:

  • 20730 were successfully matched or had works created, fixing the orphan (54%).
  • 316 were matched on a different existing edition. !!FIX for these: get the orphan by opening https://openlibrary.org/books/ia: and then associate it with the matched work.

The remaining 17301 were not resolved due to the following issues:

Proposal: the no-imagecount and noindex-true OL orphans should simply be deleted. They tend to have been created from problematic archive.org records that we would not currently import, and the main reason for the no-index flag appears to be mismatched metadata, or otherwise bad scans. The records I have checked all seem to have better non-broken scanned items elsewhere, and have been imported properly via those.
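
The orphan counts reported in this section are internally consistent, which the following short sketch verifies. It uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the "Orphaned items with ocaid" notes on this page.
total_orphans = 38347          # remaining total @ 26 June 2019
fixed = 20730                  # matched or had works created
matched_other_edition = 316    # matched on a different existing edition

# Whatever is left after the two resolved groups is the unresolved remainder.
unresolved = total_orphans - fixed - matched_other_edition
fixed_share = round(100 * fixed / total_orphans)

print(f"unresolved orphans: {unresolved}")
print(f"fixed by re-import: {fixed_share}%")
```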
