Zotero Harvester Architecture - ubtue/ub_tools GitHub Wiki

Architecture

Zotero Harvester Flowchart

The Zotero Harvester architecture is broadly divided into four modules/components:

  • Config - This module handles all the configuration-related processes used by the harvester. It can be thought of as an in-memory wrapper around the zotero_harvester.conf config file. It additionally contains classes to handle external data, i.e., zotero-enhancement-maps.
  • Download - This module handles the fetching of metadata from remote servers. It contains logic for parallel processing of download requests, rate-limiting and communication with the Zotero translation server.
  • Conversion - This module is responsible for converting the metadata extracted by the Zotero translation server into valid MARC-21 records. During the conversion process, additional filtering and augmentation of the metadata are performed.
  • Util - This module contains the basic building blocks of logic that are used to build the aforementioned modules.

Modules

Config

The primary classes in this module are GlobalParams, GroupParams, JournalParams, and EnhancementMaps. The first three classes map directly to different sections in the zotero_harvester.conf file: GlobalParams maps the entries in the unnamed/global section of the config file, GroupParams corresponds to named sections that qualify as "groups", and JournalParams corresponds to all other named sections that aren't "groups". The member variables of these classes correspond directly to individual INI entries in their respective sections. An exhaustive documentation of the INI keys can be found in the zotero_harvester.conf file itself.
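
To illustrate this mapping, the sketch below splits a made-up config snippet into "group" and "journal" sections. This is an illustrative Python sketch, not the harvester's actual C++ code; the section names and INI keys are invented for the example, and the unnamed/global section (GlobalParams) is omitted because configparser cannot represent it directly.

```python
import configparser

# Invented example config; real keys are documented in zotero_harvester.conf.
CONFIG_TEXT = """
[IxTheo]
user_agent = ub_tools/zotero_harvester

[Some Journal]
zeder_id = 42
group = IxTheo
"""

def load_config(text, group_names):
    """Split named INI sections into group-like and journal-like entries."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    groups, journals = {}, {}
    for section in parser.sections():
        entries = dict(parser.items(section))
        if section in group_names:
            groups[section] = entries      # would become a GroupParams
        else:
            journals[section] = entries    # would become a JournalParams
    return groups, journals

groups, journals = load_config(CONFIG_TEXT, group_names={"IxTheo"})
```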

EnhancementMaps is a container class that wraps mapping tables which map ISSNs to arbitrary values. These tables are part of the zotero-enhancement-maps repository. Currently, only a single mapping table is supported, viz., one that maps ISSNs to license indicators. In the future, this class will be deprecated and removed entirely once the license indicators are stored directly in the config file.

These classes are meant to be read-only, i.e., they are instantiated once at start-up when the config file and the enhancement maps are read-in. This is an important invariant as they are immutably shared between threads.

Download

The Zotero harvester supports three different types of downloads when it comes to fetching metadata from websites: Direct, RSS, and Crawl.

  • Direct - Divided into two modes: Translation Server, and Direct Query. The former sends a remote URL to the Zotero translation server and expects a JSON response containing the resource's metadata in return. The latter attempts to directly download the resource pointed to by the remote URL.

    Direct downloading (referring to the encapsulating download process, not the Direct Query mode) is the fundamental building block of the download module. In fact, the remaining two download types are built as abstractions on top of it. While exposed in the API, this download process is only meant to be used internally, i.e., journals are restricted to RSS and Crawl. Results of direct downloads are cached upon successful execution for the rest of the session.

  • RSS - This is the primary download process used by journals. The resource pointed to by the remote URL is interpreted as an RSS feed. Individual items in the feed are then downloaded (using the Direct download process).

  • Crawl - This is the secondary download process used by journals that have no RSS feed. The web page pointed to by the remote URL is downloaded and outgoing URLs satisfying a particular criterion are recursively crawled until a certain depth is reached. A second criterion is used to determine which URLs are meant to be harvested for their metadata.
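
The Crawl type described above can be pictured as a bounded breadth-first traversal: outgoing links matching a crawl criterion are followed up to a maximum depth, while a second criterion selects the URLs to harvest (each of which would then go through the Direct process). The following is a minimal Python sketch, not the harvester's actual C++ implementation; the link graph and both regex criteria are made up for the example.

```python
import re
from collections import deque

# Invented link graph standing in for real web pages.
LINK_GRAPH = {
    "https://example.org/toc": ["https://example.org/toc/2",
                                "https://example.org/article/1"],
    "https://example.org/toc/2": ["https://example.org/article/2"],
}

def crawl(seed, crawl_pattern, harvest_pattern, max_depth):
    """Follow matching links up to max_depth; collect harvestable URLs."""
    harvested, seen = [], {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if re.search(harvest_pattern, url):
            harvested.append(url)       # would be fetched via Direct download
        if depth >= max_depth:
            continue
        for link in LINK_GRAPH.get(url, []):
            if link in seen:
                continue
            if re.search(crawl_pattern, link) or re.search(harvest_pattern, link):
                seen.add(link)
                queue.append((link, depth + 1))
    return harvested

result = crawl("https://example.org/toc", r"/toc", r"/article/", max_depth=2)
```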

The primary class in this module is DownloadManager. It is responsible for queuing and executing download operations, as well as for rate-limiting downloads. The latter is done to prevent remote hosts from identifying the harvester as an (unwanted) web scraper.

Rate limiting is implemented at the domain level. Queued download operations are separated into individual queues based on the domain of their remote URLs. Each domain queue has an associated DelayParams parameter that describes the timing information required to perform the rate-limiting. DelayParams instances are initialized with the default delay parameters found in the robots.txt file of each domain. If a domain does not provide the necessary parameters in its robots.txt file, the final delay parameters are calculated from the settings found in the harvester config file.
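
The robots.txt fallback described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the harvester's real DelayParams logic: the function and parameter names are invented, and it only considers the nonstandard but widely used Crawl-delay directive.

```python
def resolve_delay_ms(robots_txt, config_default_ms):
    """Return a per-domain delay in ms: robots.txt Crawl-delay if present,
    otherwise the default taken from the harvester config file."""
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "crawl-delay":
            try:
                return int(float(value.strip()) * 1000)  # seconds -> ms
            except ValueError:
                break  # malformed value: fall back to the config default
    return config_default_ms

with_delay = resolve_delay_ms("User-agent: *\nCrawl-delay: 2", 500)
without_delay = resolve_delay_ms("User-agent: *\nDisallow: /private", 500)
```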

DownloadManager has a two-tiered queue system. Incoming download requests are added to a tier-1 queue that corresponds to the request's download operation. Tier-1 queues act as a staging area for the tier-2 queues; this minimizes contention in the highly multi-threaded harvesting scenario. A background worker thread continuously polls the tier-1 queues at regular intervals and moves their contents to the tier-2 queues. Tier-2 queues are separated by domain and download operation, i.e., each domain gets its own queues (one for each download operation). Tier-2 queues are exclusively accessible to the background worker thread.

After the tier-2 queues are updated, the rate-limiter determines if new download tasks can be queued. This depends on two factors: the total number of active download tasks and whether there are any download tasks active for a given domain. The rate-limiter ensures that no more than one active download task is executing for any given domain at any point in time. If a domain has an active download task, the rate-limiter waits until it has run to completion before starting the next task in the domain's queue. Similarly, if the total number of active download tasks (across all domains) reaches a specific threshold, no new tasks are queued until some of the former run to completion. The threading model used by DownloadManager has individual allocations for each download operation, i.e., Direct download operations have their own "pool", RSS their own, etc.
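
The two-tiered queuing and the rate-limiter's two constraints (at most one active task per domain, plus a global cap on active tasks) can be sketched as below. This is a simplified, single-threaded Python simulation, not the real C++ DownloadManager: it collapses the per-operation queues into one, and the "background worker" is just a method call.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class MiniDownloadManager:
    """Single-threaded sketch of the two-tier queue + rate limiter."""

    def __init__(self, max_active_total):
        self.tier1 = deque()                 # staging area for new requests
        self.tier2 = defaultdict(deque)      # per-domain queues
        self.active_domains = set()          # domains with a running task
        self.max_active_total = max_active_total

    def enqueue(self, url):
        self.tier1.append(url)               # callers only touch tier 1

    def drain_tier1(self):                   # the background worker's job
        while self.tier1:
            url = self.tier1.popleft()
            self.tier2[urlparse(url).netloc].append(url)

    def schedule(self):
        """Start tasks: one per domain at most, bounded by the global cap."""
        started = []
        for domain, queue in self.tier2.items():
            if len(self.active_domains) >= self.max_active_total:
                break                        # global threshold reached
            if queue and domain not in self.active_domains:
                started.append(queue.popleft())
                self.active_domains.add(domain)
        return started

mgr = MiniDownloadManager(max_active_total=2)
for url in ["https://a.test/1", "https://a.test/2",
            "https://b.test/1", "https://c.test/1"]:
    mgr.enqueue(url)
mgr.drain_tier1()
first_batch = mgr.schedule()
```

Note how "https://a.test/2" stays queued: its domain already has an active task, and "https://c.test/1" must wait for the global cap of two.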

Active download tasks run asynchronously. Upon completion, DownloadManager automatically cleans up any resources they used.

Conversion

Once the Zotero translation server has extracted metadata from a website, it returns it as a JSON object (the result of a download operation). This JSON object is preprocessed and converted into a MetadataRecord object, which represents a format-agnostic metadata record. The converted object is then validated and augmented with extra information. During the validation stage, multiple checks are performed to determine whether the metadata record should be excluded from conversion. Records that pass the exclusion check are then converted into MARC-21 records.
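
The stages above (JSON → format-agnostic record → exclusion check → MARC-21) can be sketched as a small pipeline. This is an illustrative Python sketch, not the harvester's C++ conversion code: the field mapping, the exclusion rule, and the dict-based MARC representation are all made up for the example.

```python
def to_metadata_record(zotero_json):
    """Preprocess translation-server JSON into a format-agnostic record."""
    return {"title": zotero_json.get("title", "").strip(),
            "issn": zotero_json.get("ISSN", "")}

def passes_exclusion_check(record):
    """Invented exclusion rule: drop records without a title."""
    return bool(record["title"])

def to_marc(record):
    """Map the format-agnostic record onto (simplified) MARC-21 fields."""
    return {"245": {"a": record["title"]},   # title statement
            "022": {"a": record["issn"]}}    # ISSN

raw_results = [{"title": " Some Article ", "ISSN": "1234-5678"},
               {"ISSN": "0000-0000"}]        # no title -> excluded

marc_records = [to_marc(record)
                for record in map(to_metadata_record, raw_results)
                if passes_exclusion_check(record)]
```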

The ConversionManager class follows the same queuing and threading model as the DownloadManager class, applied to conversion tasks instead of download operations.

Util

Refer to the source code for a detailed break-down of what each class in the module does.

Zeder Interoperability

Information on how the harvester interacts with Zeder can be found here.

Source Files

Delivery Pipeline

The Zotero Harvester delivery pipeline is set up to automate the harvesting, validation, and delivery of records to the BSZ servers. It is driven by the zts_harvester_delivery_pipeline.sh script, which runs daily at 20:00 as a cron job. The pipeline stages are listed below in order:

| Stage | Tool | Comment |
| --- | --- | --- |
| Harvest URLs | zotero_harvester | Harvests journals for testing or production. |
| Validate Generated Records | validate_harvested_records | Checks if the generated records contain all the expected MARC fields. |
| Upload to BSZ Server | upload_to_bsz_ftp_server | Uploads the validated MARC-XML files to their destination directory on the BSZ server. |
| Archive Sent Records | archive_sent_records | Saves all uploaded records to the host's delivery database for archival. |
| Check for Overdue Articles | journal_timeliness_checker | Checks the host's delivery database to find journals for which no reasonably new articles were delivered. Only used in production mode. |

The mode of delivery depends on the server hosting the pipeline: On Nu, all journals earmarked for testing are harvested and delivered. On ub28, all journals earmarked for production are harvested and delivered.

Validation

Once the Zotero harvester has generated MARC-21 records for all harvested items, they are validated for errors by the validate_harvested_records tool.

Internal (hard-coded) Standard Validation (applied to all records)

For every record, the following rules will always be tested:

  • that the control fields 001, 003, and 007 exist.
  • that the subfield 245$a exists.
  • that, if a field 655 exists, it contains the subfields a, 0, 0 (yes, twice!), and 2 with specific content:
    • a = "Rezension"
    • 0 = (DE-588)4049712-4
    • 0 = (DE-627)106186019
    • 2 = gnd-content
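
The hard-coded 655 rule above can be expressed as a small check. This is an illustrative Python sketch, not the validator's actual C++ code; fields are represented as simple (code, value) pairs rather than the harvester's real data structures.

```python
# The exact subfield contents required by the rule (note the two $0 values).
REQUIRED_655 = [("a", "Rezension"),
                ("0", "(DE-588)4049712-4"),
                ("0", "(DE-627)106186019"),
                ("2", "gnd-content")]

def field_655_is_valid(subfields):
    """A 655 field is valid iff it carries exactly the required subfields."""
    return sorted(subfields) == sorted(REQUIRED_655)

valid = field_655_is_valid([("a", "Rezension"),
                            ("0", "(DE-588)4049712-4"),
                            ("0", "(DE-627)106186019"),
                            ("2", "gnd-content")])
invalid = field_655_is_valid([("a", "Rezension")])  # missing $0/$0/$2
```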

User-configurable Validation

In our web interface, four different types of rules may be created:
rules that are specific to an individual serial (typically a journal) and rules that apply to all serials. Both types of rules can optionally be restricted to apply only to review articles, which yields the four combinations.

This whole mechanism is based on "expectation values" for the corresponding MARC-21 subfields.

  • ALWAYS: The subfield must be present in all records.
  • SOMETIMES: The subfield is allowed to be present or missing, depending on the provided information.
  • IGNORE: The subfield should be ignored entirely by the QA process.

Rule Propagation

Rules are always applied on the basis of a MARC field, not a subfield. This means that as soon as any rules are found for a field's subfields, rule propagation stops at that point. For example, suppose a record for Journal XXXX is being tested and field NNN is examined: if there are any subfield rules for Journal XXXX and field NNN, then any rules for field NNN that apply to all journals will not be applied!

The order in which the rules are applied depends on the type of record:

For regular articles:

  • journal settings
  • global settings

For review articles:

  • journal settings (review)
  • global settings (review)
  • journal settings (regular articles)
  • global settings (regular articles)

If there are serial-specific rules for the subfields of a field, those are applied. If there are none, but there are rules for that field that apply to all serials, then those rules are applied instead.
For review articles, the order is: rules for the specific serial and review articles, followed by generic review-article rules irrespective of the serial. Finally, if none of those rules matched, the rules for non-review articles mentioned above are applied.
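
The lookup order described above can be sketched as follows: rule sets are consulted in precedence order, and the first set defining any rules for the field wins, at which point propagation stops. This is an illustrative Python sketch, not the validator's actual implementation; the rule-set contents and the function signature are invented for the example.

```python
def resolve_rules(field, journal_rules, global_rules,
                  journal_review_rules=None, global_review_rules=None,
                  is_review=False):
    """Return the subfield rules for `field` from the highest-precedence
    rule set that defines any; propagation stops at the first match."""
    order = []
    if is_review:
        order += [journal_review_rules, global_review_rules]
    order += [journal_rules, global_rules]
    for rule_set in order:
        if rule_set and field in rule_set:
            return rule_set[field]
    return {}

# Invented rule sets: {MARC field: {subfield code: expectation value}}.
journal_rules = {"100": {"a": "ALWAYS"}}
global_rules = {"100": {"a": "IGNORE"}, "773": {"t": "ALWAYS"}}
journal_review_rules = {"100": {"a": "SOMETIMES"}}

# Journal-specific rules shadow the global ones for field 100 ...
regular = resolve_rules("100", journal_rules, global_rules)
# ... but field 773 falls through to the global rules.
fallthrough = resolve_rules("773", journal_rules, global_rules)
# For review articles, the review-specific rules take precedence.
review = resolve_rules("100", journal_rules, global_rules,
                       journal_review_rules, None, is_review=True)
```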

If a particular record fails to meet the requirements of any applicable rule, it is marked as delinquent. All delinquent records are then extracted from the original MARC-XML files to prevent them from being uploaded to the BSZ servers. They are instead saved to a secondary file for later perusal (see ZE020110\FID-Projekte\Default and Default_Test).

Archival

Records that have been delivered to the BSZ are archived locally for book-keeping. During archival, each record is keyed to its unique hash. If an incoming record is found to have already been delivered, i.e., it has a hash collision with an earlier record or a URL that was already delivered, it is not archived. This archive forms the basis of the delivery tracking system used by the Zotero harvester, i.e., it is queried during the harvesting process to eliminate redundant downloads.
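
The dedup check during archival can be sketched as below. This is an illustrative Python sketch, not the harvester's delivery database: the SHA-1 keying, the in-memory sets, and all names are assumptions made for the example.

```python
import hashlib

def archive(records, delivered_hashes, delivered_urls):
    """Archive (url, content) pairs, skipping any record whose content hash
    or URL has already been delivered; return the newly archived URLs."""
    newly_archived = []
    for url, content in records:
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in delivered_hashes or url in delivered_urls:
            continue                      # already delivered -> not archived
        delivered_hashes.add(digest)
        delivered_urls.add(url)
        newly_archived.append(url)
    return newly_archived

# "u2" repeats the content of "u1" (hash collision in the tracking sense),
# and the third entry reuses an already-delivered URL.
archived = archive([("u1", "rec-a"), ("u2", "rec-a"), ("u1", "rec-b")],
                   set(), set())
```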

Source Files
