Using the V1 synchronization service to control harvesters - IKANOW/Aleph2 GitHub Wiki

Overview

This page describes how to use the existing V1 infrastructure to control V2 harvester components.

Uploading shared JARs

For each harvest technology and/or module JAR developed, use the File Uploader to upload it to V1, setting the following fields:

  • (binary upload)
  • The title field should be the desired V2 "path" of the JAR, described as an absolute path from the "library root" (eg /app/aleph2/library) - eg /app/aleph2/library/harvest_tech_1.jar
  • The first line of the description field should be the entry point, ie the class implementing IHarvestTechnologyModule - eg com.ikanow.aleph2.test.example.ExampleHarvestTechnology
    • (or for enrichment modules, the class implementing IEnrichmentBatchModule, or for analytic technologies, the class implementing IAnalyticTechnologyModule)
    • In some cases there will be no entry point, eg libraries that just need to be on the classpath. In this case use any string; it will be ignored but must be present.
  • If the next line starts with "{" then subsequent lines are treated as JSON and used as the library_config field of the bean. The JSON parser will keep reading until it encounters a line that doesn't start with a '"', '{', or '}' character.
  • The remaining lines of the description field are copied directly into the V2 library bean (SharedLibraryBean), except for:
    • Optionally, the final line can be in the format: "tags: tag1, tag2, ...", and each tag is copied into the V2 library bean
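
For example, a share for the example harvest technology above might have the following fields (the library_config JSON and the free-text description line are purely illustrative):

    title:       /app/aleph2/library/harvest_tech_1.jar
    description: com.ikanow.aleph2.test.example.ExampleHarvestTechnology
                 {
                   "example_setting": "example_value"
                 }
                 An example harvest technology, for testing only.
                 tags: example, harvest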

Creating the Data Bucket

Visit the Source Editor, select "New Source" (top right), and select the template "V2 Data Bucket Template" from the "Create New Source" dropdown.

Edit the bucket JSON using DataBucketBean as the guide (see below for further details; note that the Source Builder UI is not currently available - support will be added if there is sufficient demand!).

The following should be noted during this configuration activity:

  • The majority of the data bucket configuration goes under processingPipeline.data_bucket, which should mirror the desired DataBucketBean, except for:
    • created and modified are auto-generated from the Source Editor (via the V1 source fields)
    • display_name is taken from the V1 source title, as specified in the Source Editor
    • tags is taken from the V1 source tags, as specified in the Source Editor
    • owner_id is auto-generated by the V1 API (via the V1 source field ownerId)
    • access_rights is auto-generated by the V1 API (via the V1 source field communityIds)
  • The desired harvest technology should be referenced by its path (ie the title of the uploaded share from the previous step) in the harvest_technology_name_or_id field
    • Ditto for any libraries and modules, in the harvest_configs.library_ids_or_names
  • If multi_node_enabled is true, then one instance of each harvester is launched on every available API node (this makes no sense for technologies like Storm that are distributed, but gives "free" scalability to eg standalone processes)
  • The specific harvest configuration JSON format is defined by the harvest technology - consult its documentation; Aleph2 just passes it through in the harvest_configs.config field
  • The specific enrichment configuration JSON format is defined by the enrichment topology used - consult its documentation; Aleph2 just passes it through in the batch_enrichment_topology.config / batch_enrichment_configs.config / streaming_enrichment_topology.config / streaming_enrichment_configs.config
    • Currently only streaming_enrichment_topology and batch_enrichment_configs are available:
      • Streaming enrichment can be enabled by changing master_enrichment_type from "none" to "streaming".
      • If streaming_enrichment_topology.enabled is set to false then a default topology is used that takes data objects emitted from the harvesters and sends them to the enabled services (see below). If it is set to true then the topology pointed to by the "library names or ids" field is used instead.
      • Batch enrichment can be enabled by changing master_enrichment_type from "none" to "batch".
  • Automatic output to Elasticsearch can be enabled by adding "data_schema": { "search_index_schema": { "enabled": true, /*...*/ } } to the bucket. To add temporal and columnar support, also add "temporal_schema": { "enabled": true, /*...*/ } and/or "columnar_schema": { "enabled": true, /*...*/ } inside "data_schema" (a full example is sketched below).
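
As a rough sketch only (the bucket path, library names, and config contents below are illustrative; the full set of available fields is defined by DataBucketBean, and the bucket path is assumed here to go in its full_name field), the object placed under processingPipeline.data_bucket might look something like:

    {
      "full_name": "/example/bucket_1",
      "harvest_technology_name_or_id": "/app/aleph2/library/harvest_tech_1.jar",
      "harvest_configs": [
        {
          "library_ids_or_names": [ "/app/aleph2/library/harvest_lib_1.jar" ],
          "config": { /* harvest-technology-specific JSON, passed through untouched */ }
        }
      ],
      "multi_node_enabled": false,
      "master_enrichment_type": "streaming",
      "streaming_enrichment_topology": { "enabled": false },
      "data_schema": {
        "search_index_schema": { "enabled": true },
        "temporal_schema": { "enabled": true },
        "columnar_schema": { "enabled": true }
      }
    }

Here master_enrichment_type is set to "streaming" with streaming_enrichment_topology.enabled left false, so the default topology described above would forward harvested objects to the enabled data services.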

Once complete, hit publish in the Source Editor to load into V2.

Data Bucket Control

From the V1 API/Source Editor UI, the following controls are possible:

  • Publish - will create or update the bucket, note the following special cases:
    • If a publish fails on a new source, the source will be marked as "non-approved" (to prevent the service from continually trying to synchronize a faulty source, and to highlight the problem to the source developer)
    • If a publish fails on any source, the harvest.harvest_status field is marked as error and the harvest.harvest_message field contains a description of the errors.
  • Suspend/Resume - will suspend or resume the bucket
  • Delete - will delete the bucket

Note that it takes ~2 seconds for the synchronization service to pick up on an action; check the date fields logged in harvest.harvest_message if you are unsure whether an action has been applied yet.
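
For example, a failed publish might leave the V1 source's harvest block looking roughly like the following (the timestamp and message text are purely illustrative):

    "harvest": {
      "harvest_status": "error",
      "harvest_message": "2015-06-01T12:34:56 Bucket synchronization failed: <description of the errors>"
    }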

The following controls currently do nothing; this functionality will be added later:

  • Test
  • Delete documents

Note finally that the "JS" tabs in the Source Editor do nothing and should be ignored. (As noted above, the "Source Builder" UI is not available).