Using the V1 synchronization service to control harvesters
Overview
This page describes how to use the existing V1 infrastructure to control V2 harvester components.
Uploading shared JARs
For each harvest technology and/or module JAR developed, use the File Uploader to upload it to V1, with the following fields:
- (binary upload)
- The title field should be the desired V2 "path" of the JAR, described as an absolute path from the "library root" `/app/aleph2/library`, eg `/app/aleph2/library/harvest_tech_1.jar`
- The first line of the description field should be the entry point, ie the class implementing `IHarvestTechnologyModule`, eg `com.ikanow.aleph2.test.example.ExampleHarvestTechnology`
  - (Or, for enrichment modules, the class implementing `IEnrichmentBatchModule`; or, for analytic technologies, the class implementing `IAnalyticTechnologyModule`.)
  - In some cases there will be no entry point, eg libraries that just need to be on the classpath. In this case use any string; it will be ignored, but the line must exist.
- If the next line starts with "{" then subsequent lines are treated as JSON and used as the `library_config` field of the bean. The JSON parser keeps reading until it encounters a line that doesn't start with a '"', '{', or '}' character.
- The remaining lines of the description field are copied directly into the V2 library bean (`SharedLibraryBean`), except that, optionally, the final line can be in the format "tags: tag1, tag2, ...", in which case each tag is copied into the V2 library bean.
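For illustration, the full description field for a harvest technology share might look like the following sketch (the entry point is the example class above; the configuration key, free-text line, and tags are hypothetical):

```
com.ikanow.aleph2.test.example.ExampleHarvestTechnology
{
  "some_config_key": "some_config_value"
}
An example harvest technology used for testing.
tags: example, harvest
```

Here the second through fourth lines are parsed as JSON and stored in `library_config`, the free-text line is copied into the `SharedLibraryBean`, and the final line populates its tags.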
Creating the Data Bucket
Visit the Source Editor, select "New Source" (top right), and select the template "V2 Data Bucket Template" from the "Create New Source" dropdown.
Edit the bucket JSON using `DataBucketBean` as the guide (see below for further details; note that the Source Builder UI is not currently available - support will be added if popular enough!)
The following should be noted during this configuration activity:
- The majority of the data bucket goes under `processingPipeline.data_bucket`, which should be a mirror of the desired `DataBucketBean` (see the example bucket JSON after this list), except for the following fields:
  - `created` and `modified` are auto-generated from the Source Editor (via the V1 source fields)
  - `display_name` is taken from the V1 source `title`, as specified in the Source Editor
  - `tags` is taken from the V1 source `tags`, as specified in the Source Editor
  - `owner_id` is auto-generated by the V1 API (via the V1 source field `ownerId`)
  - `access_rights` is auto-generated by the V1 API (via the V1 source field `communityIds`)
- The desired harvest technology should be referenced by its path (ie the `title` of the uploaded share from the previous step) in the `harvest_technology_name_or_id` field
  - Ditto for any libraries and modules, in the `harvest_configs.library_ids_or_names` field
- If `multi_node_enabled` is true, then one instance of each harvester is launched on every available API node (this makes no sense for technologies like Storm that are already distributed, but gives "free" scalability to eg standalone processes)
- The specific harvest configuration JSON format is defined by the harvest technology - consult its documentation; Aleph2 just passes it through in the `harvest_configs.config` field
- The specific enrichment configuration JSON format is defined by the enrichment topology used - consult its documentation; Aleph2 just passes it through in the `batch_enrichment_topology.config`/`batch_enrichment_configs.config`/`streaming_enrichment_topology.config`/`streaming_enrichment_configs.config` fields
- Currently only `streaming_enrichment_topology` and `batch_enrichment_configs` are available:
  - Streaming enrichment can be enabled by changing `master_enrichment_type` from `"none"` to `"streaming"`.
  - If `streaming_enrichment_topology.enabled` is set to `false` then a default topology is used, which takes data objects emitted from the harvesters and sends them to the enabled services (see below). If it is set to `true` then the topology pointed to by the "library names or ids" field is used instead.
  - Batch enrichment can be enabled by changing `master_enrichment_type` from `"none"` to `"batch"`.
- Automatic output to elasticsearch can be enabled by adding `"data_schema": { "search_index_schema": { "enabled": true, /*...*/ } }` to the bucket. To add temporal and columnar support, also add `"temporal_schema": { "enabled": true, /*...*/ }` and/or `"columnar_schema": { "enabled": true, /*...*/ }` inside `data_schema`.
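Putting the above together, a minimal `processingPipeline.data_bucket` element might look something like the sketch below. The JAR paths, the `config` contents, and the specific values are purely illustrative, and only the fields discussed above are shown (the auto-generated fields such as `created`, `display_name`, `owner_id` etc. are omitted); consult `DataBucketBean` for the authoritative field list:

```json
{
  "data_bucket": {
    "multi_node_enabled": false,
    "harvest_technology_name_or_id": "/app/aleph2/library/harvest_tech_1.jar",
    "harvest_configs": [
      {
        "library_ids_or_names": [ "/app/aleph2/library/harvest_lib_1.jar" ],
        "config": { /* harvest-technology-specific JSON, passed through unchanged */ }
      }
    ],
    "master_enrichment_type": "streaming",
    "streaming_enrichment_topology": {
      "enabled": false /* false => the default topology forwards objects to the enabled services */
    },
    "data_schema": {
      "search_index_schema": { "enabled": true },
      "temporal_schema": { "enabled": true },
      "columnar_schema": { "enabled": true }
    }
  }
}
```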
Once complete, hit publish in the Source Editor to load the bucket into V2.
Data Bucket Control
From the V1 API/Source Editor UI, the following controls are possible:
- Publish - will create or update the bucket; note the following special cases:
  - If a publish fails on a new source, the source will be marked as "non-approved" (to prevent the service from continually trying to synchronize a faulty source, and also to highlight the problem to the source developer)
  - If a publish fails on any source, the `harvest.harvest_status` field is marked as `error` and the `harvest.harvest_message` field contains a description of the errors.
- Suspend/Resume - will suspend or resume the bucket
- Delete - will delete the bucket
Note that it takes ~2 seconds for the synchronization service to pick up on an action; check the date fields logged in `harvest.harvest_message` if you are unsure whether an action has been applied yet.
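For example (a hypothetical illustration - the timestamp and message text are made up; only the field names and the `error` value come from the behaviour described above), a source whose publish failed might contain something like:

```json
"harvest": {
  "harvest_status": "error",
  "harvest_message": "2015-07-01T12:00:00 Bucket validation failed: harvest_technology_name_or_id not found"
}
```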
The following controls currently do nothing; this functionality will be added later:
- Test
- Delete documents
Note finally that the "JS" tabs in the Source Editor do nothing and should be ignored. (As noted above, the "Source Builder" UI is not available).