# Deduplication service
The Deduplication service supports the document-oriented concept of "uniqueness" of data objects.
The Deduplication service can also be used as a "semi-efficient" lookup engine against the data of other buckets (single or multiple).
The Deduplication service does not itself execute any user/unsafe code, and is therefore safe to enable for any user with "write" permissions on a bucket. (In fact it cannot be hidden, since it is built into Aleph2 rather than being provided via a Shared Library JAR.)
(Note that the Deduplication service does allow other modules to be executed, which may themselves be unsafe - the security settings on those modules still apply, however, so the Deduplication service doesn't make the system any more/less safe.)
The following system logging subsystems are used (see also below under "Logging"):
- subsystem: `"DeduplicationService"` or `"DeduplicationService.<name>"`, command: `.onStageComplete`, level: `DEBUG` - a summary of the deduplication processing that occurred, including the following statistics:
    - `noduplicate_keys`: the number of unique lookup/deduplication keys for which there were no matching stored data objects
    - `duplicates_incoming`: the number of input data objects
    - `duplicates_existing`: the number of data objects already in the data store matched by incoming data objects
    - `duplicate_keys`: the number of unique lookup/deduplication keys for which there were matching stored data objects
    - `deleted`: the number of data objects automatically or manually deleted in the custom deduplication stage
- subsystem: `"DeduplicationService"` or `"DeduplicationService.<name>"`, command: `.onStageComplete`, level: `ERROR` - a miscellaneous error occurred.
In the above logging, if the Deduplication service is performing "system deduplication" (ie driven by the document schema in the top-level data schema), then the subsystem name is simply "DeduplicationService" (and the command name is `"system.onStageComplete"`). If it is an overridden lookup/deduplication stage (ie configured from `doc_schema_override`), then the subsystem name is the element name (or "no_name" if no name was set).
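As a purely illustrative sketch of such a DEBUG summary - the envelope fields and values below are invented for this example, and only the statistic names come from the list above - the logged statistics might look something like:

```
// illustrative only: envelope fields and values are invented,
// the statistic names are those documented above
{
	"subsystem": "DeduplicationService",
	"command": "system.onStageComplete",
	"level": "DEBUG",
	"noduplicate_keys": 950,
	"duplicates_incoming": 1000,
	"duplicates_existing": 40,
	"duplicate_keys": 38,
	"deleted": 2
}
```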
As mentioned above, the Deduplication service can be used in two ways:
- Deduplication
- Lookup
In both cases the configuration is controlled by the same object, the Document Schema. Where the schema is specified differs, however, as explained below.
In the former case, it is currently necessary to insert a Deduplication service module into the pipeline explicitly (the roadmap intention is/was to integrate it into the MultiDataServiceOutput so that it runs by default whenever the bucket's document schema is enabled). It can go either at the beginning of the pipeline, to avoid the expense of subsequent processing on duplicates, or towards the middle/end, if some processing is needed before the deduplication can be performed.
In the latter case, you insert the Deduplication service wherever you want to perform the lookup, and configure it using the `doc_schema_override` field described below.
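As a minimal sketch of the first placement (the two-element pipeline layout below assumes the standard enrichment pipeline format, and `my_processing` is a hypothetical downstream module - substitute your actual pipeline definition):

```
[
	{
		// dedup placed first, so duplicates are discarded before any
		// expensive downstream processing runs
		"name": "dedup",
		"entry_point": "com.ikanow.aleph2.analytics.services.DeduplicationService"
	},
	{
		// hypothetical follow-on stage
		"name": "my_processing",
		"module_name_or_id": "my_processing_module"
	}
]
```

Moving the Deduplication service later in the pipeline instead lets earlier stages compute whatever fields the deduplication/lookup needs.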
One final complication is associated with grouping. First off, note that there are a few reasons to want to group as part of deduplication:
- By grouping on the deduplication keyset, you can minimize the number of lookups performed (this could be significant if there is a large number of incoming objects with a relatively small keyset, which is probably a fringe case).
- If you are performing custom merges and you don't group by the deduplication keyset, then you can run into problems if the data store is non-transactional/eventually consistent (and eg Elasticsearch is very eventually consistent, with a multi-second index refresh time), ie two different nodes can try to merge against the existing object and changes will be lost.
    - (There are technology-specific ways of getting around this, eg upserts, but since every technology implements a different version of this, it currently isn't supported as an Aleph2 option.)
    - (But if you do group by the deduplication keyset, the problem goes away, since each data object will then only be updated by a single thread.)
So basically: in most cases, if you are just overwriting or leaving the existing object as the result of deduplication, then don't group; if you are doing custom merges, then do group.
If you do group then there's a complication - you can't just do (eg) `PASS->(group)DEDUP`, because each unique key would generate a separate lookup, which is typically going to kill the overall performance. But equally you can't do (eg) `PASS->(group)PASS->DEDUP`, because the `PASS->DEDUP` stage would discard the partitioning by keyset. Therefore what you have to do is `PASS->(group)USER1->DEDUP`, where `USER1` is a user module (eg a JS script) that merges all the incoming duplicates into a single "diff" object, and then those diff objects (one per keyset) can be batched in order to generate efficient lookups.
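Sketching that `PASS->(group)USER1->DEDUP` shape as a pipeline configuration - note that the `grouping_fields` setting, the keyset field `url`, and the `merge_duplicates.js` script are all illustrative assumptions, not verified names:

```
[
	{ "name": "pass" },
	{
		// hypothetical user stage: groups by the deduplication keyset and
		// merges each group of incoming duplicates into a single "diff" object
		"name": "user1",
		"grouping_fields": [ "url" ],
		"module_name_or_id": "merge_duplicates.js"
	},
	{
		// receives one diff object per keyset, which can be batched
		// into efficient lookups
		"name": "dedup",
		"entry_point": "com.ikanow.aleph2.analytics.services.DeduplicationService"
	}
]
```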
The full enrichment configuration for the deduplication service is:
```
{
	"entry_point": "com.ikanow.aleph2.analytics.services.DeduplicationService",
	"library_names_or_ids": [ string ],
	"config": {
		"doc_schema_override": { ... }
	}
}
```
Where:
- If running in `custom` mode, then `library_names_or_ids` should contain any required shared libraries (since the `module_name_or_id` and `library_names_or_ids` fields nested under the root `data_schema.doc_schema` or the above `doc_schema_override` fields are ignored).
- `doc_schema_override` has the same format as the document schema described here.
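To make the override concrete, here is a hedged sketch of a lookup/deduplication stage. The nested field names (`deduplication_fields`, `deduplication_policy`) are assumptions about the document schema format - verify them against the document schema page referenced above before use:

```
{
	"entry_point": "com.ikanow.aleph2.analytics.services.DeduplicationService",
	"config": {
		"doc_schema_override": {
			// assumed document schema fields - check the schema page for
			// the authoritative names and values
			"deduplication_fields": [ "url" ],
			"deduplication_policy": "leave"
		}
	}
}
```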