# Passthrough service
The Passthrough service is a very simple enrichment module that just emits every record it receives unchanged (with a few "advanced" options, described below).
There are a few different scenarios under which it is useful:
- Debugging
- Batch enrichment pipelines where all the required processing (if any) has been performed in the harvester
- As described under batch enrichment, the first element in the pipeline cannot have a grouping field. Therefore if the only "user processing" desired is after the grouping, the pipeline can be (a configuration sketch follows this list):

  `PASS -> (group) USER`

  (where USER might be eg the Javascript enrichment module)
- The grouping field might just be used to "rebalance" the processing to a desired number of nodes (eg for some reason only one mapper is generated, so it makes sense to reduce across the cluster), in which case the pipeline can be:

  `PASS/USER1 -> (group) PASS -> USER2`

  (PASS/USER1 depending on whether there is any pre-processing to be performed) - or even:

  `PASS/USER1 -> (group) PASS -> PASS -> USER2`

  (see here for a brief discussion on how the batching/performance profile depends on the location of the module(s) with `grouping_fields`).
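As an illustration, the `PASS -> (group) USER` case could look something like the following as a batch enrichment pipeline. This is only a sketch: it assumes the enrichment configuration fields referenced elsewhere on this page (`entry_point`, `module_name_or_id`, `grouping_fields`, `config`), plus an assumed `name` field for each stage; the grouping field and the user module id are placeholders.

```
[
    {
        "name": "pass",
        "entry_point": "com.ikanow.aleph2.analytics.services.PassthroughService"
    },
    {
        "name": "user_processing",
        "grouping_fields": [ "some_grouping_field" ],
        "module_name_or_id": "some_user_module_id",
        "config": { ... }
    }
]
```

Here the Passthrough stage satisfies the "no grouping on the first element" rule, and the grouping is declared on the user module that actually needs it.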
The Passthrough service does not execute any user/unsafe code, and is therefore safe to make available to any user with "write" permissions on a bucket. (In fact it cannot be hidden, since it is built into Aleph2 rather than being provided via a Shared Library jar.)

The Passthrough service contains no additional Aleph2 logging.
The enrichment configuration should look like the following (no `module_name_or_id` is needed because the Passthrough service is built into Aleph2):
```
{
    "entry_point": "com.ikanow.aleph2.analytics.services.PassthroughService",
    "config": { ... }
}
```
No `config` needs to be specified at all, and the Passthrough service will work as advertised.
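For example, the minimal configuration with no `config` at all would simply be:

```
{
    "entry_point": "com.ikanow.aleph2.analytics.services.PassthroughService"
}
```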
Alternatively, there are a few "advanced configurations", defined by the following schema inserted into the `config` field of the enrichment configuration:
```
{
    "output": string
}
```
Where "output"
is one of:
- `"$$internal"` - this is the default: the module emits every data object it receives.
- `"$$stop"` - the module discards every data object it receives. (This can be useful in complex Spark or Storm topologies where some paths emit data and some don't: for the paths that end with 3rd party modules that do emit, you can just chain one of these onto the end.)
- `<bucket path>` - the most useful "advanced configuration": each object is emitted externally (not internally) to the specified bucket path.
  - The `"$<field>"` parameter described below allows users dynamic control over which bucket a data object is routed to.
  - Don't forget to set the top-level `allowed_external_paths` when externally emitting data objects.
- `"$<field>"` - the behavior is determined by the contents of each data object's field (nesting via dot notation supported), ie that field should contain the value `"$$internal"`, `"$$stop"`, or a bucket path as above. (An example follows this list.)
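For illustration, this is what the external-emit and dynamic-routing cases could look like; the bucket path `/external/target/bucket` and the field name `routing.destination` are just placeholders, not values defined by Aleph2. A fixed external destination:

```
{
    "entry_point": "com.ikanow.aleph2.analytics.services.PassthroughService",
    "config": { "output": "/external/target/bucket" }
}
```

And per-object routing, where each data object's `routing.destination` field is expected to contain `"$$internal"`, `"$$stop"`, or a bucket path:

```
{
    "entry_point": "com.ikanow.aleph2.analytics.services.PassthroughService",
    "config": { "output": "$routing.destination" }
}
```

In both external-emit cases, remember to list the target bucket paths in the bucket's top-level `allowed_external_paths`.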