Incremental Workflow Design

The incremental workflow is a Cascading-based workflow that runs continuously in Hadoop and GigaSpaces, processing incoming Datasets as they arrive.

Incremental Ingestion Workflow

The incremental ingestion workflow is a Cascading workflow that runs in Hadoop (or as a daemon?) and parses incoming Datasets to generate Records as Space Documents in GigaSpaces.

This is the same as the bulk ingestion workflow, except that the output is written via a GigaSpaces Tap rather than an Hfs Tap with a SequenceFile scheme (see the sketch below).
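
As a rough illustration (not the actual atomizer code), the two ingestion workflows could share a single parsing pipe assembly and differ only in the sink tap handed to the flow connector. The sketch below assumes Cascading 2.x APIs; the factory class and Record field names are hypothetical, and the GigaSpaces-backed tap is only referenced in a comment since this design does not name a concrete class for it.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.SequenceFile;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class IngestionFlowFactory {

    // Hypothetical Record fields; the real field names live in the atomizer code.
    private static final Fields RECORD_FIELDS =
        new Fields("recordId", "attributeName", "attributeValue");

    // The parsing pipe assembly is shared by bulk and incremental ingestion;
    // only the sink tap passed to the flow connector differs.
    public static Flow createIngestionFlow(Tap source, Tap sink, Pipe parseAssembly) {
        return new HadoopFlowConnector(new Properties()).connect(source, sink, parseAssembly);
    }

    // Bulk ingestion sink: Records land in HDFS as SequenceFiles.
    public static Tap createBulkSink(String outputPath) {
        return new Hfs(new SequenceFile(RECORD_FIELDS), outputPath, SinkMode.REPLACE);
    }

    // The incremental workflow would instead pass a GigaSpaces-backed tap
    // (not shown; no concrete class is specified here) that writes each
    // Record tuple into the space as a Space Document.
}
```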

Incremental Normalization Workflow

The incremental normalization workflow is a Cascading workflow that runs in GigaSpaces. It is responsible for parsing incoming Attribute tuples and generating normalized data as Anchor tuples.

This is the same as the build normalization workflow, except that:

  1. It uses the Anchor Atom Vectors to do a better job of assigning confidence levels to the resulting Anchor tuples.
  2. It tracks statistics for each Anchor and aborts the workflow if the confidence level drops too low, which typically indicates invalid input data (e.g. the wrong data in a record field). See the sketch after this list.
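
A minimal sketch, assuming Cascading 2.x, of how the confidence tracking and abort behavior in item 2 might be modeled as a custom Filter operation. The class name, field names ("anchor", "confidence"), and thresholds are all assumptions, not part of the atomizer schema; note also that operation state in a distributed flow is per-task, so a real implementation would likely aggregate statistics via counters or the space itself.

```java
import java.util.HashMap;
import java.util.Map;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.operation.FilterCall;

// Hypothetical guard: tracks a running average confidence per Anchor and
// fails the flow (by throwing) when that average drops below a threshold.
@SuppressWarnings("serial")
public class LowConfidenceGuard extends BaseOperation<Void> implements Filter<Void> {

    private static class Stats {
        long count;
        double confidenceSum;
    }

    private final double minAverageConfidence;
    private final long minSamples;
    private final Map<String, Stats> statsPerAnchor = new HashMap<String, Stats>();

    public LowConfidenceGuard(double minAverageConfidence, long minSamples) {
        this.minAverageConfidence = minAverageConfidence;
        this.minSamples = minSamples;
    }

    @Override
    public boolean isRemove(FlowProcess flowProcess, FilterCall<Void> filterCall) {
        String anchor = filterCall.getArguments().getString("anchor");
        double confidence = filterCall.getArguments().getDouble("confidence");

        Stats stats = statsPerAnchor.get(anchor);
        if (stats == null) {
            stats = new Stats();
            statsPerAnchor.put(anchor, stats);
        }
        stats.count += 1;
        stats.confidenceSum += confidence;

        double average = stats.confidenceSum / stats.count;
        if (stats.count >= minSamples && average < minAverageConfidence) {
            // Throwing from an operation fails the flow, which is how this
            // sketch models "abort the workflow" from item 2 above.
            throw new RuntimeException("Average confidence for anchor '" + anchor
                    + "' dropped to " + average + ", likely invalid input data");
        }

        // Never filter tuples out; this operation only watches statistics.
        return false;
    }
}
```

Throwing from the operation is simply the most direct way to fail a Cascading flow in a sketch like this; the actual abort mechanism (and where the statistics live) is an open implementation question.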

??? Should we also generate per-Attribute statistics in the build analysis workflow, and use that data to detect invalid record fields?