Bulk Workflow Design - ScaleUnlimited/atomizer GitHub Wiki
The Bulk Workflow System consists of three workflows: bulk ingestion, bulk normalization, and bulk analysis.
The bulk ingestion workflow is a Cascading workflow responsible for creating Attributes from raw input data.
The resulting Attribute tuples, as stored in one or more SequenceFiles, contain the following fields:
- Dataset date
- Dataset reference id
- Record uuid
- Attribute key
- Attribute value
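As a rough sketch, the five Attribute fields could be mirrored by a plain Java class like the one below. The class and field names are illustrative; in the actual workflow these are fields of a Cascading Tuple stored in SequenceFiles, not a standalone type.

```java
import java.time.LocalDate;
import java.util.UUID;

public class AttributeTuple {
    // Hypothetical plain-Java mirror of the five Attribute tuple fields.
    public final LocalDate datasetDate;   // Dataset date
    public final String datasetRefId;     // Dataset reference id
    public final UUID recordUuid;         // Record uuid
    public final String key;              // Attribute key
    public final String value;            // Attribute value

    public AttributeTuple(LocalDate datasetDate, String datasetRefId,
                          UUID recordUuid, String key, String value) {
        this.datasetDate = datasetDate;
        this.datasetRefId = datasetRefId;
        this.recordUuid = recordUuid;
        this.key = key;
        this.value = value;
    }
}
```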
The bulk normalization workflow is a Cascading workflow responsible for parsing the incoming Attribute tuples and generating normalized data as Anchor tuples.
Based on the Attribute key and the contents of the Attribute value, the parsing code will create one or more Anchor tuples, where each Anchor tuple consists of the following fields:
- Dataset date
- Dataset reference id
- Record uuid
- Attribute key
- Anchor name
- Anchor confidence (float) - 0.0 to 1.0, represents confidence in Anchor assignment.
- Atom list - one or more strings, each containing one "Atom" of data extracted from the Attribute value.
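To make the one-Attribute-to-many-Anchors step concrete, here is a minimal sketch of a parse rule. The `full_name` Attribute key, the split-on-comma rule, and the confidence values are all hypothetical examples, not the workflow's actual parsing logic; the real code would also carry the dataset date, reference id, record uuid, and Attribute key through to each Anchor tuple.

```java
import java.util.ArrayList;
import java.util.List;

public class NameParser {
    // Trimmed-down Anchor result: anchor name, confidence, extracted Atoms.
    record Anchor(String name, float confidence, List<String> atoms) {}

    // Hypothetical rule: a "full_name" Attribute value like "SMITH, KEN"
    // yields a last_name Anchor and a first_name Anchor, each with one Atom.
    static List<Anchor> parse(String attributeKey, String attributeValue) {
        List<Anchor> result = new ArrayList<>();
        if ("full_name".equals(attributeKey) && attributeValue.contains(",")) {
            String[] parts = attributeValue.split(",", 2);
            result.add(new Anchor("last_name", 0.9f, List.of(parts[0].trim())));
            result.add(new Anchor("first_name", 0.9f, List.of(parts[1].trim())));
        } else {
            // Fallback: a single low-confidence Anchor named after the key.
            result.add(new Anchor(attributeKey, 0.5f, List.of(attributeValue.trim())));
        }
        return result;
    }
}
```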
The bulk analysis workflow is a Cascading workflow responsible for aggregating counts for unique Atom values associated with each separate Anchor. We expect that there will be 100 to 200 unique Anchors (e.g. first_name, last_name).
Aggregation can optionally be done for a subset of the data:
- date range
- Dataset (by reference id)
In addition, a Dataset aggregation can use an optional date range.
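The core of the aggregation (grouping by Anchor, counting unique Atom values, with the optional date-range and Dataset filters) can be sketched in plain Java as below. In the real system this is a distributed Cascading GroupBy/aggregation, not an in-memory map; the record and method names are assumptions for illustration.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class AnchorAtomCounter {
    // One normalized row's grouping keys (names are illustrative).
    record AnchorAtom(LocalDate datasetDate, String datasetRefId,
                      String anchor, String atom) {}

    // Count occurrences of each unique Atom value per Anchor. A null
    // "from"/"to" date or null datasetRefId means "no filter", mirroring
    // the optional date-range and Dataset subsetting described above.
    static Map<String, Map<String, Long>> count(Iterable<AnchorAtom> rows,
            LocalDate from, LocalDate to, String datasetRefId) {
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (AnchorAtom row : rows) {
            if (from != null && row.datasetDate().isBefore(from)) continue;
            if (to != null && row.datasetDate().isAfter(to)) continue;
            if (datasetRefId != null && !datasetRefId.equals(row.datasetRefId())) continue;
            counts.computeIfAbsent(row.anchor(), k -> new HashMap<>())
                  .merge(row.atom(), 1L, Long::sum);
        }
        return counts;
    }
}
```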
There are two result sets generated by the analysis workflow: Anchor Atom Counts and Anchor Atom Vectors.
The Anchor Atom Counts results contain every unique Atom value, and its count, for every Anchor. These are grouped by Anchor, and reverse sorted (high to low) by count. The output format is tab-separated text:
`Anchor<TAB>Atom<TAB>count`
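A minimal sketch of producing that output from per-Anchor count maps follows. Grouping here is done by iterating anchors in sorted order (an assumption; the wiki only says the rows are grouped by Anchor), with each group reverse-sorted by count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class CountsFormatter {
    // Emit tab-separated "Anchor<TAB>Atom<TAB>count" lines, grouped by
    // Anchor and reverse-sorted (high to low) by count within each group.
    static List<String> format(Map<String, Map<String, Long>> counts) {
        List<String> lines = new ArrayList<>();
        for (String anchor : new TreeSet<>(counts.keySet())) {
            counts.get(anchor).entrySet().stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                  .forEach(e -> lines.add(anchor + "\t" + e.getKey() + "\t" + e.getValue()));
        }
        return lines;
    }
}
```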
The Anchor Atom Vectors results contain a vector for each Anchor, stored as a series of serialized Mahout RandomAccessSparseVector objects. The vector name is the Anchor name, the vector values are Atom frequencies, and the indices are the int (4-byte) hash of the Atom value.
As an example, for the Anchor first_name, if the Atom value "KEN" occurred 2.6% of the time, then there would be a vector named "first_name" where the value at position "KEN".hashCode() would be 0.026.
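The hash-indexed sparse layout can be sketched with a plain `HashMap` standing in for Mahout's `RandomAccessSparseVector` (the real workflow serializes actual Mahout vector objects; this stand-in only shows the indexing scheme):

```java
import java.util.HashMap;
import java.util.Map;

public class AtomFrequencyVector {
    // Stand-in for a Mahout RandomAccessSparseVector: a sparse map from the
    // Atom value's 4-byte String.hashCode() to its relative frequency.
    final String name;                                  // the Anchor name
    final Map<Integer, Double> values = new HashMap<>();

    AtomFrequencyVector(String name) { this.name = name; }

    void set(String atom, double frequency) {
        values.put(atom.hashCode(), frequency);
    }

    double get(String atom) {
        return values.getOrDefault(atom.hashCode(), 0.0);
    }
}
```

So for the example above, `new AtomFrequencyVector("first_name").set("KEN", 0.026)` stores 0.026 at index `"KEN".hashCode()`. Note that `String.hashCode()` can be negative, so real indexing code has to tolerate the full signed int range.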
These vectors are used in the incremental workflow (see below) when assigning confidence levels to Anchors during parsing, and also when determining if a specific file or entire dataset has the correct data in each field.
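The wiki doesn't specify the exact measure used when comparing a field's observed Atom frequencies against a stored Anchor vector; one plausible choice, shown purely as an assumption, is cosine similarity over the sparse hash-indexed maps:

```java
import java.util.Map;

public class VectorSimilarity {
    // Hypothetical scoring step: cosine similarity between two sparse
    // hash-indexed frequency vectors (index -> frequency). A score near 1.0
    // would suggest the field's value distribution matches the Anchor's.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```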