The Atomizer wiki documents the design of the Atomizer workflows in detail.

Workflows

There are two workflows - bulk and incremental.

The Bulk Workflow is responsible for batch processing of (typically) archived data, and for generating feature vectors for the various Anchors (described below). It is a Cascading-based workflow that runs on top of a Hadoop cluster.
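
As a rough illustration, the bulk side could be expressed as a small Cascading 2.x flow like the sketch below. The tap paths, field names, and the ExtractAtoms function are hypothetical placeholders, not the actual Atomizer code; the sketch just shows the general shape of reading archived records, splitting attribute text into Atoms, and counting Atom frequencies per Anchor.

```java
// Hypothetical sketch of a bulk flow (Cascading 2.x on Hadoop).
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowProcess;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.CountBy;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class BulkWorkflowSketch {

    // Splits an attribute's text into whitespace-separated Atoms.
    @SuppressWarnings({"serial", "rawtypes"})
    public static class ExtractAtoms extends BaseOperation implements Function {
        public ExtractAtoms(Fields fieldDeclaration) {
            super(1, fieldDeclaration);
        }

        @Override
        public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
            String text = functionCall.getArguments().getString(0);
            for (String atom : text.toLowerCase().split("\\s+")) {
                functionCall.getOutputCollector().add(new Tuple(atom));
            }
        }
    }

    public static void main(String[] args) {
        // Source: archived records, already tagged with a candidate anchor name.
        Tap source = new Hfs(new TextDelimited(new Fields("anchor", "text"), "\t"),
                "hdfs:///atomizer/archive");
        // Sink: (anchor, atom, count) triples, i.e. the sparse Anchor vectors.
        Tap sink = new Hfs(new TextDelimited(new Fields("anchor", "atom", "count"), "\t"),
                "hdfs:///atomizer/anchor-vectors", SinkMode.REPLACE);

        Pipe pipe = new Pipe("anchor-vectors");
        // Emit one (anchor, atom) tuple per Atom found in each attribute's text.
        pipe = new Each(pipe, new Fields("text"), new ExtractAtoms(new Fields("atom")),
                new Fields("anchor", "atom"));
        // Count Atom occurrences per Anchor to build the frequency vectors.
        pipe = new CountBy(pipe, new Fields("anchor", "atom"), new Fields("count"));

        Flow flow = new HadoopFlowConnector(new Properties()).connect(source, sink, pipe);
        flow.complete();
    }
}
```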

The Incremental Workflow is responsible for continuous processing of incoming Datasets. It applies Anchor feature vectors to detect bad Datasets (e.g. records with the wrong fields) and to generate improved confidence levels for Anchor assignments (possibly among other outputs, still to be determined).
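
In the same spirit, here is a hedged sketch of how incremental scoring might work: compare an incoming Attribute's Atoms against a pre-computed Anchor vector and produce a confidence score. The smoothed log-likelihood formula is an illustrative assumption, not the actual Atomizer scoring logic.

```java
// Hypothetical scoring of incoming Atoms against one Anchor vector.
import java.util.Map;

public class AnchorScorer {

    private final Map<String, Long> atomCounts;  // atom -> bulk-computed frequency
    private final long totalAtoms;

    public AnchorScorer(Map<String, Long> anchorVector) {
        this.atomCounts = anchorVector;
        long total = 0;
        for (long count : anchorVector.values()) {
            total += count;
        }
        this.totalAtoms = total;
    }

    /**
     * Average smoothed log-probability of the given atoms under this Anchor's
     * frequency vector. Higher is a better match; a score below some threshold
     * could flag a bad Dataset (e.g. wrong fields in records).
     */
    public double score(String[] atoms) {
        double logProb = 0.0;
        for (String atom : atoms) {
            long count = atomCounts.getOrDefault(atom, 0L);
            // Add-one smoothing so unseen atoms penalize but don't zero the score.
            logProb += Math.log((count + 1.0) / (totalAtoms + atomCounts.size()));
        }
        return logProb / atoms.length;
    }
}
```

Scoring an Attribute against every known Anchor and keeping the best-scoring one would then yield both the Anchor assignment and its confidence level.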

Data Types

Input Data

  • Dataset (reference id, date), consisting of one or more Records. Datasets could come from a file, a set of files, or other input sources.
  • Record (uuid), consisting of one or more Attributes.
  • Attribute (key), containing text data.
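
One plausible Java rendering of these input types is sketched below; only the fields named in parentheses above come from the design, and the rest (e.g. the text field on Attribute) is assumed.

```java
import java.util.Date;
import java.util.List;
import java.util.UUID;

/** A Dataset groups the Records read from a file, set of files, or other source. */
class Dataset {
    public String referenceId;
    public Date date;
    public List<Record> records;
}

/** A Record is one or more Attributes, identified by a UUID. */
class Record {
    public UUID uuid;
    public List<Attribute> attributes;
}

/** An Attribute is a keyed piece of text data. */
class Attribute {
    public String key;
    public String text;
}
```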

Processed Data

  • Anchor (name), consisting of one or more Atoms of data and an optional confidence level. This represents a possible assignment of raw incoming data to a specific type of data (e.g. First Name).
  • Atom, a single indivisible unit of data (typically text, but possibly an integer, currency amount, or date).
  • Anchor vector (name), a sparse vector consisting of Atoms and corresponding frequencies for the given Anchor.
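
And a matching sketch for the processed types, under the same caveat that everything beyond the names in parentheses is illustrative:

```java
import java.util.List;
import java.util.Map;

/** An Atom is a single indivisible unit of data, stored here as text. */
class Atom {
    public String value;
}

/** An Anchor maps data to a specific type (e.g. "First Name"). */
class Anchor {
    public String name;
    public List<Atom> atoms;
    public Double confidence;  // null until a confidence level is computed
}

/** A sparse frequency vector over the Atoms seen for one Anchor. */
class AnchorVector {
    public String name;
    public Map<String, Long> atomFrequencies;  // atom value -> count
}
```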