Schemas - Titousensei/sisyphus GitHub Wiki

Each input or action has an input schema, an output schema, or both. A schema is an ordered list of column names.

Inputs and Outputs each have only one schema. Because of the way actions link to each other, an Input has an output schema (the file is read and produces some columns), and an Output has an input schema (the columns it receives are written to a file). Don't worry too much about that; just remember that each has a single schema that does what you would expect.

Modifiers have both an input and an output schema. For example, a hash function requires the input of a few columns, and outputs a hash value.

Schema declaration often uses varargs. You can either use one String[] or several String objects to declare your schema.
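The two declaration forms can be illustrated with a minimal sketch. The class `VarargsSchemaDemo` and the method `declareSchema` are hypothetical stand-ins, not part of the Sisyphus API; they only show how a Java varargs parameter accepts both forms.

```java
import java.util.Arrays;

// Hypothetical stand-in for an action that declares a schema; not Sisyphus code.
class VarargsSchemaDemo {
    // A varargs parameter accepts either one String[] or several String arguments.
    static String[] declareSchema(String... columns) {
        // Copy defensively so later changes to the caller's array do not leak in.
        return Arrays.copyOf(columns, columns.length);
    }

    public static void main(String[] args) {
        // Form 1: several String objects.
        String[] a = declareSchema("id", "lastname", "firstname", "company");

        // Form 2: one String[].
        String[] cols = {"id", "lastname", "firstname", "company"};
        String[] b = declareSchema(cols);

        // Both calls produce the same schema.
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```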

The Pusher keeps the so-called "current row", with one value for each of the columns declared by all the actions. However, each action only sees a subset of the current row, called a "view". The Pusher populates the view for each action using the internal class SchemaAdapter.
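The relationship between the current row and a view can be sketched as an index mapping. This is only an illustration under assumptions: the internals of SchemaAdapter are not described on this page, and the names `ViewSketch`, `indexesOf`, `readView`, and `writeView` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only; this is not the real SchemaAdapter.
class ViewSketch {
    // Find where each view column sits inside the current row.
    static int[] indexesOf(List<String> rowSchema, String... viewSchema) {
        int[] idx = new int[viewSchema.length];
        for (int i = 0; i < viewSchema.length; i++) {
            idx[i] = rowSchema.indexOf(viewSchema[i]);
            if (idx[i] < 0) {
                throw new IllegalArgumentException("Unknown column: " + viewSchema[i]);
            }
        }
        return idx;
    }

    // Copy the selected columns of the current row into a view.
    static String[] readView(String[] currentRow, int[] idx) {
        String[] view = new String[idx.length];
        for (int i = 0; i < idx.length; i++) {
            view[i] = currentRow[idx[i]];
        }
        return view;
    }

    // Copy a view's values back into the current row.
    static void writeView(String[] currentRow, int[] idx, String[] view) {
        for (int i = 0; i < idx.length; i++) {
            currentRow[idx[i]] = view[i];
        }
    }

    public static void main(String[] args) {
        List<String> rowSchema =
            Arrays.asList("id", "lastname", "firstname", "company", "hashvalue");
        String[] currentRow = {"123", "Gaudet", "Eric", "TheFind", null};

        // An action that reads [firstname, lastname] sees only those two values.
        int[] in = indexesOf(rowSchema, "firstname", "lastname");
        System.out.println(Arrays.toString(readView(currentRow, in)));
        // prints [Eric, Gaudet]

        // An action that writes [hashvalue] updates only that column.
        int[] out = indexesOf(rowSchema, "hashvalue");
        writeView(currentRow, out, new String[] {"987654321"});
        System.out.println(Arrays.toString(currentRow));
        // prints [123, Gaudet, Eric, TheFind, 987654321]
    }
}
```

Note that the view's column order follows the action's own schema, not the order of the current row.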

The schemas of all the actions are validated before the push starts. The order of the columns may differ from one schema to another, but every input column of a given action must have been declared as an output column of at least one previous step. The Pusher will throw a SchemaException at the start if there is any mismatch.
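The validation rule can be sketched as a single pass over the steps. This is an assumption-laden illustration, not the Pusher's actual code: `SchemaCheckSketch` and `Step` are hypothetical, and the sketch throws `IllegalStateException` where the real Pusher throws SchemaException.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only; the real validation lives inside the Pusher.
class SchemaCheckSketch {
    // One step of the pipeline: the columns it reads and the columns it writes.
    static final class Step {
        final String name;
        final String[] in;
        final String[] out;

        Step(String name, String[] in, String[] out) {
            this.name = name;
            this.in = in;
            this.out = out;
        }
    }

    // Every input column must have been output by an earlier step.
    static void validate(Step... steps) {
        Set<String> declared = new HashSet<>();
        for (Step step : steps) {
            for (String col : step.in) {
                if (!declared.contains(col)) {
                    throw new IllegalStateException(
                        step.name + " reads undeclared column: " + col);
                }
            }
            for (String col : step.out) {
                declared.add(col);
            }
        }
    }

    public static void main(String[] args) {
        Step input = new Step("input", new String[0],
            new String[] {"id", "lastname", "firstname", "company"});
        Step hash = new Step("hash",
            new String[] {"firstname", "lastname"}, new String[] {"hashvalue"});
        Step output = new Step("output",
            new String[] {"id", "hashvalue"}, new String[0]);

        validate(input, hash, output); // passes: column order does not matter

        try {
            validate(input, output); // fails: nothing declared hashvalue
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```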

If you're implementing custom Inputs, Outputs, or Modifiers that need to remember previous rows, please note that views are mutable, so you will need to make a copy (using System.arraycopy if necessary).
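A minimal sketch of keeping a private copy of a view follows. The class `RememberRowSketch` and its `onRow` callback are hypothetical and do not correspond to a real Sisyphus interface; only `System.arraycopy` is a real JDK call.

```java
// Illustrative only; onRow is a hypothetical callback, not a Sisyphus method.
class RememberRowSketch {
    private String[] previous; // private copy of the last view seen

    void onRow(String[] view) {
        // Storing the reference (previous = view) would be a bug:
        // the caller may overwrite the same array for the next row.
        if (previous == null || previous.length != view.length) {
            previous = new String[view.length];
        }
        System.arraycopy(view, 0, previous, 0, view.length);
    }

    String[] previous() {
        return previous;
    }

    public static void main(String[] args) {
        RememberRowSketch action = new RememberRowSketch();
        String[] view = {"123", "Gaudet"};
        action.onRow(view);

        // Simulate the view being reused for the next row.
        view[0] = "456";

        System.out.println(action.previous()[0]); // prints 123
    }
}
```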

Example:

  Inputs and actions declare their schemas:

      Input file declares its schema:
      - [id, lastname, firstname, company]

      A hashing Modifier declares its input and output schemas:
      - Input: [firstname, lastname]
      - Output: [hashvalue]

      Output file declares its schema:
      - [id, hashvalue]

  The overall schema for the "current row" will be:
  * [id, lastname, firstname, company, hashvalue]

  Now we start the push. The current row has not yet been populated
  and might contain data from the previous rows that we must ignore.
  * [???, ???, ???, ???, ???]

      We read the next entry from the input file. The line might be:
      << "123\tGaudet\tEric\tTheFind\n"

      Input populates its view:
      - Output view: [123, Gaudet, Eric, TheFind]

  The values of the current row are now:
  * [123, Gaudet, Eric, TheFind, ???]

      The parameters passed to the Modifier will be two String[]:
      - Input view = [Eric, Gaudet]
      - Output view = [???]

      The modifier populates the output view:
      - Output view = [987654321]

  The values of the current row are now:
  * [123, Gaudet, Eric, TheFind, 987654321]

      The output file gets its view:
      - Input view: [123, 987654321]

      We write the next line to the output file:
      >> "123\t987654321\n"
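The Modifier step in the example above, which receives two String[] views, can be sketched as follows. The class `HashModifierSketch`, the method name `compute`, and the choice of `String.hashCode` are assumptions made for illustration; the value 987654321 in the example is a placeholder, so the hash printed here will differ.

```java
// Illustrative only; the real Modifier interface may use different names.
class HashModifierSketch {
    // Reads the input view and writes the result into the output view in place.
    static void compute(String[] inputView, String[] outputView) {
        // Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
        String key = String.join("\t", inputView);
        outputView[0] = Integer.toString(key.hashCode());
    }

    public static void main(String[] args) {
        String[] inputView = {"Eric", "Gaudet"}; // [firstname, lastname]
        String[] outputView = new String[1];     // [hashvalue], not yet populated

        compute(inputView, outputView);
        System.out.println(outputView[0]);
    }
}
```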
