Home - Titousensei/sisyphus GitHub Wiki
Introduction to Sisyphus
A Sisyphus program generally consists of one or several "pushes", which are loops through all the rows of the inputs. The DVDs example in the README has 2 pushes. For each push, the program will create a Pusher object, and declare the list of actions to apply to each row (which can be modifications, lookups and outputs). Finally, the program will tell the Pusher to push() the inputs, which will start the process of reading and processing the data.
Inputs of a push are processed as a single stream of rows. For each row, all the actions will be executed (some actions can have an if condition). There's no way to backtrack to a previous row.
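The push model described above can be sketched in plain Java. This is a minimal illustration, not the Sisyphus API: `MiniPusher`, `always`, `onlyIf` and the sample rows are hypothetical names invented for this example. It only shows the shape of the idea: declare the actions first, then make one forward pass over the rows, running each action whose condition holds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in for a pusher; NOT the Sisyphus Pusher class.
class MiniPusher {
    // One action: a condition plus the work to do on a row.
    private static final class Step {
        final Predicate<String[]> condition;
        final Consumer<String[]> body;

        Step(Predicate<String[]> condition, Consumer<String[]> body) {
            this.condition = condition;
            this.body = body;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    // Unconditional action: runs on every row.
    MiniPusher always(Consumer<String[]> body) {
        steps.add(new Step(row -> true, body));
        return this;
    }

    // Conditional action: runs only when the condition holds for the row.
    MiniPusher onlyIf(Predicate<String[]> condition, Consumer<String[]> body) {
        steps.add(new Step(condition, body));
        return this;
    }

    // Single forward pass: each row is seen once, in order, never revisited.
    void push(Iterable<String[]> input) {
        for (String[] row : input) {
            for (Step step : steps) {
                if (step.condition.test(row)) {
                    step.body.accept(row);
                }
            }
        }
    }
}

class MiniPusherDemo {
    static List<String> run() {
        // Made-up sample rows: id, title
        List<String[]> input = List.of(
            new String[] {"1", "alien"},
            new String[] {"2", "brazil"},
            new String[] {"3", "casablanca"});
        List<String> out = new ArrayList<>();
        new MiniPusher()
            .always(row -> row[1] = row[1].toUpperCase(Locale.ROOT)) // modification
            .onlyIf(row -> !row[0].equals("2"),                      // conditional output
                    row -> out.add(row[0] + "\t" + row[1]))
            .push(input);
        return out;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Actions run in the order they were declared, so the output action sees the title already upper-cased by the modification before it.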
Each action can have its own schema, so a given action only sees a subset of the row's global schema. This lets an output print just the few columns it needs, in any order. It also isolates the data each piece of code can touch, so a bug cannot modify the wrong columns.
Different actions can use the columns of the inputs or of other actions simply by using the same column names in their schemas. Each row is loaded into the "row container", a String array holding the values of all the columns of every schema of all the inputs and actions. The columns are not typed: everything is a String. For best performance, the String values of the columns are populated "by reference": the objects are not copied, only the pointers.
As explained above, outputs are like any other action, so a single push can have multiple outputs, including outputs with different columns and outputs with different filtering conditions.
The example above also shows actions using hashtables as in-memory lookup tables. Sisyphus has several hashtables (in addition to the regular Java HashMaps), each for a different purpose and with different benefits. These hashtables are designed to be persistent: they can be saved to disk and loaded from disk.
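One way to picture the row container and the per-action schemas is a single String array plus, for each schema, a small index array mapping that schema's columns to slots in the shared array. The sketch below assumes exactly that layout. `RowContainerSketch` and `View` are hypothetical names, and the real Sisyphus classes may be organized differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a shared row container; names are illustrative only.
class RowContainerSketch {
    private final Map<String, Integer> slots = new LinkedHashMap<>();
    private String[] values = new String[0];

    // Register a schema: a new column name gets a new slot, a known name reuses its slot.
    View view(String... columns) {
        int[] index = new int[columns.length];
        for (int i = 0; i < columns.length; i++) {
            Integer slot = slots.get(columns[i]);
            if (slot == null) {
                slot = slots.size();
                slots.put(columns[i], slot);
            }
            index[i] = slot;
        }
        values = Arrays.copyOf(values, slots.size()); // grow the shared array if needed
        return new View(index);
    }

    // A view only reaches the slots named in its own schema, in its own order.
    final class View {
        private final int[] index;

        View(int[] index) {
            this.index = index;
        }

        String get(int i) {
            return values[index[i]];
        }

        void set(int i, String v) {
            values[index[i]] = v; // stores the reference, the String is not copied
        }
    }
}

class RowContainerDemo {
    static String run() {
        RowContainerSketch row = new RowContainerSketch();
        RowContainerSketch.View input = row.view("id", "title", "year");
        RowContainerSketch.View output = row.view("year", "id"); // subset, reordered
        input.set(0, "42");
        input.set(1, "Brazil");
        input.set(2, "1985");
        return output.get(0) + "," + output.get(1);
    }

    static boolean sharesReference() {
        RowContainerSketch row = new RowContainerSketch();
        RowContainerSketch.View a = row.view("title");
        RowContainerSketch.View b = row.view("title", "id");
        String title = new String("Brazil");
        a.set(0, title);
        return b.get(0) == title; // same object seen through both views
    }
}
```

Here the output view has no way to address the `title` slot at all, which is the isolation property: code holding that view cannot overwrite a column outside its schema.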
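As a rough illustration of several outputs in one pass, the sketch below feeds each row to two output actions that select different columns and apply different filters. `ColumnOutput` and the sample rows are hypothetical and only stand in for the idea. Real outputs would write to files rather than to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical output action: picks its own columns and applies its own filter.
class ColumnOutput {
    private final int[] columns;
    private final Predicate<String[]> filter;
    final List<String> lines = new ArrayList<>();

    ColumnOutput(Predicate<String[]> filter, int... columns) {
        this.filter = filter;
        this.columns = columns;
    }

    void accept(String[] row) {
        if (!filter.test(row)) {
            return; // row rejected by this output's condition
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(row[columns[i]]);
        }
        lines.add(sb.toString());
    }
}

class MultiOutputDemo {
    static List<ColumnOutput> run() {
        // Row layout assumed for this sketch: id, title, year
        List<String[]> rows = List.of(
            new String[] {"1", "Alien", "1979"},
            new String[] {"2", "Brazil", "1985"});
        ColumnOutput all = new ColumnOutput(r -> true, 1, 0);                        // title, id
        ColumnOutput eighties = new ColumnOutput(r -> r[2].startsWith("198"), 0, 2); // id, year
        for (String[] row : rows) { // one pass feeds both outputs
            all.accept(row);
            eighties.accept(row);
        }
        return List.of(all, eighties);
    }
}
```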
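To make the idea of a persistent lookup table concrete, here is a sketch that uses only a regular `java.util.HashMap` and standard Java serialization. It does not use the Sisyphus hashtables, whose classes and file format are not described here. `LookupTableSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper: a plain HashMap saved and loaded with Java serialization.
class LookupTableSketch {
    static void save(HashMap<String, String> table, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(table);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    // Builds a table, saves it, reloads it, and performs one lookup on the reloaded copy.
    static String roundTrip() {
        Path file = null;
        try {
            file = Files.createTempFile("lookup", ".bin");
            HashMap<String, String> titleById = new HashMap<>();
            titleById.put("42", "Brazil"); // made-up sample entry
            save(titleById, file);
            return load(file).get("42");
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temporary file
                }
            }
        }
    }
}
```

In a two-push program, the first push would typically fill such a table and a later push would use it for lookups. Saving it to disk lets a separate run skip the first push.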