Outputs - Titousensei/sisyphus GitHub Wiki

Outputs receive the data that is actually kept from each row and store it either in files or in memory. By default, a file output throws an exception at execution time if the file already exists. This prevents a process from overwriting its own data by mistake.
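The overwrite protection can be illustrated with plain JDK calls. This is a minimal sketch of the idea, not sisyphus code: the class name ExclusiveFileSketch, the method writeNew and the sample rows are all hypothetical, and the library may implement the check differently.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical illustration; this class is not part of sisyphus.
final class ExclusiveFileSketch {

    /** Writes the rows to a new file and fails if the file already exists. */
    static void writeNew(Path path, List<String> rows) throws IOException {
        // CREATE_NEW makes the open fail with FileAlreadyExistsException
        // instead of silently truncating an existing file.
        try (BufferedWriter w = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            for (String row : rows) {
                w.write(row);
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("outputs-sketch");
        Path file = dir.resolve("titles.tsv");
        writeNew(file, List.of("1\tAlien\taction"));
        try {
            // Second write to the same path is refused.
            writeNew(file, List.of("2\tAirplane!\tcomedy"));
        } catch (FileAlreadyExistsException e) {
            System.out.println("refused to overwrite " + file.getFileName());
        }
    }
}
```

Failing at open time, before any row is written, means the existing file is left untouched when the check triggers.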

Outputs declare their own schema, which is generally only a subset of the columns of the current row. The columns can be in a different order.
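To make the subset-and-reorder behaviour concrete, here is a small standalone sketch. It does not use the sisyphus API: the class SchemaProjectionSketch, the method project and the column names are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; this class is not part of sisyphus.
final class SchemaProjectionSketch {

    /** Returns the values of the output schema's columns, in output schema order. */
    static String[] project(List<String> rowSchema, String[] row, List<String> outputSchema) {
        String[] out = new String[outputSchema.size()];
        for (int i = 0; i < out.length; i++) {
            // Look up where the requested column sits in the full row.
            int idx = rowSchema.indexOf(outputSchema.get(i));
            if (idx < 0) {
                throw new IllegalArgumentException("unknown column: " + outputSchema.get(i));
            }
            out[i] = row[idx];
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rowSchema = List.of("id", "title", "category", "price");
        String[] row = {"42", "Alien", "action", "9.99"};
        // The output keeps two of the four columns, in a different order.
        String[] kept = project(rowSchema, row, List.of("category", "title"));
        System.out.println(Arrays.toString(kept)); // prints [action, Alien]
    }
}
```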

OutputFile: write each row into a file.

Keys and HashMaps can also be populated using Outputs: OutputKey, OutputKeyMap, OutputKeyBinding, OutputKeyDouble, and OutputHashMap.

OutputFileSplit (abstract): write each row into one of the files of a group. The most used implementation is SplitByOneColumn, where each row goes into the file with the name corresponding to the content of one column. Re-using the same example, one Output could split the DVD titles into one file per "category": titles_action.tsv.gz, titles_comedy.tsv.gz, etc.
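The split-by-one-column idea can be sketched with the JDK alone. This is not the SplitByOneColumn implementation: the class SplitByColumnSketch, its methods, the "titles_" prefix handling and the sample rows are assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; this class is not part of sisyphus.
final class SplitByColumnSketch {

    /** Builds the file name for one value of the split column. */
    static String fileNameFor(String prefix, String value) {
        return prefix + value + ".tsv.gz";
    }

    /** Writes each row to the gzip file named after the content of splitColumn. */
    static void split(Path dir, String prefix, int splitColumn, List<String[]> rows)
            throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try {
            for (String[] row : rows) {
                String name = fileNameFor(prefix, row[splitColumn]);
                BufferedWriter w = writers.get(name);
                if (w == null) {
                    // Open one writer per distinct value, the first time it is seen.
                    w = new BufferedWriter(new OutputStreamWriter(
                            new GZIPOutputStream(Files.newOutputStream(
                                    dir.resolve(name), StandardOpenOption.CREATE_NEW)),
                            StandardCharsets.UTF_8));
                    writers.put(name, w);
                }
                w.write(String.join("\t", row));
                w.newLine();
            }
        } finally {
            // Closing flushes the buffered text and finishes each gzip stream.
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("split-sketch");
        split(dir, "titles_", 2, List.of(
                new String[] {"1", "Alien", "action"},
                new String[] {"2", "Airplane!", "comedy"},
                new String[] {"3", "Heat", "action"}));
        System.out.println(Files.exists(dir.resolve("titles_action.tsv.gz")));
        System.out.println(Files.exists(dir.resolve("titles_comedy.tsv.gz")));
    }
}
```

Keeping one open writer per distinct value avoids reopening files for every row, at the cost of one open file handle per value of the split column.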

OutputSortSplit: used to sort files.

Outputs to files also create a metadata file named ".meta.<filename>", which currently contains only the schema. This file is purely informational for now, but future releases will use it to populate the schema of inputs automatically and to verify column sorting for joins.
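As a rough illustration of the naming convention, the sketch below derives the metadata path from a data file path and writes a schema into it. Only the ".meta.<filename>" name comes from the text above. The placement next to the data file and the one-line, tab-separated content are assumptions, since the actual file format is not described here, and the class MetaFileSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration; this class is not part of sisyphus.
final class MetaFileSketch {

    /** Returns ".meta.<filename>", assumed to sit in the same directory as the data file. */
    static Path metaPathFor(Path dataFile) {
        return dataFile.resolveSibling(".meta." + dataFile.getFileName());
    }

    /** Writes the schema as one tab-separated line (an assumed format). */
    static Path writeMeta(Path dataFile, List<String> schema) throws IOException {
        Path meta = metaPathFor(dataFile);
        Files.write(meta, List.of(String.join("\t", schema)), StandardCharsets.UTF_8);
        return meta;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("meta-sketch");
        Path meta = writeMeta(dir.resolve("titles.tsv.gz"), List.of("id", "title", "category"));
        System.out.println(meta.getFileName()); // prints .meta.titles.tsv.gz
    }
}
```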

Custom outputs can be implemented easily by extending the class OutputCustom. See OutputConcatRows in examples for a class that merges rows.

Previous: Inputs - Next: Keys