Inputs - Titousensei/sisyphus GitHub Wiki

Inputs are the data sources that the Pusher iterates over. The Pusher populates the current row with one entry from a single input at a time, until all the inputs have been processed.
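The iteration described above can be pictured with a minimal sketch. It does not use the Sisyphus API: the class `SequentialInputsSketch`, the method `pushAll`, and the choice to model an input as a plain `Iterator` of rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: none of these names belong to the Sisyphus API.
public class SequentialInputsSketch {

    /**
     * Drains the inputs one after another. At any moment the current row
     * holds exactly one entry taken from exactly one input.
     */
    static List<String> pushAll(List<Iterator<List<String>>> inputs) {
        List<String> processed = new ArrayList<>();
        for (Iterator<List<String>> input : inputs) {
            while (input.hasNext()) {
                List<String> currentRow = input.next();
                // Stand-in for whatever the real Pusher does with the row.
                processed.add(String.join("\t", currentRow));
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Iterator<List<String>>> inputs = List.of(
                List.of(List.of("a", "1"), List.of("b", "2")).iterator(),
                List.of(List.of("c", "3")).iterator());
        // Rows from the first input are printed before rows from the second.
        pushAll(inputs).forEach(System.out::println);
    }
}
```

The second input is only opened once the first one is exhausted, which is the "single input at a time" behaviour described above.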

The most common Inputs read files, usually text files in TSV format, gzipped or not:

  • InputFile: reads a single file, with an option to skip a header.
  • InputFileGroup: matches filenames against a wildcard and reads each matching file serially.
  • InputBinaryFile: reads a single file in binary format, with fixed-length records. Each column can be any number of bytes. Values are presented as decimal or hexadecimal.
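To make the fixed-length binary format concrete, here is a minimal sketch of how such records could be decoded. It is not the Sisyphus implementation: the class `FixedLengthRecordsSketch`, the method `parse`, and the assumption that every column is an unsigned big-endian integer are illustrative choices only.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names belong to the Sisyphus API.
public class FixedLengthRecordsSketch {

    /**
     * Splits data into records of sum(widths) bytes and renders every
     * column as a decimal or hexadecimal string.
     */
    static List<List<String>> parse(byte[] data, int[] widths, boolean hex) {
        int recordLength = Arrays.stream(widths).sum();
        if (recordLength == 0 || data.length % recordLength != 0) {
            throw new IllegalArgumentException("data is not a whole number of records");
        }
        List<List<String>> rows = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            List<String> row = new ArrayList<>();
            int pos = offset;
            for (int width : widths) {
                byte[] column = Arrays.copyOfRange(data, pos, pos + width);
                // Assumption: columns are unsigned, big-endian integers.
                BigInteger value = new BigInteger(1, column);
                row.add(hex ? value.toString(16) : value.toString());
                pos += width;
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two records, each with a 2-byte column followed by a 1-byte column.
        byte[] data = {0x01, 0x00, 0x0A, 0x00, (byte) 0xFF, 0x10};
        System.out.println(parse(data, new int[] {2, 1}, false)); // [[256, 10], [255, 16]]
        System.out.println(parse(data, new int[] {2, 1}, true));  // [[100, a], [ff, 10]]
    }
}
```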

Sisyphus can also iterate through the different types of hashtables: InputKey, InputKeyMap, InputKeyBinding, and InputKeyDouble.

There are also a few special Inputs used for sorting and joining: InputMergeSorted and InputJoinSorted.
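As a rough illustration of what merging sorted inputs involves, the sketch below performs a generic k-way merge with a priority queue. The page does not describe how InputMergeSorted works internally, so this shows the general technique, not the library's code; `MergeSortedSketch`, `Head`, and `merge` are hypothetical names, and rows are reduced to a single string key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names belong to the Sisyphus API.
public class MergeSortedSketch {

    /** The next pending key of one input, together with that input. */
    private static final class Head {
        final String key;
        final Iterator<String> source;

        Head(String key, Iterator<String> source) {
            this.key = key;
            this.source = source;
        }
    }

    /** Merges inputs that are each already sorted into one sorted list. */
    static List<String> merge(List<Iterator<String>> sortedInputs) {
        PriorityQueue<Head> heads =
                new PriorityQueue<>(Comparator.comparing((Head h) -> h.key));
        for (Iterator<String> input : sortedInputs) {
            if (input.hasNext()) {
                heads.add(new Head(input.next(), input));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            // Take the smallest pending key, then refill from the same input.
            Head smallest = heads.poll();
            merged.add(smallest.key);
            if (smallest.source.hasNext()) {
                heads.add(new Head(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Iterator<String>> inputs = List.of(
                List.of("a", "d", "f").iterator(),
                List.of("b", "c", "g").iterator());
        System.out.println(merge(inputs)); // [a, b, c, d, f, g]
    }
}
```

Only one entry per input is held in memory at a time, which is why this technique requires every input to be sorted beforehand.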

Custom inputs can be implemented easily by extending a base class:

  • InputYielder: the simplest way to implement an input, using a generator pattern. See examples.InputRange for a class that generates rows with a counter.
  • InputCustom: for chaining input pre-processing, similar to chaining Java Streams. See examples.InputSplitRows for a class that splits merged rows.
  • Input: of course, you always have the option to extend the most generic base class.
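The two patterns mentioned above, a generator-style input and a chained pre-processing input, can be sketched in plain Java. This is not the Sisyphus API: `YieldingInput`, `RangeInput`, `SplitRowsInput`, and `drain` are hypothetical names, the real base classes may have different methods, and the `|` delimiter for merged rows is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: none of these names belong to the Sisyphus API.
public class GeneratorInputSketch {

    /** Generator-style base: subclasses return the next row, or null when exhausted. */
    abstract static class YieldingInput<T> implements Iterator<T> {
        private T lookahead;
        private boolean done;

        protected abstract T yieldNext();

        @Override
        public boolean hasNext() {
            if (lookahead == null && !done) {
                lookahead = yieldNext();
                done = (lookahead == null);
            }
            return lookahead != null;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T row = lookahead;
            lookahead = null;
            return row;
        }
    }

    /** Generates one row per counter value in [start, end). */
    static final class RangeInput extends YieldingInput<String> {
        private long counter;
        private final long end;

        RangeInput(long start, long end) {
            this.counter = start;
            this.end = end;
        }

        @Override
        protected String yieldNext() {
            return counter < end ? Long.toString(counter++) : null;
        }
    }

    /** Chains onto another input and splits each merged row on '|' (assumed delimiter). */
    static final class SplitRowsInput extends YieldingInput<String> {
        private final Iterator<String> upstream;
        private final Deque<String> pending = new ArrayDeque<>();

        SplitRowsInput(Iterator<String> upstream) {
            this.upstream = upstream;
        }

        @Override
        protected String yieldNext() {
            while (pending.isEmpty() && upstream.hasNext()) {
                pending.addAll(Arrays.asList(upstream.next().split("\\|")));
            }
            return pending.poll(); // null once upstream and buffer are both empty
        }
    }

    /** Collects every row of an input into a list. */
    static List<String> drain(Iterator<String> input) {
        List<String> rows = new ArrayList<>();
        input.forEachRemaining(rows::add);
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(drain(new RangeInput(3, 6)));                                  // [3, 4, 5]
        System.out.println(drain(new SplitRowsInput(List.of("a|b", "c").iterator())));    // [a, b, c]
    }
}
```

Because `SplitRowsInput` accepts any `Iterator<String>`, it can wrap another input, including a second `SplitRowsInput`, which is the chaining idea in miniature.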

Previous: Schemas - Next: Outputs