Sorting - Titousensei/sisyphus GitHub Wiki

One important tool of data processing is sorting: once the data is sorted, it becomes easy to do special processing that involves some kind of grouping or joining.

Sisyphus provides the sorting functionality as an Output, in the form of OutputSortSplit. The data is split into large sorted chunks, (sorting is done in memory before writing on disk,) and saved into individual files with the same schema.

The user can define the memory allocation (typically enough for 1-10M rows), and the comparator to use: RowComparator (the default, for String comparisons) and RowComparatorInteger are provided, but custom comparators can be easily implemented.

The result is a group of files, each individually sorted. In most cases, these files will be used immediately for further processing, such as joins. Sometimes, we just want to merge the files and get one big sorted file. In both cases, we will use InputMergeSorted, which will open all the splits at the same time, and read the row in order (assuming the same comparator is used).

This one-and-a-half-step sorting technique allows Sisyphus to sort very large files quite fast. Most of the time, the split files are temporary files and can be discarded.

OutputSortSplit has a neat feature for data that is already partially sorted: the biggest value of the previous chunk is remembered, and it will keep appending to the same file as long as it can if the data is bigger than than the biggest value of the previous chunk. This way, partially sorted data will have less splits.

OutputSortSplit.sortOneFile() is a utility method implementing the simplest and most used case of sorting just one file on some columns. It's equivalent to the following code:

Input in_rows = new InputFile(input_file, file_schema);
Output sort = new OutputSortSplit(output_file, batch_size, sort_schema, file_schema);
new Pusher()
    .always(sort)
    .push(in_rows);

Previous: Break Actions - Next: Joining