Joining - Titousensei/sisyphus GitHub Wiki

In the SQL world, even simple lookups are expressed as joins. In Sisyphus, lookups are done using Keys or HashMaps. Only multi-column joins need to be expressed as joins.
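To illustrate why a single-column lookup does not need a join, here is a minimal sketch in plain Java. It does not use any Sisyphus classes: `LookupSketch`, `lookupName`, and the sample data are hypothetical names chosen for this illustration, and a `java.util.HashMap` stands in for whatever lookup structure the pipeline would actually use.

```java
import java.util.HashMap;
import java.util.Map;

class LookupSketch {
    // Hypothetical helper: resolve an id to a name, or "" when the id is unknown.
    static String lookupName(Map<Integer, String> names, int id) {
        return names.getOrDefault(id, "");
    }

    public static void main(String[] args) {
        // The lookup table is built once, then consulted for every row.
        Map<Integer, String> names = new HashMap<>();
        names.put(1, "alice");
        names.put(2, "bob");

        // The rows do not have to be sorted for this kind of lookup.
        int[] ids = {2, 1, 3};
        for (int id : ids) {
            System.out.println(id + " -> " + lookupName(names, id));
        }
    }
}
```

Because the table is consulted by key, the rows being processed can arrive in any order, unlike the sorted joins described on this page.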

Joins in Sisyphus can only be performed on inputs sorted on the same columns. Data sets (files or keys) can be sorted using OutputSortSplit (see Sorting). Note that InputMergeSorted can be used directly by the join, so it is not necessary to merge the split files into one. To summarize, a join takes two steps: first sort the data, then join it.

InputJoinSorted(String[] join_schema, Input... inputs) loads rows from each input as necessary and performs an inner join, according to the join_schema, into the current row. The columns in the join_schema appear only once in the current row. If different inputs have columns with the same name, only the value from the first input (in order of the inputs declaration) is used for the current row. (This behavior might change in the future.)
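The reason sorted inputs are required can be seen in a generic sort-merge inner join. The sketch below is plain Java and is not the Sisyphus implementation: `SortedInnerJoinSketch` and `innerJoin` are hypothetical names, rows are modeled as `int[]` with the join key in column 0, only two inputs are joined, and each input is assumed to hold at most one row per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedInnerJoinSketch {
    // Both lists must be sorted ascending on column 0 (the join key).
    // Assumption for this sketch: at most one row per key in each input.
    static List<int[]> innerJoin(List<int[]> left, List<int[]> right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int[] l = left.get(i);
            int[] r = right.get(j);
            if (l[0] < r[0]) {
                i++; // left key is smaller: it cannot match any later right row
            } else if (l[0] > r[0]) {
                j++; // right key is smaller: it cannot match any later left row
            } else {
                // Keys match: keep the key once, then append the other columns.
                int[] row = new int[l.length + r.length - 1];
                System.arraycopy(l, 0, row, 0, l.length);
                System.arraycopy(r, 1, row, l.length, r.length - 1);
                out.add(row);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> left = Arrays.asList(
            new int[] {1, 10, 11}, new int[] {2, 20, 21}, new int[] {4, 40, 41});
        List<int[]> right = Arrays.asList(
            new int[] {2, 200}, new int[] {3, 300}, new int[] {4, 400});
        for (int[] row : innerJoin(left, right)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each input is read once, front to back, and only the current row of each input is held in memory, which is why the inputs must already be sorted on the join columns.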

Example: InputJoinSorted

input1 -> [id, a1, b1]
input2 -> [id, a2]
input3 -> [a3, id]
join_schema -> [id]
[Pusher] ... row schema: [id, a1, b1, a2, a3]
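The row schema in the example above can be reproduced with a small sketch. This is plain Java, not Sisyphus code: `JoinSchemaSketch` and `mergedSchema` are hypothetical names, and the sketch assumes that columns are ordered by their first appearance across the inputs, which matches the example but is not stated by the wiki.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class JoinSchemaSketch {
    // Merge input schemas so that each column name appears only once.
    // Assumption: columns keep the order in which they first appear.
    static List<String> mergedSchema(List<List<String>> inputSchemas) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> schema : inputSchemas) {
            seen.addAll(schema); // a name that was already added is skipped
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
            Arrays.asList("id", "a1", "b1"),
            Arrays.asList("id", "a2"),
            Arrays.asList("a3", "id"));
        System.out.println(mergedSchema(inputs));
    }
}
```

With the three schemas from the example, the join column `id` is contributed by the first input and skipped for the others, so it appears once in the result.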

InputLeftJoinSorted performs a left outer join. The first input is the "left" input, and all its rows will be present. The remaining inputs are "right" inputs, and their columns will be empty in rows where they don't join.
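The left outer join behavior can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Sisyphus implementation: `SortedLeftJoinSketch` and `leftJoin` are hypothetical names, rows are `String[]` with the join key in column 0, the right input has at most one row per key, and an empty value is represented as `""` (the wiki does not say how empty values are represented).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedLeftJoinSketch {
    // Both lists must be sorted on column 0 using String ordering.
    // rightWidth is the number of columns in a right row, key included.
    static List<String[]> leftJoin(List<String[]> left, List<String[]> right, int rightWidth) {
        List<String[]> out = new ArrayList<>();
        int j = 0;
        for (String[] l : left) {
            // Skip right rows whose key sorts before the current left key.
            while (j < right.size() && right.get(j)[0].compareTo(l[0]) < 0) {
                j++;
            }
            String[] row = new String[l.length + rightWidth - 1];
            System.arraycopy(l, 0, row, 0, l.length);
            if (j < right.size() && right.get(j)[0].equals(l[0])) {
                // Match: append the right row's non-key columns.
                System.arraycopy(right.get(j), 1, row, l.length, rightWidth - 1);
            } else {
                // No match: the left row is still emitted, right columns stay empty.
                Arrays.fill(row, l.length, row.length, "");
            }
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"c", "3"});
        List<String[]> right = Arrays.asList(
            new String[] {"b", "x"}, new String[] {"d", "y"});
        for (String[] row : leftJoin(left, right, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Every left row produces exactly one output row in this sketch, whether or not the right input has a matching key.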

InputSelfJoinSorted performs a self join of one input, while loading the data only once. The schema of the current row will use the column names with the prefix "r." for the "right" version of the input.

Example: InputSelfJoinSorted

input1 -> [id, a1, b1]
join_schema -> [id]
[Pusher] ... row schema: [id, a1, b1, r.a1, r.b1]
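The self join above can be sketched in plain Java. This is not Sisyphus code: `SortedSelfJoinSketch`, `selfJoinSchema`, and `selfJoin` are hypothetical names, and rows are `int[]` with the join key in column 0. The wiki does not say which pairs of rows are emitted, so this sketch assumes every ordered pair of rows sharing a key, including a row paired with itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedSelfJoinSketch {
    // Output schema: the original columns, then every non-join column
    // again with the "r." prefix for the "right" version of the input.
    static List<String> selfJoinSchema(List<String> schema, List<String> joinSchema) {
        List<String> out = new ArrayList<>(schema);
        for (String col : schema) {
            if (!joinSchema.contains(col)) {
                out.add("r." + col);
            }
        }
        return out;
    }

    // rows must be sorted on column 0; the list is scanned once, one key group at a time.
    static List<int[]> selfJoin(List<int[]> rows) {
        List<int[]> out = new ArrayList<>();
        int start = 0;
        while (start < rows.size()) {
            // Find the end of the run of rows sharing the current key.
            int end = start;
            while (end < rows.size() && rows.get(end)[0] == rows.get(start)[0]) {
                end++;
            }
            // Assumption: emit every ordered pair within the group.
            for (int i = start; i < end; i++) {
                for (int j = start; j < end; j++) {
                    int[] l = rows.get(i);
                    int[] r = rows.get(j);
                    int[] row = new int[l.length + r.length - 1];
                    System.arraycopy(l, 0, row, 0, l.length);
                    System.arraycopy(r, 1, row, l.length, r.length - 1);
                    out.add(row);
                }
            }
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(selfJoinSchema(
            Arrays.asList("id", "a1", "b1"), Arrays.asList("id")));
        List<int[]> rows = Arrays.asList(
            new int[] {1, 10, 11}, new int[] {1, 12, 13}, new int[] {2, 20, 21});
        for (int[] row : selfJoin(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because rows with the same key are adjacent in sorted input, only one key group needs to be buffered at a time, which is consistent with loading the data only once.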

Previous: Sorting - Next: Other Useful Tools