Join Operator - Texera/texera GitHub Wiki
Author: Sripad Kowshik Subramanyam
Implement an operator that takes two operators as the input and joins their tuples based on constraints specified using a predicate.
As of 9/25/2016: COMPLETED
edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.join
https://github.com/Texera/texera/issues/111
The Join operator joins the results of two other operators on a given field, subject to the constraints specified in a join predicate. The field to join on and the constraints to be satisfied are specified using `JoinPredicate`. The `getNextTuple()` method returns the next result of the operator.
Currently supported predicates are:
- `JoinDistancePredicate`: Takes the attribute that specifies the ID, the attribute of the field to join on, and a distance threshold. Two spans of the join field are joined if the distance between them is within the threshold.
The following setup and examples show how to use `JoinDistancePredicate` (consider the two tuples to come from two different operators).
| | id | author | review | spanList |
|---|---|---|---|---|
| tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | `"book":<6,11>` |
| tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | `"gives":<12, 18>`, `"us":<19, 22>` |
Here, `<spanStartIndex, spanEndIndex>` represents a span.
To join over the review attribute with a distance threshold of 10 characters, we can write:
JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);
Since both tuples have the same ID, we can perform the join on the two span lists.
The span distance is computed as:
`|(span 1 spanStartIndex) - (span 2 spanStartIndex)|` OR `|(span 1 spanEndIndex) - (span 2 spanEndIndex)|`
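The distance check can be illustrated with a minimal, self-contained sketch. It is not the Texera implementation: the class and method names are hypothetical, and it assumes that both the start-index and the end-index differences must be within the threshold. The formula above leaves open whether one of the two would be enough; for the spans in the table, both readings give the same answer.

```java
// Hypothetical helper for illustration; not part of the Texera code base.
class SpanDistanceSketch {

    // Absolute difference between the two spans' start indices.
    static int startDistance(int start1, int start2) {
        return Math.abs(start1 - start2);
    }

    // Absolute difference between the two spans' end indices.
    static int endDistance(int end1, int end2) {
        return Math.abs(end1 - end2);
    }

    // Assumption: both differences must be within the threshold.
    static boolean withinThreshold(int start1, int end1,
                                   int start2, int end2, int threshold) {
        return startDistance(start1, start2) <= threshold
                && endDistance(end1, end2) <= threshold;
    }

    public static void main(String[] args) {
        // "book":<6,11> vs "gives":<12,18>: differences are 6 and 7
        System.out.println(withinThreshold(6, 11, 12, 18, 10)); // prints true
        // "book":<6,11> vs "us":<19,22>: differences are 13 and 11
        System.out.println(withinThreshold(6, 11, 19, 22, 10)); // prints false
    }
}
```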
Upon performing Join on the above two tuples, we get:
- The span `"book":<6,11>` from tuple1 and the span `"gives":<12, 18>` from tuple2 satisfy the condition distance <= threshold. Therefore, the join combines the two spans into a new span `"book_gives":<6, 18>`.
- The span `"book":<6,11>` from tuple1 and the span `"us":<19, 22>` from tuple2 don't satisfy the condition, so they are not joined.
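How two matching spans are combined can be sketched as follows. This is a hedged illustration rather than the operator's actual code: the `Span` class here is a hypothetical stand-in, and the sketch assumes the combined span runs from the smaller start index to the larger end index, with the two keys joined by an underscore. Those assumptions are consistent with the `"book_gives":<6, 18>` result above.

```java
// Hypothetical types for illustration; not the Texera Span class.
class SpanMergeSketch {

    static final class Span {
        final String key;
        final int start;
        final int end;

        Span(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + key + "\":<" + start + ", " + end + ">";
        }
    }

    // Combine two spans that have already satisfied the join predicate.
    // Assumption: the result covers both inputs (min start, max end)
    // and its key is the two input keys joined with "_".
    static Span combine(Span a, Span b) {
        return new Span(a.key + "_" + b.key,
                Math.min(a.start, b.start),
                Math.max(a.end, b.end));
    }

    public static void main(String[] args) {
        Span book = new Span("book", 6, 11);
        Span gives = new Span("gives", 12, 18);
        System.out.println(combine(book, gives)); // prints "book_gives":<6, 18>
    }
}
```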
Possible future improvements:
- Sort the spans of the results to improve the performance of the operator.
- Implement other kinds of predicates to increase the robustness and utility of the operator.