Join Operator - Texera/texera GitHub Wiki

Author: Sripad Kowshik Subramanyam

Synopsis

Implement an operator that takes two operators as input and joins their output tuples based on constraints specified in a join predicate.

Status

As of 9/25/2016: COMPLETED

Modules

edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.join

Related Issues

https://github.com/Texera/texera/issues/111

Description

The Join Operator joins the results of two input operators on a specified field, subject to the constraints given in a join predicate. Both the field to join on and the constraints to satisfy are specified through a JoinPredicate. Call getNextTuple() to retrieve the operator's next result.
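To make the pull-based flow concrete, here is a minimal sketch of a join that consumes two inputs through getNextTuple(). Everything except the method name getNextTuple() is hypothetical: SketchOperator, ListSource, SketchJoin, and SketchJoinDemo are illustrative stand-ins, not Texera classes. The sketch also assumes that an exhausted operator returns null and that a simple nested-loop strategy is used; neither is confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical pull-based operator; assumed to return null once exhausted.
interface SketchOperator<T> {
    T getNextTuple();
}

// Wraps an in-memory list so the sketch has something to pull from.
final class ListSource<T> implements SketchOperator<T> {
    private final Iterator<T> it;

    ListSource(List<T> tuples) {
        this.it = tuples.iterator();
    }

    @Override
    public T getNextTuple() {
        return it.hasNext() ? it.next() : null;
    }
}

// Nested-loop join: buffers the inner input, then pairs each outer tuple with it.
final class SketchJoin<T> implements SketchOperator<T> {
    private final SketchOperator<T> outer;
    private final List<T> innerBuffer = new ArrayList<>();
    // Stand-in for a join predicate: returns the joined tuple, or null if the pair does not join.
    private final BiFunction<T, T, T> predicate;
    private T currentOuter;
    private int innerPos;

    SketchJoin(SketchOperator<T> outer, SketchOperator<T> inner, BiFunction<T, T, T> predicate) {
        this.outer = outer;
        this.predicate = predicate;
        for (T t = inner.getNextTuple(); t != null; t = inner.getNextTuple()) {
            innerBuffer.add(t);
        }
        this.currentOuter = outer.getNextTuple();
    }

    @Override
    public T getNextTuple() {
        while (currentOuter != null) {
            while (innerPos < innerBuffer.size()) {
                T joined = predicate.apply(currentOuter, innerBuffer.get(innerPos++));
                if (joined != null) {
                    return joined;
                }
            }
            // Inner buffer exhausted for this outer tuple: advance and rescan.
            currentOuter = outer.getNextTuple();
            innerPos = 0;
        }
        return null;
    }
}

final class SketchJoinDemo {
    // Toy predicate over "id:word" strings: join only when the IDs match.
    static final BiFunction<String, String, String> SAME_ID = (a, b) -> {
        String[] left = a.split(":");
        String[] right = b.split(":");
        return left[0].equals(right[0]) ? a + "_" + right[1] : null;
    };

    public static void main(String[] args) {
        SketchOperator<String> join = new SketchJoin<>(
                new ListSource<>(Arrays.asList("58:book", "59:cat")),
                new ListSource<>(Arrays.asList("58:gives", "60:dog")),
                SAME_ID);
        System.out.println(join.getNextTuple()); // 58:book_gives
        System.out.println(join.getNextTuple()); // null
    }
}
```

Buffering the inner input keeps the sketch short; it trades memory for simplicity and says nothing about how the real operator manages its inputs.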

Currently supported predicates are:

  • JoinDistancePredicate: Takes an ID attribute, the attribute of the field to join on, and a distance threshold. Two spans of that field, one from each input's result, are joined if the distance between them is within the threshold.

Example

The following setting illustrates how to use JoinDistancePredicate (assume the two tuples come from two different operators).

| tuple | id | author | review | spanList |
|---|---|---|---|---|
| tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "book":<6,11> |
| tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "gives":<12, 18>, "us":<19, 22> |

Here, <spanStartIndex, spanEndIndex> represents a span.

To join over the review attribute with a distance threshold of 10 characters, we can write:

JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);

Since both tuples have the same ID, we can perform the join on the two span lists.

The span distance is computed as:

|(span 1 spanStartIndex) - (span 2 spanStartIndex)| OR |(span 1 spanEndIndex) - (span 2 spanEndIndex)|
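As a worked check, the formula can be applied to the example spans with a threshold of 10. Both the start-index and end-index differences are shown for each pair, since the formula allows either; for these spans the two agree on whether the pair falls within the threshold.

```latex
\begin{aligned}
\text{"book" vs. "gives":}\quad & |6 - 12| = 6 \le 10, \qquad |11 - 18| = 7 \le 10 \\
\text{"book" vs. "us":}\quad & |6 - 19| = 13 > 10, \qquad |11 - 22| = 11 > 10
\end{aligned}
```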

Upon performing Join on the above two tuples, we get:

  1. The span "book":<6,11> from tuple1 and the span "gives":<12, 18> from tuple2 satisfy the condition distance <= threshold. Therefore, the join combines the two spans into a new span "book_gives":<6, 18>.

  2. The span "book":<6,11> from tuple1 and the span "us":<19, 22> from tuple2 don't satisfy the condition, so they will not be joined.
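The two outcomes above can be reproduced with a small, self-contained sketch. SketchSpan and SpanJoinSketch are hypothetical names introduced for illustration, not Texera classes. Two points are assumptions rather than statements from this page: a pair is treated as within the threshold only when both the start-index and end-index differences are within it (the example gives the same result if either difference alone is used), and the combined span runs from the smaller start index to the larger end index, which matches "book_gives":<6, 18>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a span; holds only what the example needs.
final class SketchSpan {
    final String key;
    final int start;
    final int end;

    SketchSpan(String key, int start, int end) {
        this.key = key;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return "\"" + key + "\":<" + start + ", " + end + ">";
    }
}

final class SpanJoinSketch {
    // Assumption: both index differences must be within the threshold.
    static boolean withinThreshold(SketchSpan a, SketchSpan b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                && Math.abs(a.end - b.end) <= threshold;
    }

    // Assumption: the combined span covers both inputs; keys are joined with "_".
    static SketchSpan combine(SketchSpan a, SketchSpan b) {
        return new SketchSpan(a.key + "_" + b.key,
                Math.min(a.start, b.start), Math.max(a.end, b.end));
    }

    // Nested-loop join over the span lists of two tuples that share an ID.
    static List<SketchSpan> join(List<SketchSpan> outer, List<SketchSpan> inner, int threshold) {
        List<SketchSpan> result = new ArrayList<>();
        for (SketchSpan o : outer) {
            for (SketchSpan i : inner) {
                if (withinThreshold(o, i, threshold)) {
                    result.add(combine(o, i));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SketchSpan> tuple1Spans = Arrays.asList(new SketchSpan("book", 6, 11));
        List<SketchSpan> tuple2Spans = Arrays.asList(
                new SketchSpan("gives", 12, 18), new SketchSpan("us", 19, 22));
        System.out.println(join(tuple1Spans, tuple2Spans, 10)); // ["book_gives":<6, 18>]
    }
}
```

The nested loop compares every span in one list against every span in the other, which is the simplest way to express the check.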

TODOs

  • Implement sorting of spans of the results in order to improve the performance of the operator.
  • Implement other kinds of predicates to increase the robustness and utility of the operator.