Keyword Match Operator - Texera/texera GitHub Wiki

Authors: Akshay Jain, Prakul Agarwal

Reviewers: Chen Li

Synopsis

Implement an operator that wraps Lucene's search capability to support keyword and phrase search.

Status

As of 5/30/2016: COMPLETED

Modules:

edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.keywordmatch

Related Issues:

https://github.com/Texera/texera/issues/31

Description

The Keyword Operator performs Keyword Search and Phrase Search. It follows an iterator-based design: call getNextTuple() to retrieve the next result.
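The iterator-based design can be illustrated with a minimal sketch. Everything named here (`TupleSource`, `ToyKeywordSource`, `drain`) is hypothetical and is not the Texera interface; the sketch also assumes that getNextTuple() signals exhaustion by returning null, and it uses plain strings in place of tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class IteratorDesignSketch {

    // Hypothetical stand-in for an iterator-style operator; not the Texera interface.
    interface TupleSource {
        void open();
        String getNextTuple(); // assumed convention: null means no more results
        void close();
    }

    // Toy operator that emits only the documents containing the keyword.
    static final class ToyKeywordSource implements TupleSource {
        private final List<String> documents;
        private final String keyword;
        private Iterator<String> cursor;

        ToyKeywordSource(List<String> documents, String keyword) {
            this.documents = documents;
            this.keyword = keyword.toLowerCase(Locale.ROOT);
        }

        @Override
        public void open() {
            cursor = documents.iterator();
        }

        @Override
        public String getNextTuple() {
            if (cursor == null) {
                throw new IllegalStateException("operator is not open");
            }
            // Skip non-matching documents; return the next match, one per call.
            while (cursor.hasNext()) {
                String doc = cursor.next();
                if (doc.toLowerCase(Locale.ROOT).contains(keyword)) {
                    return doc;
                }
            }
            return null;
        }

        @Override
        public void close() {
            cursor = null;
        }
    }

    // Drains an operator with the usual open / getNextTuple / close loop.
    static List<String> drain(TupleSource source) {
        List<String> results = new ArrayList<>();
        source.open();
        try {
            String tuple;
            while ((tuple = source.getNextTuple()) != null) {
                results.add(tuple);
            }
        } finally {
            source.close();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("History of medicine", "Lung cancer", "Sports medicine");
        System.out.println(drain(new ToyKeywordSource(docs, "medicine")));
    }
}
```

The caller pulls one result at a time, so the operator never has to materialize the full result set before the first tuple is consumed.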

Keyword Search:

It takes a Keyword Predicate as input, with the query type set to KeywordOperator.BASIC. It uses IndexBasedScanOperator, which returns a superset of the desired results. The operator then filters these results and updates the Span information accordingly.
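The filter-and-span step might look roughly like the sketch below. It makes several assumptions that the text above does not confirm: a candidate is kept only if it contains every query token (conjunctive matching), tokens are found with a simple `\w+` regex rather than a Lucene analyzer, and the `Span` class is a hypothetical simplification holding character offsets and the matched text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BasicKeywordSketch {

    // Hypothetical span: character offsets [start, end) of one matched token.
    static final class Span {
        final int start;
        final int end;
        final String value;

        Span(int start, int end, String value) {
            this.start = start;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return value + "[" + start + "," + end + ")";
        }
    }

    // Simplified tokenizer; a real implementation would reuse the index analyzer.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the spans of all query tokens found in the text, or an empty
    // list if at least one query token is missing (candidate is filtered out).
    static List<Span> match(String text, String query) {
        Set<String> wanted = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        Set<String> seen = new HashSet<>();
        List<Span> spans = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (wanted.contains(token)) {
                seen.add(token);
                spans.add(new Span(m.start(), m.end(), m.group()));
            }
        }
        return seen.containsAll(wanted) ? spans : new ArrayList<Span>();
    }

    public static void main(String[] args) {
        // Stand-in for the superset returned by the index scan.
        List<String> candidates = Arrays.asList(
                "History of medicine in Europe", "Sports medicine");
        for (String text : candidates) {
            List<Span> spans = match(text, "medicine history");
            if (!spans.isEmpty()) {
                System.out.println(text + " -> " + spans);
            }
        }
    }
}
```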

Phrase Search:

It takes a Keyword Predicate as input, with the query type set to KeywordOperator.PHRASE. It uses IndexBasedScanOperator. Using the results and Span information from the IndexBasedScanOperator, it extracts the exact text from the document and updates the Span information accordingly.
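As a rough illustration of the phrase case, the sketch below looks for query tokens that occur consecutively and in order, then builds one span covering the whole phrase with the exact text copied from the document. The `Span` class, the `matchPhrase` helper, and the regex tokenizer are assumptions made for this example; the actual operator works from the Span information supplied by IndexBasedScanOperator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseSearchSketch {

    // Hypothetical span covering a whole phrase occurrence, offsets [start, end).
    static final class Span {
        final int start;
        final int end;
        final String value;

        Span(int start, int end, String value) {
            this.start = start;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return "\"" + value + "\"[" + start + "," + end + ")";
        }
    }

    // Simplified tokenizer; a real implementation would reuse the index analyzer.
    private static final Pattern WORD = Pattern.compile("\\w+");

    static List<Span> matchPhrase(String text, String phrase) {
        String[] query = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");

        // Tokenize the document, remembering each token's character offsets.
        List<String> tokens = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
            offsets.add(new int[] {m.start(), m.end()});
        }

        // Slide a window of query length over the tokens and compare in order.
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i + query.length <= tokens.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < query.length; j++) {
                if (!tokens.get(i + j).equals(query[j])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                int start = offsets.get(i)[0];
                int end = offsets.get(i + query.length - 1)[1];
                // The span value is the exact document text, not the query string.
                spans.add(new Span(start, end, text.substring(start, end)));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchPhrase("The Medicine History of history medicine", "medicine history"));
    }
}
```

Note that the span value keeps the document's original casing and punctuation, which is why it is copied from the text rather than from the query.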

Performance Test

Machine configuration: MacBook Pro (Early 2011), 2.3 GHz Intel Core i5, 4 GB 1333 MHz DDR3

  • Dataset: 100k MEDLINE records
  • Performance results for KeywordMatcher with KeywordOperatorType.BASIC:

Index time: 59.8610 seconds.

  • Query : "medicine"

Lucene Query time: 1.3160 seconds.

Match time: 10.7240 seconds.

Total: 539 results.

  • Query : "medicine history"

Lucene Query time: 1.8380 seconds

Match time: 0.4580 seconds

Total: 23 results

  • Dataset: 1 million MEDLINE records
  • Performance results for KeywordMatcher with KeywordOperatorType.BASIC:

Index time: 655.8610 seconds

  • Query : "medicine"

Lucene Query time: 4.8050 seconds

Match time: 7192.9650 seconds

Total: 9114 results

  • Query : "medicine history"

Lucene Query time: 6.8490 seconds

Match time: 18.4780 seconds

Total: 514 results