Keyword Match Operator - Texera/texera GitHub Wiki
Authors: Akshay Jain, Prakul Agarwal
Reviewers: Chen Li
Synopsis
Implement an operator that wraps Lucene's search capability to support keyword and phrase search.
Status
As of 5/30/2016: COMPLETED
Modules:
edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.keywordmatch
Related Issues:
https://github.com/Texera/texera/issues/31
Description
The Keyword Operator performs keyword search and phrase search. It follows an iterator-based design: call getNextTuple() to retrieve the next result.
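As a rough illustration of the iterator-based design, the sketch below shows how a caller might pull results one at a time. Only getNextTuple() is named on this page; the TupleSource interface, the open() and close() methods, the use of String as a tuple, and the convention that null signals the end of the results are all assumptions made for the example, not the actual Texera API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an operator; only getNextTuple() is named on this page.
interface TupleSource {
    void open();
    String getNextTuple(); // assumed to return null once the results are exhausted
    void close();
}

// Toy operator that serves tuples from an in-memory list.
class ListBackedSource implements TupleSource {
    private final List<String> tuples;
    private Iterator<String> cursor;

    ListBackedSource(List<String> tuples) {
        this.tuples = tuples;
    }

    @Override
    public void open() {
        cursor = tuples.iterator();
    }

    @Override
    public String getNextTuple() {
        if (cursor == null) {
            throw new IllegalStateException("open() was not called");
        }
        return cursor.hasNext() ? cursor.next() : null;
    }

    @Override
    public void close() {
        cursor = null;
    }
}

class IteratorSketch {
    // Pulls every tuple out of the operator, one getNextTuple() call at a time.
    static List<String> drain(TupleSource source) {
        List<String> results = new ArrayList<>();
        source.open();
        try {
            String tuple;
            while ((tuple = source.getNextTuple()) != null) {
                results.add(tuple);
            }
        } finally {
            source.close();
        }
        return results;
    }
}
```

The pull-based loop means a consumer can stop early without forcing the operator to compute every remaining result.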
Keyword Search:
It takes a Keyword Predicate as input, with the query type set to KeywordOperator.BASIC. It uses IndexBasedScanOperator, which returns a superset of the desired results. It then filters these results and updates the Span information accordingly.
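The filtering step can be pictured with a minimal sketch: given the text of a candidate document returned by the index scan, locate each query keyword and record its character offsets. The KeywordSpan class, the regex-based tokenization, the case-insensitive comparison, and the rule that every keyword must appear for the document to be kept are assumptions for illustration; the real operator's Span class and matching rules may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical span record: character offsets of one keyword occurrence.
class KeywordSpan {
    final int start;      // inclusive character offset
    final int end;        // exclusive character offset
    final String keyword; // lower-cased keyword that matched

    KeywordSpan(int start, int end, String keyword) {
        this.start = start;
        this.end = end;
        this.keyword = keyword;
    }
}

class KeywordFilterSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the spans of all query keywords found in text, or an empty list
    // when at least one keyword is missing (assumed conjunctive semantics).
    static List<KeywordSpan> match(String text, String query) {
        Set<String> keywords = new HashSet<>();
        for (String k : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            keywords.add(k);
        }

        List<KeywordSpan> spans = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (keywords.contains(token)) {
                spans.add(new KeywordSpan(m.start(), m.end(), token));
                seen.add(token);
            }
        }
        // Drop the candidate if any keyword never appeared.
        return seen.containsAll(keywords) ? spans : new ArrayList<>();
    }
}
```

A candidate that yields an empty list would be discarded; otherwise the spans would be attached to the tuple before it is returned.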
Phrase Search:
It takes a Keyword Predicate as input, with the query type set to KeywordOperator.PHRASE. It uses IndexBasedScanOperator. Using the results and Span information from the IndexBasedScanOperator, it extracts the exact text from the document and updates the Span information accordingly.
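A minimal sketch of the phrase case follows. Note that it re-tokenizes the document text with a regex rather than reusing Span information from the index scan as described above, so it illustrates the outcome (one span per phrase occurrence, carrying the exact text from the document) and not the operator's actual mechanism. PhraseSpan and PhraseMatchSketch are hypothetical names, and case-insensitive token comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical span record covering one whole phrase occurrence.
class PhraseSpan {
    final int start;          // inclusive offset of the first token
    final int end;            // exclusive offset of the last token
    final String matchedText; // exact text copied from the document

    PhraseSpan(int start, int end, String matchedText) {
        this.start = start;
        this.end = end;
        this.matchedText = matchedText;
    }
}

class PhraseMatchSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Finds runs of consecutive tokens in text that equal the phrase tokens.
    static List<PhraseSpan> match(String text, String phrase) {
        String[] wanted = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");

        // Record {start, end} offsets for every token in the document.
        List<int[]> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }

        List<PhraseSpan> spans = new ArrayList<>();
        for (int i = 0; i + wanted.length <= tokens.size(); i++) {
            boolean ok = true;
            for (int j = 0; j < wanted.length && ok; j++) {
                int[] t = tokens.get(i + j);
                String token = text.substring(t[0], t[1]).toLowerCase(Locale.ROOT);
                ok = token.equals(wanted[j]);
            }
            if (ok) {
                int start = tokens.get(i)[0];
                int end = tokens.get(i + wanted.length - 1)[1];
                spans.add(new PhraseSpan(start, end, text.substring(start, end)));
            }
        }
        return spans;
    }
}
```

Unlike the keyword case, token order and adjacency matter here: a document containing "history of medicine" would not match the phrase "medicine history".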
Performance Test
Machine configuration: MacBook Pro (Early 2011), 2.3 GHz Intel Core i5, 4 GB 1333 MHz DDR3
- Dataset: 100k MEDLINE records
- Performance results for KeywordMatcher with KeywordOperatorType.BASIC:
Index time: 59.8610 seconds
- Query: "medicine"
Lucene Query time: 1.3160 seconds
Match time: 10.7240 seconds
Total: 539 results
- Query: "medicine history"
Lucene Query time: 1.8380 seconds
Match time: 0.4580 seconds
Total: 23 results
- Dataset: 1 million MEDLINE records
- Performance results for KeywordMatcher with KeywordOperatorType.BASIC:
Index time: 655.8610 seconds
- Query: "medicine"
Lucene Query time: 4.8050 seconds
Match time: 7192.9650 seconds
Total: 9114 results
- Query: "medicine history"
Lucene Query time: 6.8490 seconds
Match time: 18.4780 seconds
Total: 514 results