Dictionary Based Matcher - Texera/texera GitHub Wiki

Authors: Sandeep Reddy Madugala, Sudeep Meduri, and Rajesh Yarlagadda

Reviewers: Chen Li

Synopsis

Lucene already provides basic functionality for keyword search and phrase search. We built the Dictionary Matcher feature on top of these existing features.

The purpose of the Dictionary Matcher is to let users perform multiple phrase searches at once.

Status

As of 5/25/2016: COMPLETED

Modules

edu.uci.ics.texera.dataflow.dictionarymatcher

edu.uci.ics.texera.dataflow.common

edu.uci.ics.texera.dataflow.keywordmatch

Related Issues

[Issue #90] (Team -1) - Add Keyword based and Phrase Based Dictionary Matcher

[Issue #53] (Team -1) - Design a Dictionary class for the DictionaryMatcher

[Issue #52] (Team -1) - Implement a "Span" class

[Issue #37] (Team -1) - Design: Dictionary Matcher Operator

Description

DictionaryMatcher performs a scan-based, keyword-based, or phrase-based search, depending on the source operator type: it takes each dictionary value and searches the documents for matches. Presently, two KeywordOperatorTypes are supported (BASIC and PHRASE, as used by the keyword and phrase source operators described in this section).

Three kinds of source operators are considered:

  • SCANOPERATOR
  • KEYWORDOPERATOR
  • PHRASEOPERATOR
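The relationship between the three source operator types and the two keyword operator types can be sketched as a small lookup. This is only an illustration: the enums are local stand-ins named after the identifiers used on this page, and the class `SourceOperatorDispatch` and method `keywordOperatorFor` are hypothetical, not part of the Texera code base.

```java
import java.util.Optional;

// Local stand-ins for illustration; the real Texera enums may be declared differently.
enum SourceOperatorType { SCANOPERATOR, KEYWORDOPERATOR, PHRASEOPERATOR }

enum KeyWordOperator { BASIC, PHRASE }

class SourceOperatorDispatch {

    // Returns the keyword operator type that backs a given source operator,
    // or empty for a scan, which matches the text directly with a regex.
    static Optional<KeyWordOperator> keywordOperatorFor(SourceOperatorType type) {
        switch (type) {
            case KEYWORDOPERATOR:
                return Optional.of(KeyWordOperator.BASIC);
            case PHRASEOPERATOR:
                return Optional.of(KeyWordOperator.PHRASE);
            case SCANOPERATOR:
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (SourceOperatorType type : SourceOperatorType.values()) {
            System.out.println(type + " -> " + keywordOperatorFor(type));
        }
    }
}
```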

SourceOperatorType.SCANOPERATOR:

Loops through the dictionary entries. For each dictionary entry, it loops through the tuples produced by the operator; for each tuple, through the fields in the attribute list; and for each field, through all the matches. It returns only one tuple per document. If a document has multiple matches, all of their spans are included in a list.

Java regex is used to match at word boundaries.

Example: if the dictionary word is "Lin" and the text is "Lin is Angelina's friend", the matches should include the standalone word "Lin" but not the "lin" inside "Angelina".
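The scan loop and the word-boundary check can be sketched as follows. This is a minimal sketch under stated assumptions, not the Texera implementation: documents are modeled as plain maps from field name to text, `Span`, `findSpans`, and `scan` are hypothetical names, and case-insensitive matching is assumed so that the "Lin" example exercises the boundary check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScanMatchSketch {

    // Hypothetical span holder: the matched field, the dictionary entry,
    // and the [start, end) character offsets of the match.
    static final class Span {
        final String field;
        final String entry;
        final int start;
        final int end;

        Span(String field, String entry, int start, int end) {
            this.field = field;
            this.entry = entry;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return field + ":" + entry + "[" + start + "," + end + ")";
        }
    }

    // Finds whole-word occurrences of one dictionary entry in one field value.
    static List<Span> findSpans(String entry, String field, String text) {
        // Pattern.quote treats the entry as a literal; \b requires a word
        // boundary on both sides. Case-insensitivity is an assumption here.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(entry) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Span> spans = new ArrayList<>();
        while (matcher.find()) {
            spans.add(new Span(field, entry, matcher.start(), matcher.end()));
        }
        return spans;
    }

    // Scans every document for one dictionary entry. A document with at least
    // one match yields a single result that carries all of its spans.
    static Map<Integer, List<Span>> scan(String entry,
                                         List<Map<String, String>> documents,
                                         List<String> attributeList) {
        Map<Integer, List<Span>> results = new LinkedHashMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            List<Span> spans = new ArrayList<>();
            for (String field : attributeList) {
                String text = documents.get(docId).get(field);
                if (text != null) {
                    spans.addAll(findSpans(entry, field, text));
                }
            }
            if (!spans.isEmpty()) {
                results.put(docId, spans);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Map<String, String>> documents = Arrays.asList(
                Collections.singletonMap("text", "Lin is Angelina's friend"),
                Collections.singletonMap("text", "Angelina travels often"));
        System.out.println(scan("Lin", documents, Collections.singletonList("text")));
    }
}
```

In this sketch the outer loop over dictionary entries is left to the caller, which would invoke `scan` once per entry.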

SourceOperatorType.KEYWORDOPERATOR:

Loops through the dictionary entries. For each dictionary entry, the keyword matcher's getNextTuple is called using KeyWordOperator.BASIC. It then updates the span information at the end of the tuple.

SourceOperatorType.PHRASEOPERATOR:

Loops through the dictionary entries. For each dictionary entry, the keyword matcher's getNextTuple is called using KeyWordOperator.PHRASE. The span returned is the span information provided by the keyword matcher's phrase operator.
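Both index-based modes follow the same pull pattern: open a keyword matcher for the current dictionary entry, drain it through getNextTuple, then move on to the next entry. The sketch below illustrates only that control flow. The class `DictionaryIterationSketch` and its `matcherFactory` parameter are hypothetical, tuples are reduced to strings, and the substring filter in the demo is a toy stand-in for the real keyword matcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class DictionaryIterationSketch {

    private final Iterator<String> dictionary;
    // Hypothetical stand-in for opening a keyword matcher on one dictionary entry.
    private final Function<String, Iterator<String>> matcherFactory;
    private Iterator<String> current = Collections.emptyIterator();

    DictionaryIterationSketch(List<String> entries,
                              Function<String, Iterator<String>> matcherFactory) {
        this.dictionary = entries.iterator();
        this.matcherFactory = matcherFactory;
    }

    // Returns the next matching tuple, or null once every entry is exhausted.
    String getNextTuple() {
        while (!current.hasNext()) {
            if (!dictionary.hasNext()) {
                return null;
            }
            // The current entry has no more results: open a matcher for the next one.
            current = matcherFactory.apply(dictionary.next());
        }
        return current.next();
    }

    // Runs the iteration over a toy corpus; substring search replaces the real matcher.
    static List<String> runDemo() {
        List<String> documents =
                Arrays.asList("medical record", "medication list", "weather report");
        List<String> entries = Arrays.asList("medical", "medication", "medicare");
        DictionaryIterationSketch matcher = new DictionaryIterationSketch(
                entries,
                entry -> documents.stream().filter(doc -> doc.contains(entry)).iterator());
        List<String> results = new ArrayList<>();
        for (String tuple = matcher.getNextTuple(); tuple != null;
                tuple = matcher.getNextTuple()) {
            results.add(tuple);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```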

Presentation

Lucene Presentation (Team 1)

Performance Test

Machine configuration: MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

Dataset: 100K MEDLINE records

  • Index time: 29.4110 seconds

  • Performance results for DictionaryMatcher with SCANOPERATOR:
      • Dictionary: {"medical"}
      • Lucene Query time: 0.1480 seconds
      • Match time: 5.2740 seconds
      • Total: 2459 results

  • Performance results for DictionaryMatcher with PHRASEOPERATOR:
      • Dictionary: {"medical"}
      • Lucene Query time: 0.3840 seconds
      • Match time: 0.5980 seconds
      • Total: 2459 results

  • Performance results for DictionaryMatcher with SCANOPERATOR:
      • Dictionary: {"medical","medication"}
      • Lucene Query time: 0.4430 seconds
      • Match time: 10.9500 seconds
      • Total: 2904 results

  • Performance results for DictionaryMatcher with PHRASEOPERATOR:
      • Dictionary: {"medical","medication"}
      • Lucene Query time: 0.4560 seconds
      • Match time: 0.8950 seconds
      • Total: 2904 results

  • Performance results for DictionaryMatcher with PHRASEOPERATOR:
      • Dictionary: {"medical","medication","medicare","medicaid"}
      • Lucene Query time: 0.5210 seconds
      • Match time: 0.9100 seconds
      • Total: 3022 results

Dataset: 1M MEDLINE records

  • Index time: 335.6620 seconds

  • Performance results for DictionaryMatcher with SCANOPERATOR:
      • Dictionary: {"medical"}
      • Lucene Query time: 0.9840 seconds
      • Match time: 53.0320 seconds
      • Total: 29355 results

  • Performance results for DictionaryMatcher with PHRASEOPERATOR:
      • Dictionary: {"medical"}
      • Lucene Query time: 0.5870 seconds
      • Match time: 5.2180 seconds
      • Total: 29355 results

  • Performance results for DictionaryMatcher with PHRASEOPERATOR:
      • Dictionary: {"medical","medication","medicare","medicaid"}
      • Lucene Query time: 0.5950 seconds
      • Match time: 5.6970 seconds
      • Total: 36528 results
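For the three configurations measured with both operators, the match times reported here can be compared directly: SCANOPERATOR takes roughly 9, 12, and 10 times as long as PHRASEOPERATOR. The snippet below only reproduces that arithmetic from the figures on this page; `MatchTimeRatio` and `ratio` are hypothetical names introduced for the illustration.

```java
import java.util.Locale;

class MatchTimeRatio {

    // Ratio of SCANOPERATOR match time to PHRASEOPERATOR match time.
    static double ratio(double scanSeconds, double phraseSeconds) {
        return scanSeconds / phraseSeconds;
    }

    public static void main(String[] args) {
        // Match times in seconds, copied from the results listed on this page.
        System.out.println(String.format(Locale.ROOT,
                "100K, {medical}: %.1fx", ratio(5.2740, 0.5980)));
        System.out.println(String.format(Locale.ROOT,
                "100K, {medical, medication}: %.1fx", ratio(10.9500, 0.8950)));
        System.out.println(String.format(Locale.ROOT,
                "1M, {medical}: %.1fx", ratio(53.0320, 5.2180)));
    }
}
```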