Regex Matcher - Texera/texera GitHub Wiki

Author(s): Zuozhi Wang, Shuying Lai

Reviewer(s): Chen Li REVIEWED

Synopsys

Implement an operator to use an index-based method based on gram inverted index to support efficient regular expressions on large datasets. Our implementation is based on Russ Cox's algorithm and the corresponding open source implementation.

Status

As of 6/14/2016: FINISHED

Modules

edu.uci.ics.texera.dataflow.regexmatch
com.google.re2j

Related Issues

https://github.com/Texera/texera/issues/30 and https://github.com/Texera/texera/issues/99

Description

Our approach has the following steps:

  • Build a gram index for documents using Lucene.
  • Translate a regular expression to a boolean expression using Russ Cox's algorithm.
  • Run the Boolean query on Lucene to compute candidate documents.
  • Postprocess these candidate documents to remove false positives.

Presentation

[Presentation] (https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit?usp=sharing) (Team 3)

Initial Design Slides

Performance Test

Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory.

  • Data set: 1 million Medline records, about 1.53 GB
  • Regex: \bmedic(ine|al|ation|are|aid)?\b ("medicine" or "medical" or "medication" or "medicare" or "medicaid")
  • Performance results (average time reported in seconds):
Index Based Scan Based
Regex Matcher 7.43 67.65
RE2J 11.88 99.05
  • Data set: 5 million Medline records, about 7.8 GB.
  • Regex: \bmedic(ine|al|ation|are|aid)?\b ("medicine" or "medical" or "medication" or "medicare" or "medicaid")
  • Performance results (average time reported in seconds):
Index Based Scan Based
Regex Matcher 28.11 330.754
RE2J 44.72 506.34

TODOs

  • Design and implement algorithm to compare two boolean expressions. The basic idea is to format boolean expressions into DNF, and check the equivalence of two DNFs. There is a PR for this task: https://github.com/Texera/texera/pull/127 FINISHED
  • Use token position information to generate a more selective query to reduce the number of candidate documents.