Regex Matcher - Texera/texera GitHub Wiki
Author(s): Zuozhi Wang, Shuying Lai
Reviewer(s): Chen Li REVIEWED
Synopsys
Implement an operator to use an index-based method based on gram inverted index to support efficient regular expressions on large datasets. Our implementation is based on Russ Cox's algorithm and the corresponding open source implementation.
Status
As of 6/14/2016: FINISHED
Modules
edu.uci.ics.texera.dataflow.regexmatch
com.google.re2j
Related Issues
https://github.com/Texera/texera/issues/30 and https://github.com/Texera/texera/issues/99
Description
Our approach has the following steps:
- Build a gram index for documents using Lucene.
- Translate a regular expression to a boolean expression using Russ Cox's algorithm.
- Run the Boolean query on Lucene to compute candidate documents.
- Postprocess these candidate documents to remove false positives.
Presentation
[Presentation] (https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit?usp=sharing) (Team 3)
Performance Test
Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory.
- Data set: 1 million Medline records, about 1.53 GB
- Regex:
\bmedic(ine|al|ation|are|aid)?\b
("medicine" or "medical" or "medication" or "medicare" or "medicaid") - Performance results (average time reported in seconds):
Index Based | Scan Based | |
---|---|---|
Regex Matcher | 7.43 | 67.65 |
RE2J | 11.88 | 99.05 |
- Data set: 5 million Medline records, about 7.8 GB.
- Regex:
\bmedic(ine|al|ation|are|aid)?\b
("medicine" or "medical" or "medication" or "medicare" or "medicaid") - Performance results (average time reported in seconds):
Index Based | Scan Based | |
---|---|---|
Regex Matcher | 28.11 | 330.754 |
RE2J | 44.72 | 506.34 |
TODOs
- Design and implement algorithm to compare two boolean expressions. The basic idea is to format boolean expressions into DNF, and check the equivalence of two DNFs. There is a PR for this task: https://github.com/Texera/texera/pull/127 FINISHED
- Use token position information to generate a more selective query to reduce the number of candidate documents.