SystemT Rewriter - Texera/texera GitHub Wiki

Authors: Qing Tang (qingt AT uci dot edu), Jinggang Diao (diaojinggang AT gmail dot edu), Flavio Bayer (flaviorbayer AT gmail dot edu).

Status: As of July 2, 2016, this task was not completed due to the complexity of the SystemT language and our limited amount of time. We will do a separate task to translate a SystemT query to a Texera plan.

Progress:

5/2: We selected some regex test cases and ran them with the SystemT implementation. The results are here: https://drive.google.com/file/d/0B1FdPBs0KkvxYVAwVDF2dDRFdGM/view?usp=sharing. (P.S. Zuozhi ran the same tests on his laptop with the Lucene implementation.)

The following is a rough grammar that we have now: https://drive.google.com/file/d/0B1FdPBs0KkvxdU9jVXRqSnVJdFk/view?usp=sharing

We will add more detail to this grammar and then build the parser on top of it.

============================================================================================================================

4/17: We have completed a simple parser model; the code will be uploaded before Monday's lecture. I am currently working out how to manage the relationships between views, and I am also getting familiar with OperatorGraph. We have successfully installed JavaCC, and we think it could be useful for generating the parser once the grammar is designed.

Test case: https://drive.google.com/a/uci.edu/file/d/0B1FdPBs0KkvxY2ZCclpILXBUbHM/view?usp=sharing

Parse result:

Dict List: []

Regex List: [(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:orber)?|Nov(?:ember)?|Dec(?:ember)?) (?:19[7-9]\d|2\d{3}), (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:orber)?|Nov(?:ember)?|Dec(?:ember)?)\s(\d|[0-2]\d|3[0-1]),\s(19\d{2}|2\d{3}), (\d|0\d|1[0-2])/(\d|[0-2]\d|3[0-1])/(19\d{2}|2\d{3}|d{2})]

Name List: [DateFormat1, DateFormat2, DateFormat3, DateUnion]

Union List: [DateUnion, (, DateUnion, (] <= This part is under consideration.
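
As an illustration of how lists like these might be collected, here is a minimal Python sketch. It is not our parser: the AQL input below is a hypothetical stand-in for the linked test case, the names `parse_views`, `VIEW`, `REGEX` and `UNION_SOURCE` are made up for this example, and it assumes that no regex body contains a semicolon. It records each union view together with the views it reads from, which is one possible way to represent the relationship between views.

```python
import re

# Hypothetical AQL input; the real test case is in the linked file and may differ.
AQL = r"""
create view DateFormat1 as
  extract regex /(?:19[7-9]\d|2\d{3})/ on D.text as date
  from Document D;

create view DateFormat2 as
  extract regex /\d{1,2}\/\d{1,2}\/\d{4}/ on D.text as date
  from Document D;

create view DateUnion as
  (select F.date from DateFormat1 F)
  union all
  (select F.date from DateFormat2 F);
"""

# One "create view <name> as <body>;" statement (assumes no ';' inside the body).
VIEW = re.compile(r"create\s+view\s+(\w+)\s+as\s+(.*?);", re.S | re.I)
# The /.../ literal after "extract regex", allowing escaped slashes inside it.
REGEX = re.compile(r"extract\s+regex\s+/((?:\\.|[^/\\])*)/", re.I)
# View names that appear after "from" inside a union body.
UNION_SOURCE = re.compile(r"from\s+(\w+)", re.I)


def parse_views(aql):
    """Return (view names, regex bodies, {union view: [source views]})."""
    names, regexes, unions = [], [], {}
    for name, body in VIEW.findall(aql):
        names.append(name)
        match = REGEX.search(body)
        if match:
            regexes.append(match.group(1))
        elif re.search(r"\bunion\s+all\b", body, re.I):
            unions[name] = UNION_SOURCE.findall(body)
    return names, regexes, unions


names, regexes, unions = parse_views(AQL)
print("Regex List:", regexes)
print("Name List:", names)
print("Union List:", unions)
```

Keeping the union entry as a mapping from the union view to its source views avoids capturing stray tokens such as `(`, but a real grammar-based parser (for example one generated by JavaCC) would handle nesting and quoting more robustly than these regular expressions.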

============================================================================================================================

4/11: https://docs.google.com/presentation/d/1RAxF3ZyBCPOwrOvOqM5iQhJCnjw2iLVKg1qiyviT_UA/edit#slide=id.p

============================================================================================================================

SystemT is a software package developed by IBM to support powerful information extraction.

The purpose of this task is to write a parser for the SystemT language so that we can translate a SystemT query into a query that our Texera system can answer efficiently using its available indexing and query-processing capabilities.

Resources: