SystemT Rewriter - Texera/texera GitHub Wiki
Authors: Qing Tang (qingt AT uci dot edu), Jinggang Diao (diaojinggang AT gmail dot edu), Flavio Bayer (flaviorbayer AT gmail dot edu).
Status: As of July 2, 2016, this task was not completed due to the complexity of the SystemT language and our limited amount of time. We will do a separate task to translate a SystemT query to a Texera plan.
Progress:
5/2: We selected some regex test cases and ran them with the SystemT implementation. The results are here: https://drive.google.com/file/d/0B1FdPBs0KkvxYVAwVDF2dDRFdGM/view?usp=sharing. (P.S. Zuozhi ran the same tests on his laptop with the Lucene implementation.)
The following is a rough grammar that we have now: https://drive.google.com/file/d/0B1FdPBs0KkvxdU9jVXRqSnVJdFk/view?usp=sharing
We will try to add more details to this grammar and then build the parser based on it.
============================================================================================================================
4/17: We have completed a simple parser model; the code will be uploaded before Monday's lecture. I am currently trying to figure out a way to manage the relationships between views, and I am also getting familiar with OperatorGraph. We have successfully installed JavaCC, and we think it might be useful for generating the parser once we design the grammar.
Test case: https://drive.google.com/a/uci.edu/file/d/0B1FdPBs0KkvxY2ZCclpILXBUbHM/view?usp=sharing
Parse result:
Dict List: []
Regex List: [(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:orber)?|Nov(?:ember)?|Dec(?:ember)?) (?:19[7-9]\d|2\d{3}), (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:orber)?|Nov(?:ember)?|Dec(?:ember)?)\s(\d|[0-2]\d|3[0-1]),\s(19\d{2}|2\d{3}), (\d|0\d|1[0-2])/(\d|[0-2]\d|3[0-1])/(19\d{2}|2\d{3}|d{2})]
Name List: [DateFormat1, DateFormat2, DateFormat3, DateUnion]
Union List: [DateUnion, (, DateUnion, (] <= This part is under consideration.
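To make the shape of this parse result more concrete, here is a minimal sketch of how view names, regexes, and union members could be collected from AQL text. It is not the parser described above (that code is not reproduced on this page); it is a regular-expression approximation, and the AQL snippet, the function name `parse_views`, and the simplified patterns are all assumptions made for illustration. In particular, it assumes that no regex literal contains a semicolon.

```python
import re

# Hypothetical AQL input for illustration only; the real test case is in the
# linked file and is not reproduced here.
AQL = r"""
create view DateFormat1 as
  extract regex /(?:19[7-9]\d|2\d{3})/ on D.text as date
  from Document D;

create view DateFormat2 as
  extract regex /\d{1,2}\/\d{1,2}\/\d{4}/ on D.text as date
  from Document D;

create view DateUnion as
  (select F.date from DateFormat1 F)
  union all
  (select F.date from DateFormat2 F);
"""

# One "create view <name> as <body>;" statement (assumes no ';' inside the body).
VIEW_RE = re.compile(r"create\s+view\s+(\w+)\s+as\s+(.*?);",
                     re.IGNORECASE | re.DOTALL)
# The /.../ literal after "extract regex", allowing escaped characters such as \/.
REGEX_RE = re.compile(r"extract\s+regex\s+/((?:\\.|[^/\\])*)/", re.IGNORECASE)
# The view name that follows each "from" keyword.
FROM_RE = re.compile(r"from\s+(\w+)", re.IGNORECASE)


def parse_views(aql):
    """Return (view names, regex strings, {union view: member views})."""
    names, regexes, unions = [], [], {}
    for match in VIEW_RE.finditer(aql):
        name, body = match.group(1), match.group(2)
        names.append(name)
        regex = REGEX_RE.search(body)
        if regex:
            regexes.append(regex.group(1))
        elif re.search(r"\bunion\s+all\b", body, re.IGNORECASE):
            # Record which views feed the union, keyed by the union's name.
            unions[name] = FROM_RE.findall(body)
    return names, regexes, unions


if __name__ == "__main__":
    names, regexes, unions = parse_views(AQL)
    print("Regex List:", regexes)
    print("Name List:", names)
    print("Union List:", unions)
```

Storing the union as a mapping from the union view to its member views is one possible way to represent the relationship between views; a grammar-based parser (for example, one generated with JavaCC) would handle nesting and quoting far more robustly than these patterns.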
============================================================================================================================
4/11: https://docs.google.com/presentation/d/1RAxF3ZyBCPOwrOvOqM5iQhJCnjw2iLVKg1qiyviT_UA/edit#slide=id.p
============================================================================================================================
SystemT is a software package developed by IBM to support powerful information extraction.
The purpose of this task is to write a parser for the SystemT language so that we can translate a SystemT query into a query that our Texera system can answer efficiently using its available indexing and query-processing capabilities.
Resources:
- SystemT is available externally as BigInsights Text Analytics. You can get a copy of BigInsights from the following link: http://www-01.ibm.com/support/docview.wss?uid=swg24040517.
- The full specification of AQL can be found at http://www.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.aqlref.doc/doc/aql-overview.html.
- Students who are interested in building extractors via a UI can do so via BlueMix, as instructed here: http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=6335
- SystemT is a proprietary product of IBM. With the kind support of our IBM colleagues, we can access it for educational purposes. If you want to access the package, please contact the instructor (Prof. Chen Li).