Query Rewriter - Texera/texera GitHub Wiki
Authors: Shiladitya Sen, Kishore Narendran
Reviewer: Chen Li (DONE)
Synopsis
The purpose of the "QueryRewriter" operator is to correct missing-space errors in a query that can lead to incorrect tokenization. For instance, the query "newyork" can be rewritten by this operator to "new york". The operator can be used to return:
- The most likely rewritten query found using a word-frequency dictionary; or
- A set of valid rewritten queries.
Status
As of 6/3/2016: COMPLETED
Modules
edu.uci.ics.texera.dataflow.queryrewriter
Related Issues
Design: Query Rewriter Issue - https://github.com/Texera/texera/issues/29
Description
The operator rewrites a query by inserting spaces into the query string so that it splits into likely words. It has two implementations:
- A dynamic programming algorithm that uses a word-frequency dictionary to find the most likely tokenization. This algorithm was adopted from the Chinese character tokenization performed in the Srch2 Chinese Tokenization module (https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197). The word-frequency dictionary was derived from Google unigrams and the NLTK English dictionary. The score the algorithm assigns to each word is the reciprocal of its frequency.
- A recursive algorithm that uses an English dictionary (possibly without word frequencies) to find all valid tokenizations of a search string. This algorithm can be found here
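The two approaches described above can be illustrated with a minimal Python sketch. This is not the Texera implementation (the operator itself lives in the Java package listed under Modules). The function names `most_likely_rewrite` and `all_rewrites`, the toy dictionary `WORD_FREQ`, and the choice to sum per-word scores and pick the minimum total are all assumptions made for illustration. The only detail taken from the description is that each word's score is the reciprocal of its frequency.

```python
# Hypothetical toy word-frequency dictionary. The real one is described as
# being derived from Google unigrams and the NLTK English dictionary.
WORD_FREQ = {"new": 500, "york": 200, "newyork": 1, "ne": 2, "w": 1}


def most_likely_rewrite(query, word_freq):
    """Dynamic programming: return the lowest-cost split of `query`.

    Assumption: a word costs 1 / frequency and a split costs the sum of its
    word costs. Returns None if `query` cannot be split into known words.
    """
    n = len(query)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimum cost of splitting query[:i]
    back = [-1] * (n + 1)   # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = query[start:end]
            if best[start] < inf and word in word_freq:
                cost = best[start] + 1.0 / word_freq[word]
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
    if best[n] == inf:
        return None
    # Walk the back-pointers from the end of the query to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(query[start:end])
        end = start
    return " ".join(reversed(words))


def all_rewrites(query, dictionary):
    """Recursion: return every split of `query` into dictionary words."""
    if not query:
        return [""]
    results = []
    for end in range(1, len(query) + 1):
        prefix = query[:end]
        if prefix in dictionary:
            # Split the remainder recursively and prepend the current word.
            for rest in all_rewrites(query[end:], dictionary):
                results.append(prefix if not rest else prefix + " " + rest)
    return results


if __name__ == "__main__":
    print(most_likely_rewrite("newyork", WORD_FREQ))  # new york
    print(all_rewrites("newyork", set(WORD_FREQ)))
```

With this toy dictionary, the dynamic programming version picks "new york" because its two frequent words have a lower total cost than the rare single word "newyork". The recursive version needs no frequencies and returns all three valid splits. It re-splits the same suffix many times, so a real implementation would probably memoize on the suffix.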