Token-based Fuzzy Match - Texera/texera GitHub Wiki

Authors: Varun Bharill, Parag Saraogi.

Reviewers:

Ideas from paper - "Efficient Merging and Filtering Algorithms for Approximate String Searches"

  1. Creating N-gram inverted lists. A single document can be treated as a string, and all the N-grams of each document need to be indexed. The final result of this step is an inverted index mapping each N-gram to the list of document IDs that contain it.

Questions: a) How do we choose "N"? b) For a particular gram, are the document IDs stored in sorted order?
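Step 1 can be sketched as follows. This is a minimal illustration, not the paper's or Texera's actual implementation; the document IDs and the choice N = 5 are made up for the example. It also stores each posting list in sorted order, anticipating question b):

```python
from collections import defaultdict

def ngrams(s, n):
    """Return the list of character n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_inverted_index(docs, n):
    """Build an inverted index mapping each n-gram to the sorted
    list of IDs of the documents that contain it.

    docs: dict of document ID -> document text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    # Keep posting lists sorted so merge algorithms (e.g. DivideSkip)
    # can scan or binary-search them efficiently.
    return {gram: sorted(ids) for gram, ids in index.items()}

# Hypothetical toy corpus for illustration.
docs = {1: "university of california", 2: "university of chicago"}
index = build_inverted_index(docs, 5)
# The 5-gram "unive" appears in both documents; "calif" only in document 1.
```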

  2. The T-occurrence problem. Given a query string q, compute all of its N-grams. From the index created in step 1, we get, for each gram of the query string, the list of documents in which that gram appears. The T-occurrence problem is to find the documents that appear on at least T of these lists. Available algorithms: Heap-based merge, ScanCount, DivideSkip, etc.

Questions: Let's say the phrase "University of California" appears in a certain document d1. Assume N = 5, so all the 5-grams of this phrase have been indexed. Assume the query string q = "University California". From step 2, many 5-grams of the query string will map to different documents. The index created in step 1 could also store the positions of the grams within each document. Based on the mappings of the query string's 5-grams, can we identify a relevant phrase in the document?
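Of the merge algorithms listed above, ScanCount is the simplest to sketch: count, per document, how many of the query's grams hit it, then keep documents reaching the threshold T. A minimal sketch, assuming the gram-to-document-ID index from step 1 (the tiny index and T = 2 below are illustrative):

```python
from collections import Counter

def scan_count(index, query, n, t):
    """ScanCount solution to the T-occurrence problem: return the IDs
    of documents that appear on at least t of the posting lists of the
    query's n-grams.

    index: dict mapping n-gram -> list of document IDs.
    """
    counts = Counter()
    for i in range(len(query) - n + 1):
        gram = query[i:i + n]
        for doc_id in index.get(gram, []):
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= t}

# Hypothetical fragment of an inverted index, for illustration only.
index = {
    "unive": [1, 2],
    "niver": [1, 2],
    "calif": [1],
}
# Documents sharing at least 2 of the query's 5-grams:
matches = scan_count(index, "university", 5, 2)
```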

  3. Filtering

We have studied various filtering mechanisms, such as length filtering, position filtering, and prefix filtering, which can be combined in a tree structure where every level corresponds to a filter. We also studied the order in which these filters should be applied to gain the maximum performance advantage.
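As a concrete example of one such filter, the length filter uses the fact that two strings within edit distance k can differ in length by at most k, so candidates outside that window can be pruned before any expensive verification. A minimal sketch (the candidate list and k = 1 are illustrative, not from the paper):

```python
def length_filter(candidates, query, k):
    """Keep only candidate strings whose length is within k of the
    query's length.

    Two strings within edit distance k differ in length by at most k,
    so anything outside this window cannot match and can be pruned
    before a costly edit-distance computation.
    """
    return [s for s in candidates if abs(len(s) - len(query)) <= k]

# Hypothetical candidate set, for illustration.
words = ["cat", "cart", "carpet", "category"]
survivors = length_filter(words, "cart", 1)
# "carpet" and "category" are pruned: their lengths differ from
# len("cart") = 4 by more than 1.
```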

Presentation on Token based fuzzy matching - https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259