MIN EDIT DISTANCE NLP(Levenshtein distance) - RTalha/Natural-Language-Processing GitHub Wiki
Min Edit Distance!
Much of natural language processing is concerned with measuring how similar two strings are. For example in spelling correction, the user typed some erroneous string—let’s say ## graffe We want to know what the user meant. The user probably intended a word that is similar to graffe. Among candidate similar words,
grail,
graf,
giraffe
The word ## giraffe, differs by only one letter and seems intuitively to be more similar. Another example comes from co-reference, the task of deciding whether two strings refer to the same entity:
Stanford President John Hennessy.
Stanford University President John Hennessy
Again, the fact that these two strings are very similar (differing by only one word) seems like useful evidence for deciding that they might be co-referent.
Edit distance gives us a way to quantify both of these intuitions about string similarity. More formally, the minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substitute e for n, substitute x for t, insert c, substitute u for n).
Levenshtein also proposed an alternative version of his metric in which each insertion or deletion has a cost of 1 and substitutions are not allowed. This is equivalent to allowing substitution, but giving each substitution a cost of 2 since any substitution can be represented by one insertion and one deletion. Using this version, the Levenshtein distance between intention and execution is 8.