Techniques for turning text data into data that can be analysed - PeppermintT/Learning-NLP GitHub Wiki

**Tokenization** For language processing, we want to break up a string (i.e. text) into words and punctuation. This step is called tokenization, and it produces a list of word and punctuation tokens. The data type is a Python list.

The NLTK syntax is `word_tokenize()`, e.g. `tokens = word_tokenize(raw)`.

**Lemmatization** Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words: the output (the lemma) is a valid dictionary word.

**Stemming** Stemming is the process of reducing inflected (or derived) words to their root/base form, called the stem. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words "chocolates", "chocolatey", "choco" to the root word "chocolate", and "retrieval", "retrieved", "retrieves" reduce to the stem "retrieve".
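A minimal sketch using NLTK's `PorterStemmer` to illustrate the reductions above. Note that a stem need not be a dictionary word: the Porter algorithm maps the "retrieve" family to a truncated stem rather than to "retrieve" itself.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# All three inflected forms collapse to one stem.
for word in ["retrieval", "retrieved", "retrieves"]:
    print(word, "->", stemmer.stem(word))

# Singular and plural also collapse to the same stem.
print(stemmer.stem("chocolate"), stemmer.stem("chocolates"))
```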

Over-stemming occurs when two words with different stems are reduced to the same root; it can be regarded as a false positive. Under-stemming occurs when two words that share a stem are not reduced to the same root; it can be regarded as a false negative.

Source: https://www.geeksforgeeks.org/introduction-to-stemming/

Figure 3.1 - normalizing the vocabulary and using `set()` to identify distinct words.
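A minimal sketch of the idea in Figure 3.1, using a made-up token list: lowercasing normalizes the vocabulary, and `set()` keeps one copy of each distinct word.

```python
# Hypothetical token list for illustration.
tokens = ["The", "cat", "saw", "the", "dog", "The", "CAT"]

# Lowercase each token so "The"/"the"/"THE" count as one word,
# then use set() to drop duplicates.
vocab = set(word.lower() for word in tokens)
print(sorted(vocab))  # ['cat', 'dog', 'saw', 'the']
```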