NLTK (Natural Language Toolkit) and text preprocessing

What is nltk?

It stands for Natural Language Toolkit, a Python library for working with human-language data. The accompanying textbook, Natural Language Processing with Python (originally written for Python 2), is available on the NLTK website (nltk.org/book).

What tasks can be done with nltk?

With NLTK you can access corpora, process strings, perform collocation discovery (t-test, chi-squared, pointwise mutual information), part-of-speech tagging, classification, chunking, parsing and semantic interpretation, and use its evaluation metrics and probability/estimation utilities.
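
As a quick illustration, here is a minimal sketch that uses NLTK for two of these tasks, tokenization and part-of-speech tagging. The example sentence is arbitrary, and punkt and averaged_perceptron_tagger are the standard NLTK data packages assumed to be installed via nltk.download.

import nltk

# One-time downloads of the standard NLTK resources used below.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "NLTK makes it easy to tokenize text and to tag parts of speech."
tokens = nltk.word_tokenize(text)   # tokenization
tags = nltk.pos_tag(tokens)         # part-of-speech tagging
print(tags)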

What is lexical analysis?

Lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of tokens. Tokens are strings with an assigned and thus identified meaning. A program that performs lexical analysis may be termed a lexer or tokenizer.

What are the various text preprocessing techniques?

Tokenization, Stop word removal, Lemmatization and Stemming.

What is the idea behind text preprocessing?

In order to retrieve useful information from a corpus of text, we do text preprocessing. We need to decide which documents in a collection should be retrieved to satisfy a user's need for information. A user has a query, which contains one or more search terms, plus some additional information such as a weight associated with each term. The retrieval decision is therefore made by comparing the terms of the query with the index terms (important words) appearing in the document itself.

The decision may be binary, whether to retrieve or reject, or it may involve estimating the degree of relevance that the document has to the query.

Often, the words that appear in documents and in queries have many structural variants. So before retrieving information from the documents, preprocessing techniques are applied to the target data set to reduce its size, which in turn increases the effectiveness of the Information Retrieval system.

What are the steps for text-preprocessing?

  1. Remove the hyperlinks in the document/corpus.
  2. Remove any special characters or punctuation (such as . or #) in the document/corpus.
  3. Remove alphanumeric tokens/numbers in the document/corpus.
  4. Find and remove HTML tags, if any, in the document/corpus.
  5. Keep only words whose length is greater than 2 (it has been observed that there are no 2-letter adjectives).
  6. Convert the words to lowercase.
  7. Remove the stop words.
  8. Apply tokenization to get the list of words from the document/corpus.
  9. Apply stemming (Snowball, which was observed to work better than Porter stemming) or lemmatization on the list of words, depending on the problem we are going to solve; a sketch of such a pipeline is shown after this list. Follow the complete steps here.
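
The following is a minimal sketch of such a pipeline, assuming NLTK's English stop-word list and the Snowball stemmer; the regular expressions and the preprocess function name are illustrative, not taken from any library.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords')  # one-time download of the stop-word list

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess(document):
    document = re.sub(r'https?://\S+', ' ', document)  # 1. remove hyperlinks
    document = re.sub(r'<[^>]+>', ' ', document)       # 4. remove HTML tags
    document = re.sub(r'[^a-zA-Z\s]', ' ', document)   # 2-3. drop punctuation, numbers, special characters
    words = document.lower().split()                   # 6 & 8. lowercase and split into words
    words = [w for w in words if len(w) > 2 and w not in stop_words]  # 5 & 7. length filter, stop words
    return [stemmer.stem(w) for w in words]            # 9. Snowball stemming

print(preprocess("Check <b>this</b> out: https://example.com - tasty, tasteless and delicious food!"))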

What is tokenization?

It is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens, i.e. figuring out the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining.

The main use of tokenization is identifying the meaningful keywords.

It is useful both in linguistics and computer science where it forms part of lexical analysis.
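
A minimal sketch of both sentence-level and word-level tokenization with NLTK, assuming the standard punkt tokenizer models:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

text = "Tokenization splits text into tokens. Punctuation becomes its own token!"
print(sent_tokenize(text))  # a list with two sentences
print(word_tokenize(text))  # words and punctuation marks as separate tokens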

What are stop words and why do we remove them from the text?

Stop words are words that occur very frequently and carry little meaning on their own, since they are mainly used to join other words together in a sentence. Because they don't contribute to the context or content of textual documents, their presence in text mining is an obstacle to understanding the content of the documents. Some categories of stop words that can be eliminated are: numbers, date formats, and the most common words such as prepositions, articles and pronouns.

Some examples of stop words are: and, are, this, etc. They are not useful for classifying documents, so they should be removed.
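
A minimal sketch of stop-word removal using NLTK's built-in English stop-word list (the sentence is just an example):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
sentence = "This food is tasty and the service was good"
filtered = [w for w in word_tokenize(sentence.lower()) if w not in stop_words]
print(filtered)  # words such as 'this', 'is', 'and', 'the', 'was' are dropped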

What is meant by Lemmatization?

Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.

For example: beautiful and beautifully are lemmatised to beautiful and beautifully respectively, without changing the meaning of the words. But good, better and best are all lemmatised to good, since the three words have a similar meaning.

Lemmatisation doesn't work according to the meaning of the word alone; it works according to the part of speech of the word in the sentence.

Some libraries, such as StanfordCoreNLP, can lemmatise each word in a sentence according to its part of speech within that sentence.
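
With NLTK's WordNetLemmatizer, a similar effect can be approximated by passing the part-of-speech tag explicitly; a minimal sketch:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better"))             # better (the default POS is noun)
print(lemmatizer.lemmatize("better", pos="a"))    # good   (treated as an adjective)
print(lemmatizer.lemmatize("studying", pos="v"))  # study  (treated as a verb)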

What is meant by stemming?

Stemming is the process of reducing words to their root form, even if the root has no dictionary meaning.

For example: beautiful and beautifully will both be stemmed to beauti, which has no meaning in the English dictionary (every term is reduced to its root term).
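
A quick check of this example with NLTK's Snowball stemmer (the pipeline above suggested Snowball over Porter):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem("beautiful"))    # beauti
print(stemmer.stem("beautifully"))  # beauti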

What will be the lemmatised word for tasty and tasteless?

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("tasty"))      # tasty
print(lemmatizer.lemmatize("tasteless"))  # tasteless

What will be the lemmatised word for studies and studying?

print(lemmatizer.lemmatize("studies"))   # study
print(lemmatizer.lemmatize("studying"))  # studying

What will be the stemmed word for "tasty","tasteless","delicious"?

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["tasty", "tasteless", "delicious"]

for w in example_words:
    print(ps.stem(w))
# Output: tasti, tasteless, delici

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lmtzr = WordNetLemmatizer()

print(lmtzr.lemmatize('tasteless'))  # tasteless
print(lmtzr.lemmatize('delicious'))  # delicious
print(lmtzr.lemmatize('tasty'))      # tasty

Which works better for NLP tasks, Stemming or Lemmatization?

Stemming is the process of reducing a word to its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.

For example, the words fish, fishes and fishing all stem to fish, which is a correct word. On the other hand, the words study, studies and studying stem to studi, and words like cries and cry stem to cri, which are not English words.

A lemmatizer, by contrast, provides a different lemma for each token: study for studies and studying for studying.

So when we need to build a feature set to train a machine learning model, lemmatization is generally preferred.
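
A small side-by-side sketch of the behaviour described above, using NLTK's Porter stemmer and WordNet lemmatizer (the lemmatizer defaults to the noun part of speech):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

ps = PorterStemmer()
wnl = WordNetLemmatizer()

# Prints each word, its Porter stem and its WordNet lemma.
for word in ["fish", "fishes", "fishing", "study", "studies", "studying", "cry", "cries"]:
    print(word, ps.stem(word), wnl.lemmatize(word))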

What are the advantages and disadvantages of stemming over lemmatization?

Advantages obviously include shrinking the vocabulary space, thus drastically reducing the size of the index (or feature space). Strictly dictionary-based or rule-based stemmers (e.g. the Porter stemmer) are very fast.

The popular implementations of stemming are still rule based, and not all words can be stemmed to their correct root word. These are disadvantages we trade for the speed of the stemming process.

Stemming can't relate words that take different forms under different grammatical constructs: is, am and be all represent the same root verb, be, but stemming can't reduce them to this common form. Likewise, the word better should resolve to good, but stemmers fail to do that.

With stemming there is a lot of ambiguity, which may cause several false positives. Axes is the plural form of both axe and axis; by chopping off the "s", you won't be able to relate the x axis of a plane with x-y axes. Further, stemming is harder for languages like Hebrew and Arabic.
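
A small illustration of these limitations, assuming NLTK's Porter stemmer and WordNet lemmatizer (the pos argument tells the lemmatizer which part of speech to assume):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

ps = PorterStemmer()
wnl = WordNetLemmatizer()

print(ps.stem("is"), ps.stem("am"), ps.stem("better"))             # the forms stay unrelated
print(wnl.lemmatize("is", pos="v"), wnl.lemmatize("am", pos="v"))  # be be
print(wnl.lemmatize("better", pos="a"))                            # good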

Give a simpler example of Stemming and Lemmatization.

Amusement : Amus (Stemming)
Amusement : Amuse (Lemmatization)