ICP7 - narhirep/Python-Deep-Learning GitHub Wiki

Welcome to In-Class Programming 7 (ICP7):

Description: This assignment covers Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK). This ICP focuses on text-processing techniques such as unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.

Objective: To learn and implement basic NLP techniques such as unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.

Implementation:

1. Make the following changes to the given source code:
   a. Change the classifier to SVM and see how the accuracy changes.
   b. Set the TfidfVectorizer parameter to use bigrams, TfidfVectorizer(ngram_range=(1,2)), and see how the accuracy changes.
   c. Set the TfidfVectorizer argument stop_words='english' and see how the accuracy changes.
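The three variations can be compared side by side with a small helper. The sketch below is only an illustration under stated assumptions: the toy sentences and labels are hypothetical stand-ins for the dataset of the given source code, MultinomialNB is an assumed baseline classifier, and LinearSVC is one possible choice of SVM (sklearn.svm.SVC would be another).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the dataset of the given source code.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "the new phone has a fast processor",
    "the laptop ships with more memory",
    "the update fixed a software bug",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
test_texts = ["the goalkeeper saved the match", "the phone got a software update"]
test_labels = ["sport", "tech"]


def accuracy(vectorizer, classifier):
    """Fit vectorizer + classifier on the training set; return test accuracy."""
    model = make_pipeline(vectorizer, classifier)
    model.fit(train_texts, train_labels)
    return model.score(test_texts, test_labels)


# One entry per variation asked for in the task, plus an assumed baseline.
variants = {
    "baseline: MultinomialNB (assumed)": (TfidfVectorizer(), MultinomialNB()),
    "a. SVM": (TfidfVectorizer(), LinearSVC()),
    "b. SVM + bigrams": (TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
    "c. SVM + stop words": (TfidfVectorizer(stop_words="english"), LinearSVC()),
}
for name, (vectorizer, classifier) in variants.items():
    print(f"{name}: accuracy = {accuracy(vectorizer, classifier):.2f}")
```

On a corpus this small the accuracy figures say little; the point is only that each variation changes one argument while the rest of the pipeline stays the same.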

CODE: OUTPUT:

2. Extract the text of the web page https://en.wikipedia.org/wiki/Google using BeautifulSoup and save the result in a file “input.txt”. Apply the following to the “input.txt” file:
•Tokenization
•POS
•Stemming
•Lemmatization
•Trigram
•Named Entity Recognition
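The extraction step could look roughly like the sketch below. The helper names html_to_text and save_page_text are hypothetical, the download uses the standard library's urllib (the original code may have used a different HTTP client), and the User-Agent header is an assumption. The demonstration runs on an inline HTML snippet so that it works without network access.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are not readable page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def save_page_text(url, path="input.txt"):
    """Download url, extract its text and write it to path."""
    # Assumption: some sites reject requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html_to_text(html))


# Offline demonstration on an inline snippet (no network access needed).
sample = (
    "<html><body><h1>Google</h1>"
    "<script>var x = 1;</script>"
    "<p>Search engine.</p></body></html>"
)
print(html_to_text(sample))
```

Calling save_page_text("https://en.wikipedia.org/wiki/Google") would then produce the “input.txt” file that the remaining steps read.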

•Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. Tokens may be sentences, words, numbers, or punctuation marks.
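As a minimal sketch of both sentence-level and word-level tokens: the example below uses NLTK's wordpunct_tokenize, which needs no extra data, and a naive regular expression for sentences. These are assumptions made to keep the example self-contained; nltk.sent_tokenize and nltk.word_tokenize are more robust but require downloading NLTK's Punkt tokenizer data first.

```python
import re

from nltk.tokenize import wordpunct_tokenize

text = "Google was founded in 1998. It is based in Mountain View, California."

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# wordpunct_tokenize separates runs of word characters from runs of punctuation.
words = wordpunct_tokenize(text)

print(sentences)
print(words)
```

Note that punctuation marks come out as tokens of their own, which matters for later steps such as POS tagging.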

CODE: Below is the code for tokenization. It sends an HTTP request to the URL "https://en.wikipedia.org/wiki/Google" and uses the BeautifulSoup library to parse the returned HTML.

OUTPUT:

•POS

The process of classifying the words in a text (corpus) into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. In English, the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection.
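To make the idea of labeling each token concrete, here is a small rule-based sketch using NLTK's RegexpTagger. This is an assumption chosen because it needs no downloaded model; the assignment most likely used the statistical tagger behind nltk.pos_tag, which is far more accurate. The suffix patterns and the example tokens are illustrative only.

```python
from nltk.tag import RegexpTagger

# Patterns are tried in order; the first one that matches decides the tag.
patterns = [
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r"^(The|the|A|a|An|an)$", "DT"),   # determiners
    (r".*ly$", "RB"),                   # adverbs
    (r".*ing$", "VBG"),                 # gerunds / present participles
    (r".*ed$", "VBD"),                  # past-tense verbs
    (r".*s$", "NNS"),                   # plural nouns
    (r".*", "NN"),                      # default: singular noun
]
tagger = RegexpTagger(patterns)

tokens = ["The", "company", "quickly", "indexed", "7", "pages"]
tagged = tagger.tag(tokens)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

The output is a list of (token, tag) pairs, the same shape that nltk.pos_tag returns.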

CODE: OUTPUT:

•Stemming

Stemming is the process of reducing inflected words to their stem, base, or root form.
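A short sketch of stemming with NLTK's PorterStemmer, which is one of several stemmers NLTK offers (the original code may have used a different one). The word list is an illustrative assumption.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Several inflected and derived forms of the same word.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

All five forms collapse to the same stem, which is what lets a search or a classifier treat them as one term.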

CODE: OUTPUT:

•Lemmatization

Lemmatization involves first determining the part of speech of a word and then applying the normalization rules for that part of speech to reduce the word to its dictionary form (lemma).
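The dependence on part of speech can be illustrated with a toy lemmatizer. Everything in the sketch below (the tiny lexicon, the exception table, the suffix rules, and the lemmatize function) is hypothetical and written in plain Python for illustration; in NLTK the corresponding tool is WordNetLemmatizer, whose lemmatize method takes the part of speech as its pos argument and requires the WordNet data to be downloaded.

```python
# Hypothetical mini-lexicon of known base forms, keyed by part of speech
# ("n" = noun, "v" = verb, "a" = adjective).
LEXICON = {
    "n": {"study", "search", "mouse"},
    "v": {"study", "search", "run"},
    "a": {"fast", "good"},
}

# Irregular forms that no suffix rule can handle.
EXCEPTIONS = {
    "n": {"mice": "mouse"},
    "v": {"ran": "run"},
    "a": {"better": "good"},
}

# Suffix rules differ per part of speech: (suffix to strip, replacement).
RULES = {
    "n": [("ies", "y"), ("es", ""), ("s", "")],
    "v": [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")],
    "a": [("est", ""), ("er", "")],
}


def lemmatize(word, pos):
    """Return the lemma of word for the given part of speech, if one is found."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept a candidate only if it is a known base form.
            if candidate in LEXICON[pos]:
                return candidate
    return word  # unknown word: leave it unchanged


for word, pos in [("studies", "n"), ("searching", "v"), ("better", "a"), ("better", "v")]:
    print(f"{word} ({pos}) -> {lemmatize(word, pos)}")
```

The last two lines of the loop show why the part of speech has to be determined first: the same surface form gets a different lemma depending on the tag it is given.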

CODE: OUTPUT:

•Trigram

An n-gram is a contiguous sequence of n items from a given sample of text or speech. A trigram is therefore a sequence of three consecutive words; trigrams are produced by sliding a three-word window through the text one word at a time.
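A minimal sketch of trigram extraction with NLTK's ngrams helper. The sample sentence and the whitespace split are assumptions made for brevity; in the assignment the tokens would come from the tokenized “input.txt” text.

```python
from nltk.util import ngrams

tokens = "Google was founded in 1998".split()

# n=3 yields every window of three consecutive tokens.
trigrams = list(ngrams(tokens, 3))

for trigram in trigrams:
    print(trigram)
```

Five tokens give three trigrams; in general a sequence of k tokens yields k - 2 of them.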

CODE: OUTPUT:

•Named Entity Recognition

CODE: OUTPUT:

Video: ICP7

Conclusion: In this ICP I learned about Natural Language Processing and the Natural Language Toolkit Python module. I also learned to implement basic NLP techniques such as unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.

Downloading and installing NLTK on OS X was a bit hard for me, as I am new to this OS.