ICP7 - narhirep/Python-Deep-Learning GitHub Wiki
Welcome to the In Class Programming 7:
Description: This assignment covers Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK). This ICP focuses on text-processing techniques: n-grams (unigram, bigram, trigram), tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.
Objective: To learn and implement basic NLP techniques such as n-grams (unigram, bigram, trigram), tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.
Implementation:
1. Change the classifier in the given source code as follows:
a. Use an SVM and see how the accuracy changes.
b. Set the TfidfVectorizer parameter ngram_range=(1,2) to use bigrams and see how the accuracy changes.
c. Set the TfidfVectorizer argument stop_words='english' and see how the accuracy changes.
CODE:
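A minimal sketch of the three variations, not necessarily the code used in the assignment. The given source code and its dataset are not shown on this page, so the tiny in-memory corpus, its labels, and the variable names below are assumptions for illustration only; `SVC`, `TfidfVectorizer`, `ngram_range`, and `stop_words` are scikit-learn names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the dataset of the given source code.
train_docs = [
    "the rocket launched into orbit around the moon",
    "astronauts repaired the space station telescope",
    "the satellite entered orbit after launch",
    "the team scored a goal in the final match",
    "the striker scored twice and won the game",
    "fans cheered as the team won the league match",
]
train_labels = ["space", "space", "space", "sport", "sport", "sport"]
test_docs = ["the rocket reached orbit", "the team won the match"]
test_labels = ["space", "sport"]

# One vectorizer per variation from the task description.
vectorizers = {
    "a. SVM, default tf-idf": TfidfVectorizer(),
    "b. SVM, unigrams + bigrams": TfidfVectorizer(ngram_range=(1, 2)),
    "c. SVM, English stop words removed": TfidfVectorizer(stop_words="english"),
}

results = {}
for name, vectorizer in vectorizers.items():
    # The pipeline vectorizes the text and then fits a linear-kernel SVM.
    model = make_pipeline(vectorizer, SVC(kernel="linear"))
    model.fit(train_docs, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(test_docs))
    print(f"{name}: accuracy = {results[name]:.2f}")
```

On a corpus this small the accuracies say little; the point is only how the classifier and the vectorizer arguments are swapped.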
OUTPUT:
2. Extract the text of the web page https://en.wikipedia.org/wiki/Google using BeautifulSoup and save the result in a file “input.txt”. Apply the following to the “input.txt” file: tokenization, POS tagging, stemming, lemmatization, trigrams, and named entity recognition.
•Tokenization
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The tokens may be the sentences, words, numbers or punctuation marks.
CODE: The tokenization code imports the BeautifulSoup library to parse the HTML returned by an HTTP request to https://en.wikipedia.org/wiki/Google.
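A minimal sketch of this step, not necessarily the code used in the assignment. The helper names (`fetch_html`, `html_to_text`, `save_text`), the User-Agent string, and the inline sample HTML are assumptions. The sketch tokenizes with NLTK's regex-based `wordpunct_tokenize`, which needs no downloaded data; `word_tokenize` and `sent_tokenize` are the more common choices but require NLTK's tokenizer data to be downloaded first.

```python
import urllib.request

from bs4 import BeautifulSoup
from nltk.tokenize import wordpunct_tokenize

URL = "https://en.wikipedia.org/wiki/Google"


def fetch_html(url):
    """Download a page and return its HTML as a string."""
    # Assumption: a descriptive User-Agent is sent, since some sites
    # reject requests that arrive without one.
    request = urllib.request.Request(url, headers={"User-Agent": "icp7-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def html_to_text(html):
    """Strip the markup and return the visible text, separated by spaces."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)


def save_text(text, path="input.txt"):
    """Write the extracted text to a file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Small inline page so the example runs without network access.
sample_html = "<html><body><h1>Google</h1><p>Google LLC is a technology company.</p></body></html>"
text = html_to_text(sample_html)
tokens = wordpunct_tokenize(text)
print(tokens)
# ['Google', 'Google', 'LLC', 'is', 'a', 'technology', 'company', '.']
```

For the real page, `save_text(html_to_text(fetch_html(URL)))` would write “input.txt”, which can then be read back and tokenized the same way.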
OUTPUT:
•POS
The process of classifying the words in a text (corpus) into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. In English, the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection.
CODE:
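A minimal sketch of tagging, not necessarily the code used in the assignment. The usual choice is the pretrained `nltk.pos_tag`, which requires a tagger model downloaded through `nltk.download`. To stay runnable without any download, this sketch swaps in NLTK's rule-based `RegexpTagger`; the patterns and the sample tokens are assumptions and are far less accurate than a trained tagger.

```python
from nltk.tag import RegexpTagger

# Hypothetical suffix rules mapping word shapes to Penn Treebank tags.
# Rules are tried in order; the first matching pattern wins.
patterns = [
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*ly$", "RB"),                  # adverbs
    (r"^[A-Z].*$", "NNP"),             # capitalized words as proper nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r".*", "NN"),                     # default: singular noun
]
tagger = RegexpTagger(patterns)

tokens = ["Google", "indexed", "30", "pages", "quickly"]
tagged = tagger.tag(tokens)
print(tagged)
# [('Google', 'NNP'), ('indexed', 'VBD'), ('30', 'CD'), ('pages', 'NNS'), ('quickly', 'RB')]
```

Each token is paired with its tag, which is the same `(word, tag)` shape that `nltk.pos_tag` returns.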
OUTPUT:
•Stemming
Stemming is the process of reducing inflected words to their stem, base, or root form.
CODE:
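A minimal sketch of stemming, not necessarily the code used in the assignment. It assumes the Porter algorithm via NLTK's `PorterStemmer` (the assignment may have used a different stemmer), and the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; in the assignment these would be the tokens of input.txt.
words = ["running", "connected", "connection", "searches"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
# running -> run, connected -> connect, connection -> connect
```

A stem is produced by stripping suffixes with fixed rules, so it is not always a dictionary word.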
OUTPUT:
•Lemmatization
Lemmatization involves first determining the part of speech of a word and then applying the normalization rules for that part of speech.
CODE:
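A minimal sketch of POS-aware lemmatization, not necessarily the code used in the assignment. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names; `WordNetLemmatizer.lemmatize(word, pos=...)` is NLTK's API and needs the WordNet data (`nltk.download("wordnet")`) the first time a word is lemmatized.

```python
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet expects."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun


def lemmatize_tagged(tagged_tokens):
    """Lemmatize (word, tag) pairs, choosing the rules by part of speech.

    Requires the WordNet corpus to be installed via nltk.download("wordnet").
    """
    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

Once the WordNet data is available, `lemmatize_tagged` can be called on the `(word, tag)` pairs produced by the POS tagging step. The tag matters: the same word can lemmatize differently as a noun than as a verb.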
OUTPUT:
•Trigram
An n-gram is a contiguous sequence of n items from a given sample of text or speech. A trigram is therefore a sequence of three consecutive words, and the trigrams of a text are produced by sliding a three-word window over it one word at a time.
CODE:
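A minimal sketch of trigram extraction, not necessarily the code used in the assignment. It uses `ngrams` from `nltk.util` with n = 3; the token list is a made-up sample standing in for the tokens of “input.txt”.

```python
from nltk.util import ngrams

# Hypothetical sample tokens; in the assignment these would come from input.txt.
tokens = ["Google", "is", "an", "American", "technology", "company"]

# Slide a window of three tokens over the list, one token at a time.
trigrams = list(ngrams(tokens, 3))

for trigram in trigrams:
    print(trigram)
# ('Google', 'is', 'an')
# ('is', 'an', 'American')
# ('an', 'American', 'technology')
# ('American', 'technology', 'company')
```

A list of k tokens yields k - 2 trigrams, here 4 from 6 tokens.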
OUTPUT:
•Named Entity Recognition
CODE:
OUTPUT:
Video: ICP7
Conclusion: In this ICP I learned about Natural Language Processing and the Natural Language Toolkit (NLTK) Python module. I also learned to implement basic NLP techniques such as n-grams (unigram, bigram, trigram), tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.
Downloading and installing NLTK on OSX was a bit hard for me, as I am new to this OS.