ICP 7 - PavankumarManchala/Python-and-Deep-Learning-Programming-ICPs GitHub Wiki

Submitted by:

Pavankumar Manchala, 22

Technologies Used:

PyCharm

Tasks:

=> Text processing: unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.

Extracted the text of the following web URL using Beautiful Soup:

https://en.wikipedia.org/wiki/Google

The page is parsed with html.parser, and the text found at the URL is then written to a file named input.txt.
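The extraction step can be sketched roughly as follows. This is a minimal sketch, not the submitted code: the helper names `extract_text` and `save_page_text`, the `User-Agent` string, and the decision to drop `script` and `style` elements are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Google"


def extract_text(html):
    """Return the visible text of an HTML document, one non-empty line per row."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: script and style contents are not wanted in the text file.
    for tag in soup(["script", "style"]):
        tag.decompose()
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


def save_page_text(url=URL, path="input.txt"):
    """Download the page at `url` and write its text to `path` (hypothetical helper)."""
    # Some sites reject the default urllib User-Agent, so a custom one is sent.
    request = Request(url, headers={"User-Agent": "icp7-example/0.1"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(extract_text(html))
```

Calling `save_page_text()` with its defaults would fetch the Wikipedia page and write its text to input.txt; `extract_text` can also be tried on any HTML string without network access.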

Apply the following to the text and show the output: a. Tokenization, POS, Stemming, Lemmatization, Trigram, Named Entity Recognition

=> Tokenization

Output of word tokenization:

Output of sentence tokenization:
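To illustrate what the two kinds of tokenization produce, here is a rough standard-library sketch. The page does not say which library generated the outputs; NLTK's `word_tokenize` and `sent_tokenize` are a common choice and handle many cases (abbreviations, contractions) that these regular expressions do not. The function names and the sample sentence are assumptions for illustration only.

```python
import re


def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as "Inc." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    # A run of word characters is one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Google was founded in 1998. It is based in California."
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sample)
print(sentences)  # ['Google was founded in 1998.', 'It is based in California.']
print(words[:6])  # ['Google', 'was', 'founded', 'in', '1998', '.']
```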

=> POS (Parts of Speech):

Output of POS:

=> Lemmatization:

Output for Lemmatization:

=> Trigram

Output of Trigram:
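A trigram is every run of three consecutive tokens. The sketch below builds them in plain Python; it should give the same tuples as `nltk.util.ngrams(tokens, 3)`, though which function the submitted code used is not shown here. The helper name `trigrams` and the sample sentence are assumptions.

```python
def trigrams(tokens):
    # Zip the token list with itself shifted by one and by two positions,
    # so each tuple holds three consecutive tokens.
    return list(zip(tokens, tokens[1:], tokens[2:]))


tokens = "Google was founded in 1998".split()
for gram in trigrams(tokens):
    print(gram)
# ('Google', 'was', 'founded')
# ('was', 'founded', 'in')
# ('founded', 'in', '1998')
```

A list of n tokens yields n - 2 trigrams, and fewer than three tokens yields none.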

=> Named Entity Recognition

Output of Named Entity Recognition:

Change the classifier in the given code:

a. Use KNeighborsClassifier and see how accuracy changes

b. Change the TfidfVectorizer to use bigrams and see how the accuracy changes

TfidfVectorizer(ngram_range=(1,2))

c. Put argument stop_words='english' and see how accuracy changes

With bigrams the accuracy decreased: adding bigrams increases the number of features, which leads the model to overfit. With stop words removed the accuracy increased, since the model has fewer features to fit.

Among all the models, KNN achieved the highest accuracy. Output of Multinomial NB accuracy, and its accuracy when using bigrams and when stop words are added:
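The three variations can be compared side by side with a small scikit-learn sketch. The "given code" and its dataset are not shown on this page, so the toy corpus, the labels, and `n_neighbors=3` below are all assumptions; the accuracies it prints say nothing about the results reported here. What the sketch does show reliably is how each vectorizer setting changes the number of features.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the assignment's real dataset is not shown on this page.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update fixes a bug in the software",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
test_docs = ["the players scored in the match", "the software update for the phone"]
test_labels = ["sport", "tech"]

vectorizers = {
    "unigram": TfidfVectorizer(),
    "unigram+bigram": TfidfVectorizer(ngram_range=(1, 2)),
    "stop words removed": TfidfVectorizer(stop_words="english"),
}
classifiers = {
    "MultinomialNB": MultinomialNB(),
    # n_neighbors=3 is an arbitrary choice that suits the tiny training set.
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
}

feature_counts = {}
results = {}
for vec_name, vectorizer in vectorizers.items():
    # Vocabulary size is the number of features the classifier will see.
    feature_counts[vec_name] = len(clone(vectorizer).fit(train_docs).vocabulary_)
    for clf_name, classifier in classifiers.items():
        model = make_pipeline(clone(vectorizer), clone(classifier))
        model.fit(train_docs, train_labels)
        results[(vec_name, clf_name)] = model.score(test_docs, test_labels)

for vec_name, count in feature_counts.items():
    print(f"{vec_name}: {count} features")
for (vec_name, clf_name), accuracy in results.items():
    print(f"{clf_name} with {vec_name}: accuracy {accuracy:.2f}")
```

On this corpus the bigram setting produces the most features and the stop-word setting the fewest, which matches the direction of the feature-count argument above.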

Video Explanation: https://drive.google.com/open?id=1I7zzXMx_U55tBSLDZENmOcCXmiBtNWa4