ICP 7 - PavankumarManchala/Python-and-Deep-Learning-Programming-ICPs GitHub Wiki
Submitted by:
Pavankumar Manchala, 22
Technologies Used:
PyCharm
Tasks:
=> Text processing: unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.
- Extracted the text of the following web page using Beautiful Soup:
https://en.wikipedia.org/wiki/Google
- The page is parsed with html.parser and its text is saved to a text file.
On execution, the text content of the page at the URL is written to a file named input.txt.
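The extraction step described above could be sketched roughly as follows. This is a minimal illustration, not the code from the submission: the function names `extract_text` and `save_page_text`, the use of `urllib` for the download, and the User-Agent string are all assumptions made for the example.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style elements hold code, not page text, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def save_page_text(url: str, path: str = "input.txt") -> None:
    """Download `url` and write its extracted text to `path` (hypothetical helper)."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    request = Request(url, headers={"User-Agent": "icp7-example/0.1"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(extract_text(html))
```

Calling `save_page_text("https://en.wikipedia.org/wiki/Google")` would then write the page text to input.txt, assuming network access is available.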
- Apply the following to the text and show the output: Tokenization, POS tagging, Stemming, Lemmatization, Trigrams, Named Entity Recognition
=> Tokenization
Output of word tokenization:
Output of sentence tokenization:
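As a rough illustration of what the two tokenization steps do, the sketch below uses plain regular expressions. It is a simplified stand-in, not the method used in the submission; a library tokenizer such as NLTK's `word_tokenize` and `sent_tokenize` handles abbreviations and other edge cases that this version does not. The names `simple_word_tokenize` and `simple_sentence_tokenize` are hypothetical.

```python
import re


def simple_sentence_tokenize(text):
    """Split text after '.', '!' or '?' when an uppercase letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def simple_word_tokenize(text):
    """Return words (keeping internal apostrophes/hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)


sample = "Google is a technology company. It was founded by Larry Page and Sergey Brin."
for sentence in simple_sentence_tokenize(sample):
    print(simple_word_tokenize(sentence))
```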
=> POS (Parts of Speech)
Output of POS:
=> Lemmatization:
Output for Lemmatization:
=> Trigram
Output of Trigram:
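A trigram is simply every run of three consecutive tokens. The sketch below builds them in plain Python to show the idea; the helper name `ngrams` is chosen for this example, and the whitespace split stands in for a real tokenizer. NLTK offers an equivalent `nltk.ngrams` utility, which may be what the submission used.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so only full-length runs are kept.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "Google is an American technology company".split()
for trigram in ngrams(tokens, 3):
    print(trigram)
```

The same helper yields unigrams with `n=1` and bigrams with `n=2`.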
=> Named Entity Recognition
Output of Named Entity Recognition:
- Modify the given code as follows:
a. Change the classifier to KNeighborsClassifier and see how the accuracy changes
b. Change the TF-IDF vectorizer to use bigrams and see how the accuracy changes
TfidfVectorizer(ngram_range=(1,2))
c. Set the argument stop_words='english' and see how the accuracy changes
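The three variations above can be sketched as a small comparison loop. The given code and its dataset are not shown on this page, so the toy corpus, labels, and `n_neighbors=3` below are hypothetical and the accuracies it prints will not match the reported results. It is meant only to show how the vectorizer options change the number of features and how the classifier is swapped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data; the original exercise uses a different dataset.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "fans cheered as the team lifted the trophy",
    "the new phone has a faster processor",
    "the laptop ships with more memory and storage",
    "developers released a software update for the app",
    "the company announced a new search engine feature",
]
train_labels = ["sports"] * 4 + ["tech"] * 4
test_docs = ["the players celebrated the goal", "the app update improves the phone"]
test_labels = ["sports", "tech"]

vectorizer_options = {
    "unigram": {},
    "unigram+bigram": {"ngram_range": (1, 2)},
    "stop words removed": {"stop_words": "english"},
}
classifiers = {
    "MultinomialNB": MultinomialNB,
    "KNN": lambda: KNeighborsClassifier(n_neighbors=3),
}

results = {}  # (vectorizer name, classifier name) -> (feature count, accuracy)
for vec_name, options in vectorizer_options.items():
    vectorizer = TfidfVectorizer(**options)
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    for clf_name, make_classifier in classifiers.items():
        classifier = make_classifier()
        classifier.fit(X_train, train_labels)
        accuracy = classifier.score(X_test, test_labels)
        results[(vec_name, clf_name)] = (X_train.shape[1], accuracy)
        print(f"{vec_name:20s} {clf_name:14s} "
              f"features={X_train.shape[1]:3d} accuracy={accuracy:.2f}")
```

The feature counts show the effect directly: adding bigrams enlarges the vocabulary, while removing English stop words shrinks it.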
The accuracy with bigrams decreased: adding bigrams increases the number of features, which likely causes the model to overfit. With stop words removed, the accuracy increased, since the model has fewer features.
Among all the models, KNN achieved the highest accuracy.
Output of Multinomial NB accuracy, and its accuracy with bigrams and with stop words added:
Video Explanation: https://drive.google.com/open?id=1I7zzXMx_U55tBSLDZENmOcCXmiBtNWa4