ICP 7 - Saiaishwaryapuppala/CSEE5590_python_Icp GitHub Wiki

Python and Deep Learning: Special Topics

Rajeshwari Sai Aishwarya Puppala

Student ID: 16298162

Class ID: 35

In-class programming: 6

b. Change the TfidfVectorizer to use bigrams and see how the accuracy changes:
TfidfVectorizer(ngram_range=(1,2))

c. Set the argument stop_words='english' and see how the accuracy changes
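The effect of these two settings can be seen on a toy corpus before running them on real data. The sketch below is only an illustration: the two sentences in `docs` are made up, and `vocab_sizes` is a hypothetical name for the number of features each setting produces. Note that `ngram_range=(1, 2)` keeps the unigrams and adds bigrams on top of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only to compare the vectorizer settings.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizers = {
    "default": TfidfVectorizer(),
    "bigram": TfidfVectorizer(ngram_range=(1, 2)),
    "stop_words": TfidfVectorizer(stop_words="english"),
}

vocab_sizes = {}
for name, vectorizer in vectorizers.items():
    # fit_transform returns a sparse matrix of shape (n_docs, n_features).
    matrix = vectorizer.fit_transform(docs)
    vocab_sizes[name] = matrix.shape[1]
    print(name, matrix.shape)
```

Adding bigrams enlarges the feature space, while removing stop words shrinks it; either change can move the accuracy up or down depending on the classifier and the data.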

Extract the text of the following web URL using BeautifulSoup: https://en.wikipedia.org/wiki/Google
Save it in input.txt
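One possible way to do the extraction is sketched below. It is not the code used for this ICP: the helper names `html_to_text` and `save_page_text` are hypothetical, the page is fetched with the standard library's `urllib`, and the `User-Agent` header is an assumption added because some sites reject requests without one.

```python
import urllib.request

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements, whose contents are not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def save_page_text(url, path="input.txt"):
    """Download url, extract its text, and write it to path."""
    # Assumption: a browser-like User-Agent avoids being refused by the server.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html_to_text(html))


# Usage (needs network access):
# save_page_text("https://en.wikipedia.org/wiki/Google")
```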

4. Apply the following to “input.txt” and show the output:

a. Tokenization

b. POS

c. Stemming

d. Lemmatization

e. Trigram

f. Named Entity Recognition

Import word_tokenize, sent_tokenize, WordNetLemmatizer, PorterStemmer, and ne_chunk from nltk.
Before using them, download the necessary data packages with nltk.download.
Take the text that was extracted by the scraping code.
Perform word tokenization and sentence tokenization on it.
Pass the word tokens as input and find the trigrams by setting the number of grams to 3.
Pass the word tokens as input and perform lemmatization and stemming on them.
Find the POS tags and the named entities (NER) by passing the word tokens as input.

Tokens and Trigram
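The tokenization and trigram steps can be sketched as follows. This is a minimal illustration rather than the exact code behind the output: `tokenize` and `trigrams` are hypothetical helper names, and the tokenizers need the Punkt data, whose package name depends on the NLTK version.

```python
from nltk import sent_tokenize, word_tokenize
from nltk.util import ngrams

# The tokenizers need the Punkt models, e.g. nltk.download("punkt");
# newer NLTK releases may ask for "punkt_tab" instead (version dependent).


def tokenize(text):
    """Return (sentence tokens, word tokens) for the given text."""
    return sent_tokenize(text), word_tokenize(text)


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(ngrams(tokens, 3))


# Usage, once input.txt exists and the Punkt data is installed:
# with open("input.txt", encoding="utf-8") as handle:
#     sentences, words = tokenize(handle.read())
# print(trigrams(words)[:5])
```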

**Lemma, Stemming, POS and NER**
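A hedged sketch of these four steps with NLTK is given below. The helper names (`download_resources`, `stem_tokens`, `lemmatize_tokens`, `tag_and_chunk`) are hypothetical, and the data package names listed are the classic ones; newer NLTK releases may ask for differently named packages, in which case the error message states which one to download.

```python
import nltk
from nltk import ne_chunk, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer


def download_resources():
    """Fetch the NLTK data used below (names may differ by NLTK version)."""
    for name in ("wordnet", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words"):
        nltk.download(name, quiet=True)


def stem_tokens(tokens):
    """Reduce each token to its Porter stem (needs no extra data)."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


def lemmatize_tokens(tokens):
    """Map each token to its WordNet lemma (needs the wordnet data)."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]


def tag_and_chunk(tokens):
    """Return the POS-tagged tokens and the named-entity tree built from them."""
    tagged = pos_tag(tokens)
    return tagged, ne_chunk(tagged)


# Usage, after download_resources() has run once:
# tagged, tree = tag_and_chunk(["Google", "was", "founded", "in", "California"])
```

Note that `ne_chunk` expects POS-tagged tokens, so tagging has to run before named entity recognition.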

Import the required packages.
Fetch the train and test sets of the 20 newsgroups dataset from scikit-learn.
Convert the collection of raw documents in the train set to a matrix of TF-IDF features with TfidfVectorizer.
Do this three ways: with the default settings, with bigrams, and with stop_words='english'.
Initialize the KNN classifier.
After the conversion, fit the KNN classifier on the train data.
Predict the labels of the 20 newsgroups test data.
Calculate the accuracy score from the true test labels and the predicted values.
Repeat the process with the other two vectorizers and check which gives the better accuracy.
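The steps above can be sketched as one small function that is called once per vectorizer. This is an assumed layout, not the submitted code: `knn_accuracy` and `compare_on_newsgroups` are hypothetical names, and `n_neighbors=5` is simply scikit-learn's default, since the write-up does not state the value of k.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(vectorizer, train_texts, train_labels,
                 test_texts, test_labels, n_neighbors=5):
    """Fit KNN on TF-IDF features and return the accuracy on the test set."""
    # Learn the vocabulary on the train set only, then reuse it for the test set.
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(x_train, train_labels)
    return accuracy_score(test_labels, model.predict(x_test))


def compare_on_newsgroups():
    """Run the three vectorizer settings on 20 newsgroups (downloads the data)."""
    train = fetch_20newsgroups(subset="train", shuffle=True)
    test = fetch_20newsgroups(subset="test", shuffle=True)
    vectorizers = {
        "default": TfidfVectorizer(),
        "bigram": TfidfVectorizer(ngram_range=(1, 2)),
        "stop_words": TfidfVectorizer(stop_words="english"),
    }
    return {
        name: knn_accuracy(vec, train.data, train.target, test.data, test.target)
        for name, vec in vectorizers.items()
    }


# Usage (the first call downloads the dataset):
# print(compare_on_newsgroups())
```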