Wiki Report for ICP7 - NagaSurendraBethapudi/Python-ICP GitHub Wiki

Video Link : https://drive.google.com/file/d/1lMSK1AHQ7QXszPEivs2J5e4syGrt-6vO/view?usp=sharing

Question 1 :

Change the classifier in the given source code to SVM and see how the accuracy changes
Set the tfidf vectorizer parameter to use bigrams and see how the accuracy changes: TfidfVectorizer(ngram_range=(1,2))
Set the tfidf vectorizer argument stop_words='english' and see how the accuracy changes

Answer :

Changed the classifier to SVM using the following code:

from sklearn import svm
# degree and gamma have no effect with a linear kernel
svc = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svc.fit(X_train_tfidf, twenty_train.target)

Accuracy increased to 92%. A likely reason is that SVM looks for the separating boundary with the largest margin between the classes, which tends to reduce classification error.
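As a rough illustration of the classifier swap, the sketch below trains both MultinomialNB and a linear SVC on the same tf-idf features. The six-sentence corpus, its labels, and the test sentences are made up for this example (the assignment uses the 20 newsgroups data), so it only shows the mechanics and says nothing about the accuracies reported in this report.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the 20 newsgroups training set.
train_docs = [
    "the rocket launch reached orbit",
    "nasa sent a new satellite into orbit",
    "astronauts live aboard the space station",
    "the team won the hockey game",
    "the goalie stopped every shot in the game",
    "players scored a late goal in the hockey final",
]
train_labels = ["space", "space", "space", "sports", "sports", "sports"]

# Both classifiers are trained on the same tf-idf matrix.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_docs)

nb = MultinomialNB().fit(X_train_tfidf, train_labels)
svc = svm.SVC(C=1.0, kernel="linear").fit(X_train_tfidf, train_labels)

# Unseen sentences must be transformed with the already fitted vectorizer.
test_docs = ["satellite launch into orbit", "hockey players scored a goal"]
X_test_tfidf = vectorizer.transform(test_docs)

nb_pred = list(nb.predict(X_test_tfidf))
svc_pred = list(svc.predict(X_test_tfidf))
print("MultinomialNB:", nb_pred)
print("SVC:", svc_pred)
```

Only the estimator changes between the two runs; the vectorizer and the fit/predict calls stay the same, which is what makes the accuracy comparison fair.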

Changed the tf-idf vectorizer to use unigrams and bigrams by passing ngram_range=(1,2) to TfidfVectorizer, as given in the question.

Accuracy was almost the same as with the default MultinomialNB classifier.
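To see what the bigram setting changes, the sketch below fits two vectorizers on a two-sentence corpus invented for this example and compares their vocabularies. With ngram_range=(1,2) the vectorizer keeps every single word and also adds each pair of adjacent words as a feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus, used only to inspect the features.
docs = ["google search engine", "google cloud platform"]

unigram_vectorizer = TfidfVectorizer().fit(docs)
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# vocabulary_ maps each feature (word or word pair) to a column index.
unigram_features = sorted(unigram_vectorizer.vocabulary_)
bigram_features = sorted(bigram_vectorizer.vocabulary_)

print(len(unigram_features), unigram_features)
print(len(bigram_features), bigram_features)
```

The larger feature set gives the classifier more columns to weigh, which can help or hurt depending on how much training data there is.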

Changed the tf-idf vectorizer to remove English stop words by passing stop_words='english' to TfidfVectorizer.

Accuracy increased to 81%.
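The effect of the stop-word setting can be checked the same way. The sketch below uses two made-up sentences and compares the vocabulary with and without stop_words='english'; common function words such as "the" and "is" are dropped from the feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, used only to inspect the features.
docs = ["the engine is in the cloud", "google is a search engine"]

plain_vectorizer = TfidfVectorizer().fit(docs)
filtered_vectorizer = TfidfVectorizer(stop_words="english").fit(docs)

plain_features = sorted(plain_vectorizer.vocabulary_)
filtered_features = sorted(filtered_vectorizer.vocabulary_)

print(plain_features)
print(filtered_features)
```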

Observations :

Accuracy of the model is higher with the SVM classifier than with the MultinomialNB classifier

Accuracy using MultinomialNB Classifier : 0.77
Accuracy using SVM Classifier : 0.92

Accuracy of the model is slightly lower with the MultinomialNB classifier when the tf-idf vectorizer uses bigrams than when it uses unigrams only

Accuracy with parameter set to bigram using MultinomialNB Classifier : 0.76
Accuracy of model without parameter changes using MultinomialNB Classifier : 0.77

Accuracy of the model is higher with the MultinomialNB classifier when the tf-idf vectorizer removes English stop words than when it uses bigrams

Accuracy with parameter set to stopwords using MultinomialNB Classifier : 0.81
Accuracy with parameter set to bigram using MultinomialNB Classifier : 0.76

Accuracy of the model with the SVC classifier is the same whether the tf-idf vectorizer uses bigrams, stop words, or the default settings

Accuracy with parameter set to bigram using SVC Classifier : 0.92
Accuracy with parameter set to stopwords using SVC Classifier : 0.92
Accuracy of model without parameter changes using SVC Classifier : 0.92

Question 2 : Extract the text from the web URL https://en.wikipedia.org/wiki/Google using BeautifulSoup and save the result in a file “input.txt”. Apply the following on the “input.txt” file:

Tokenization
POS
Stemming
Lemmatization
Trigram
Named Entity Recognition

Answer :
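The extraction step can be sketched as follows. To keep the example runnable offline, it parses a small inline HTML string that is invented for this sketch; in the assignment the HTML would come from fetching the Wikipedia URL first. The temporary directory is also an assumption, used here so the example does not overwrite an existing input.txt.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
html = """
<html>
  <head><title>Google - Wikipedia</title><script>var x = 1;</script></head>
  <body><h1>Google</h1><p>Google is a technology company.</p></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop script and style elements so only readable text remains.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)

# Save the extracted text as input.txt (inside a temporary directory here).
out_path = Path(tempfile.mkdtemp()) / "input.txt"
out_path.write_text(text, encoding="utf-8")
print(out_path.read_text(encoding="utf-8"))
```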

Tokenization : Tokenization is the process of breaking a stream of text into smaller units such as sentences and words.

Sentence tokenization - breaking text into sentences

Word tokenization - breaking sentences into words
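A minimal sketch of both levels of tokenization is shown below. NLTK's sent_tokenize and word_tokenize are the usual calls, but they need the downloaded punkt model, so this example swaps in two data-free NLTK tokenizers: a RegexpTokenizer that splits after sentence-ending punctuation, and wordpunct_tokenize for words. The sample text and the splitting pattern are assumptions made for the example.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

# Hypothetical sample text standing in for the contents of input.txt.
text = "Google was founded in 1998. It is based in California."

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentence_splitter = RegexpTokenizer(r"(?<=[.!?])\s+", gaps=True)
sentences = sentence_splitter.tokenize(text)

# Word tokenization: runs of word characters and runs of punctuation.
words = wordpunct_tokenize(sentences[0])

print(sentences)
print(words)
```

The regex splitter is naive about abbreviations such as "Inc." and would break after them, which is the kind of case the trained punkt model handles better.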

POS : Part-of-speech (POS) tagging is the process of classifying the words in a text by part of speech and labeling them accordingly.
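The shape of POS tagging output, a list of (word, tag) pairs, can be shown with a small sketch. nltk.pos_tag is the usual call, but it needs a downloaded tagger model, so this example uses NLTK's RegexpTagger with a handful of hand-written rules instead. The rules and the token list are hypothetical and far too crude for real text.

```python
from nltk.tag import RegexpTagger

# Hypothetical rules, checked in order; the first matching pattern wins.
patterns = [
    (r"^[0-9]+$", "CD"),       # cardinal number
    (r"^(in|by|of)$", "IN"),   # preposition
    (r"^was$", "VBD"),         # past-tense verb
    (r".*ed$", "VBN"),         # past participle
    (r"^[A-Z].*$", "NNP"),     # capitalized word treated as proper noun
    (r".*", "NN"),             # fallback: common noun
]
tagger = RegexpTagger(patterns)

tokens = ["Google", "was", "founded", "in", "1998"]
tagged = tagger.tag(tokens)
print(tagged)
```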

Stemming : Stemming is the process of reducing words to a root form, usually by stripping suffixes. The result is not always a dictionary word.
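A short sketch with NLTK's PorterStemmer illustrates this. The choice of the Porter algorithm and the word list are assumptions for the example; the report does not say which stemmer was used.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words.
words = ["running", "studies", "founded"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that "studies" becomes "studi", which shows that a stem is a truncated form rather than a real word.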

Lemmatization : Lemmatization uses the part of speech of a word to reduce it to its dictionary base form (lemma).

Trigram : A trigram is a sequence of three consecutive words. Trigram extraction groups the tokens three at a time.
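As a small illustration, NLTK's ngrams helper slides a window of three over a token list. The tokens below are a hypothetical sample.

```python
from nltk.util import ngrams

# Hypothetical token list, e.g. the output of word tokenization.
tokens = ["Google", "was", "founded", "in", "1998"]

# Each trigram is a tuple of three consecutive tokens.
trigrams = list(ngrams(tokens, 3))
print(trigrams)
```

A list of n tokens yields n - 2 trigrams, so the five tokens here give three.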

Named Entity Recognition : It is the process of locating named entities in a text and classifying them into pre-defined categories such as person, organization, and location.
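NLTK's ne_chunk is the usual entry point for this step, but it depends on several downloaded models. As a data-free approximation, the sketch below uses RegexpParser to group runs of proper nouns into candidate entities. This only locates entity spans and does not assign categories such as person or organization. The tagged sentence and the "ENTITY" label are invented for the example.

```python
from nltk import RegexpParser

# Hypothetical, hand-tagged tokens (normally the output of a POS tagger).
tagged = [
    ("Google", "NNP"), ("was", "VBD"), ("founded", "VBN"), ("by", "IN"),
    ("Larry", "NNP"), ("Page", "NNP"), ("in", "IN"), ("California", "NNP"),
]

# Chunk rule: one or more consecutive proper nouns form one candidate entity.
chunker = RegexpParser("ENTITY: {<NNP>+}")
tree = chunker.parse(tagged)

# Collect the words of every ENTITY subtree.
entities = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "ENTITY"
]
print(entities)
```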

Conclusion :

Learned about NLTK techniques

Challenges :

Everything looks good