Wiki Report for ICP7 - NagaSurendraBethapudi/Python-ICP GitHub Wiki
Video Link : https://drive.google.com/file/d/1lMSK1AHQ7QXszPEivs2J5e4syGrt-6vO/view?usp=sharing
Question 1 :
Change the classifier in the given source code to
- SVM and see how accuracy changes
- Set the tfidf vectorizer parameter to use bigrams and see how the accuracy changes: TfidfVectorizer(ngram_range=(1,2))
- Set the tfidf vectorizer argument stop_words='english' and see how the accuracy changes
Answer :
- Changed the classifier to SVM by using the following code:
from sklearn import svm
# degree and gamma are ignored by the linear kernel
svc = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svc.fit(X_train_tfidf, twenty_train.target)
Accuracy increased to 92%, likely because SVM looks for the separating boundary with the largest margin between the classes, which tends to reduce classification error.
- Changed the tf-idf vectorizer to use bigrams by setting TfidfVectorizer(ngram_range=(1,2)):
Accuracy was almost the same as with the default vectorizer and the MultinomialNB classifier (0.76 vs 0.77).
- Changed the tf-idf vectorizer to remove English stop words by setting stop_words='english':
Accuracy increased to 81%.
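To make the effect of the two vectorizer settings concrete, the following minimal sketch fits three `TfidfVectorizer` configurations and compares their vocabularies. The two sentences in `docs` are invented for illustration and are not the data used in this report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only (not the report's dataset).
docs = [
    "the graphics card renders the image",
    "the hockey team won the game",
]

# One vectorizer per configuration discussed in the answer.
configs = {
    "default": TfidfVectorizer(),
    "bigram": TfidfVectorizer(ngram_range=(1, 2)),
    "stop_words": TfidfVectorizer(stop_words="english"),
}

vocabularies = {}
for name, vectorizer in configs.items():
    vectorizer.fit(docs)
    # vocabulary_ maps each learned term to its column index.
    vocabularies[name] = sorted(vectorizer.vocabulary_)
    print(name, "->", len(vocabularies[name]), "features")
```

With `ngram_range=(1, 2)` the vocabulary contains word pairs such as "graphics card" in addition to single words, so the feature space grows. With `stop_words='english'` common words such as "the" are dropped before weighting.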
Observations :
Accuracy is higher with the SVM classifier than with the MultinomialNB classifier:
- Accuracy using MultinomialNB classifier : 0.77
- Accuracy using SVM classifier : 0.92
With the MultinomialNB classifier, accuracy is slightly lower with bigrams than with the default unigrams:
- Accuracy with bigrams using MultinomialNB classifier : 0.76
- Accuracy without parameter changes using MultinomialNB classifier : 0.77
With the MultinomialNB classifier, accuracy is higher with stop-word removal than with bigrams:
- Accuracy with stop-word removal using MultinomialNB classifier : 0.81
- Accuracy with bigrams using MultinomialNB classifier : 0.76
With the SVC classifier, accuracy is the same with bigrams, with stop-word removal, and with the default settings:
- Accuracy with bigrams using SVC classifier : 0.92
- Accuracy with stop-word removal using SVC classifier : 0.92
- Accuracy without parameter changes using SVC classifier : 0.92
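A comparison like the one above could be run with a small helper such as the following. This is a sketch, not the code used for this report: the function names `evaluate` and `compare` are hypothetical, and the use of `fetch_20newsgroups` in the closing comment is an assumption based on the `twenty_train` variable name.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate(vectorizer, classifier, train_texts, train_labels, test_texts, test_labels):
    """Fit vectorizer + classifier on the training split and return test accuracy."""
    model = make_pipeline(vectorizer, classifier)
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))


def compare(train_texts, train_labels, test_texts, test_labels):
    """Return {(classifier name, vectorizer name): accuracy} for all six combinations."""
    vectorizer_kwargs = {
        "default": {},
        "bigram": {"ngram_range": (1, 2)},
        "stop_words": {"stop_words": "english"},
    }
    classifier_factories = {
        "MultinomialNB": MultinomialNB,
        "SVC": lambda: svm.SVC(C=1.0, kernel="linear"),
    }
    results = {}
    for clf_name, make_classifier in classifier_factories.items():
        for vec_name, kwargs in vectorizer_kwargs.items():
            # A fresh vectorizer and classifier are built for every run.
            results[(clf_name, vec_name)] = evaluate(
                TfidfVectorizer(**kwargs), make_classifier(),
                train_texts, train_labels, test_texts, test_labels,
            )
    return results


# Assumed usage with the 20 Newsgroups data (downloaded on first call):
# from sklearn.datasets import fetch_20newsgroups
# train = fetch_20newsgroups(subset="train", shuffle=True)
# test = fetch_20newsgroups(subset="test", shuffle=True)
# results = compare(train.data, train.target, test.data, test.target)
```

Using a pipeline keeps the vectorizer fitted on the training split only, so the test split is transformed with the same vocabulary rather than refitted.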
Question 2 : Extract the text of the web page https://en.wikipedia.org/wiki/Google using BeautifulSoup and save the result in a file “input.txt”. Apply the following on the “input.txt” file:
- Tokenization
- POS
- Stemming
- Lemmatization
- Trigram
- Named Entity Recognition
Answer :
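Before any of the NLP steps can run, the page text has to be extracted and written to “input.txt”. A minimal sketch of that step follows, assuming the page HTML has already been downloaded. The helper names `html_to_text` and `save_text` are hypothetical, and dropping `script` and `style` elements is one possible choice rather than something this report specifies.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the text of an HTML document, one non-empty line per text run."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements; their contents are not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


def save_text(text, path="input.txt"):
    """Write the extracted text to a UTF-8 file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Assumed usage (needs network access; the site may require a User-Agent header):
# from urllib.request import urlopen
# html = urlopen("https://en.wikipedia.org/wiki/Google").read()
# save_text(html_to_text(html))
```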
- Tokenization : Tokenization is the process of breaking a stream of text into smaller units such as sentences or words.
Sentence tokenization - breaking text into sentences
Word tokenization - breaking sentences into words
- POS : POS (part-of-speech) tagging is the process of classifying the words in a text by part of speech and labeling them accordingly.
- Stemming : Stemming reduces a word to its root (stem) form, usually by stripping suffixes; the result is not always a dictionary word.
- Lemmatization : Lemmatization takes the part of speech into account and reduces a word to its dictionary base form (lemma).
- Trigram : A trigram is a sequence of three consecutive words; trigrams are formed by sliding a three-word window over the text.
- Named Entity Recognition : It is the process of locating named entities in a text and classifying them into pre-defined categories such as person, organization, and location.
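The steps listed above could be sketched with NLTK as follows. The function names are hypothetical and this is not the report's original code. Stemming and trigram generation need no extra data, while the other functions rely on NLTK resources that must be downloaded first with `nltk.download(...)`; the exact resource names vary by NLTK version.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams


def tokenize(text):
    """Return (sentences, words). Needs the 'punkt' tokenizer data."""
    return nltk.sent_tokenize(text), nltk.word_tokenize(text)


def pos_tags(tokens):
    """Return (word, tag) pairs. Needs the perceptron tagger data."""
    return nltk.pos_tag(tokens)


def stem_tokens(tokens):
    """Reduce each token to its Porter stem (no extra data needed)."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


def lemmatize_tokens(tokens):
    """Reduce each token to its WordNet lemma. Needs the 'wordnet' corpus."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]


def trigrams_of(tokens):
    """Return every run of three consecutive tokens (no extra data needed)."""
    return list(ngrams(tokens, 3))


def named_entities(tagged_tokens):
    """Chunk POS-tagged tokens into named entities. Needs the NE chunker data."""
    return nltk.ne_chunk(tagged_tokens)


# Assumed usage once the NLTK data is installed:
# text = open("input.txt", encoding="utf-8").read()
# sentences, words = tokenize(text)
# tree = named_entities(pos_tags(words))
```

Note that `lemmatize` treats every token as a noun unless a part of speech is passed, so feeding it the tags from `pos_tags` would give results closer to the definition above.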
Conclusion :
Learned about NLTK techniques
Challenges :
Everything looks good