Wiki Report for ICP7 - NagaSurendraBethapudi/Python-ICP GitHub Wiki

Video Link : https://drive.google.com/file/d/1lMSK1AHQ7QXszPEivs2J5e4syGrt-6vO/view?usp=sharing


Question 1 :

Change the classifier in the given source code to

  1. SVM and see how accuracy changes
  2. Set the tfidf vectorizer parameter to use bigrams, TfidfVectorizer(ngram_range=(1,2)), and see how the accuracy changes
  3. Set the tfidf vectorizer argument stop_words='english' and see how the accuracy changes

Answer :

  1. Changed the classifier to SVM by using the following code:
  • from sklearn import svm
  • svc = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
  • svc.fit(X_train_tfidf, twenty_train.target)

Accuracy increased to 92% (from 77% with MultinomialNB). A linear SVM looks for the separating hyperplane with the largest margin between the classes, which likely explains the lower error on these tf-idf features.
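
The fit and predict steps can be sketched end to end as below. This is only a minimal illustration: the four training sentences, the labels, and the names train_docs, test_docs, and test_labels are made-up stand-ins for the 20 Newsgroups data used in the exercise, so the accuracy it prints says nothing about the 92% reported above.

```python
from sklearn import metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the training data (an assumption for illustration only).
train_docs = [
    "the team won the hockey game last night",
    "the goalie made a great save in the game",
    "the rocket launch reached orbit today",
    "nasa plans a new mission to the moon",
]
train_labels = [0, 0, 1, 1]  # 0 = sports, 1 = space

# Learn the tf-idf vocabulary and weights from the training documents.
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train_docs)

# Same classifier settings as in the report.
svc = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svc.fit(X_train_tfidf, train_labels)

# Test documents must be transformed with the already fitted vectorizer.
test_docs = ["the hockey team won the game", "the rocket reached the moon"]
test_labels = [0, 1]
X_test_tfidf = tfidf_vect.transform(test_docs)

predicted = svc.predict(X_test_tfidf)
accuracy = metrics.accuracy_score(test_labels, predicted)
print("accuracy:", accuracy)
```

Note that degree and gamma have no effect when kernel='linear'; they are kept here only to match the call used in the report.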

  2. Changed the tf-idf vectorizer to use unigrams and bigrams with TfidfVectorizer(ngram_range=(1,2)).

Accuracy was almost the same as with the unigram MultinomialNB classifier (0.76 versus 0.77).

  3. Changed the tf-idf vectorizer to remove English stop words with TfidfVectorizer(stop_words='english').

Accuracy increased to 81% with the MultinomialNB classifier.
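
To see what the bigram and stop-word settings actually change, it helps to look at the vocabulary each vectorizer builds. The sketch below uses a single made-up sentence (an assumption, not the exercise data) and compares the default, bigram, and stop-word configurations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One illustrative sentence; the real exercise fits on the training corpus.
docs = ["the search engine indexes the web"]

default_vect = TfidfVectorizer().fit(docs)
bigram_vect = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
stop_vect = TfidfVectorizer(stop_words='english').fit(docs)

# vocabulary_ maps each term to its column index in the tf-idf matrix.
default_terms = sorted(default_vect.vocabulary_)
bigram_terms = sorted(bigram_vect.vocabulary_)
stop_terms = sorted(stop_vect.vocabulary_)

print(default_terms)  # single words only
print(bigram_terms)   # single words plus adjacent word pairs
print(stop_terms)     # single words with common English words removed
```

Bigrams enlarge the feature space, while stop-word removal shrinks it by dropping very common words such as "the".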

Observations :

The SVM classifier is more accurate than the MultinomialNB classifier:
  1. Accuracy using MultinomialNB classifier : 0.77
  2. Accuracy using SVM classifier : 0.92
With the MultinomialNB classifier, bigrams give slightly lower accuracy than the default unigrams:
  1. Accuracy with bigrams using MultinomialNB classifier : 0.76
  2. Accuracy with default parameters using MultinomialNB classifier : 0.77
With the MultinomialNB classifier, removing stop words gives higher accuracy than bigrams:
  1. Accuracy with stop words removed using MultinomialNB classifier : 0.81
  2. Accuracy with bigrams using MultinomialNB classifier : 0.76
With the SVC classifier, accuracy is the same with bigrams, with stop words removed, and with default parameters:
  1. Accuracy with bigrams using SVC classifier : 0.92
  2. Accuracy with stop words removed using SVC classifier : 0.92
  3. Accuracy with default parameters using SVC classifier : 0.92

Question 2 : Extract the text of the web page https://en.wikipedia.org/wiki/Google using BeautifulSoup and save the result in a file “input.txt”. Apply the following to the “input.txt” file:

  1. Tokenization
  2. POS
  3. Stemming
  4. Lemmatization
  5. Trigram
  6. Named Entity Recognition

Answer :
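
The extraction step can be sketched as below. To keep the example self-contained it parses a small inline HTML string instead of fetching the Wikipedia page; sample_html and the helper name extract_text are assumptions made for illustration. In the actual exercise the HTML would come from downloading the URL first.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the visible text of an HTML page, one text block per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style elements hold code rather than prose, so drop them.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


# Stand-in for the downloaded page (hypothetical content).
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Google</h1><p>Google is a technology company.</p></body></html>"
)

text = extract_text(sample_html)

# Save the extracted text so the later NLTK steps can read it back.
with open("input.txt", "w", encoding="utf-8") as f:
    f.write(text)
```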

  1. Tokenization : Tokenization is the process of breaking a stream of text into smaller units such as sentences and words.

Sentence tokenization - breaking text into sentences

Word tokenization - breaking sentences into words
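
A minimal sketch of both levels of tokenization follows. It uses an untrained PunktSentenceTokenizer and wordpunct_tokenize because neither needs downloaded models; nltk.sent_tokenize and nltk.word_tokenize are the more usual choices once the Punkt data is installed. The sample text is made up.

```python
from nltk.tokenize import PunktSentenceTokenizer, wordpunct_tokenize

text = "Google is a technology company. It builds search tools."

# Sentence tokenization: split the text at sentence boundaries.
sentences = PunktSentenceTokenizer().tokenize(text)

# Word tokenization: split each sentence into words and punctuation marks.
words = [wordpunct_tokenize(sentence) for sentence in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, "->", tokens)
```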

  2. POS : POS tagging is the process of classifying the words in a text by part of speech (noun, verb, adjective, and so on) and labeling them accordingly.
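
Tagging can be sketched with nltk.pos_tag, as below. The helper name pos_tags is an assumption, and the function only works after the tagger model has been installed with nltk.download; the exact resource name differs between NLTK releases, so it is not hard-coded here.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize


def pos_tags(text):
    """Return (token, tag) pairs using NLTK's default part-of-speech tagger."""
    tokens = wordpunct_tokenize(text)
    # Requires the tagger model to be installed via nltk.download.
    return nltk.pos_tag(tokens)
```

Calling pos_tags on a sentence returns a list of pairs such as (word, tag), where the tags follow the Penn Treebank tag set (for example NNP for a proper noun).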

  3. Stemming : Stemming reduces a word to its stem by stripping suffixes; the result need not be a dictionary word.
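
As a small illustration, the Porter stemmer below is applied to a few made-up words. The choice of PorterStemmer is an assumption; NLTK also ships other stemmers such as LancasterStemmer and SnowballStemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "studies", "searched"]
# Each word is cut down to its stem by rule-based suffix stripping.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The stem of "studies" comes out as "studi", which shows that a stem is not always a real word.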

  4. Lemmatization : Lemmatization uses the word's part of speech and a vocabulary to reduce the word to its dictionary form (lemma).
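
A sketch with NLTK's WordNetLemmatizer is shown below. The helper name lemmatize is an assumption, and the WordNet corpus has to be installed through nltk.download before the function can be called.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize(word, pos="n"):
    """Return the WordNet lemma of a word for the given part of speech.

    pos is "n" for noun, "v" for verb, "a" for adjective, "r" for adverb.
    """
    return WordNetLemmatizer().lemmatize(word, pos=pos)
```

With the WordNet data installed, lemmatize("studies") returns "study" and lemmatize("running", pos="v") returns "run", so unlike stemming the result is a dictionary word.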

  5. Trigram : A trigram is a group of three consecutive words; trigrams are produced by sliding a three-word window over the text.
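
The sliding window can be sketched with nltk.util.ngrams, which needs no downloaded data. The token list is a made-up example.

```python
from nltk.util import ngrams

tokens = ["Google", "is", "a", "technology", "company"]

# ngrams yields tuples of n consecutive tokens; n = 3 gives trigrams.
trigrams = list(ngrams(tokens, 3))

for trigram in trigrams:
    print(trigram)
```

A list of five tokens yields three trigrams, since the window can start at the first, second, or third token.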

  6. Named Entity Recognition : It is the process of locating named entities in text and classifying them into predefined categories such as person, organization, and location.
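
One way to sketch this with NLTK is to tag the tokens and pass them to nltk.ne_chunk, then collect the labeled subtrees. The helper name named_entities is an assumption, and the function needs the tagger and named-entity chunker models installed through nltk.download (resource names vary between NLTK releases).

```python
import nltk
from nltk.tokenize import wordpunct_tokenize


def named_entities(text):
    """Return (entity text, category) pairs found by NLTK's default chunker."""
    tagged = nltk.pos_tag(wordpunct_tokenize(text))
    tree = nltk.ne_chunk(tagged)
    entities = []
    for node in tree:
        # Named entities appear as subtrees labeled with their category;
        # ordinary tokens appear as plain (token, tag) tuples.
        if isinstance(node, nltk.Tree):
            name = " ".join(token for token, _tag in node.leaves())
            entities.append((name, node.label()))
    return entities
```

The category labels are those of the chunker, for example PERSON, ORGANIZATION, and GPE.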

Conclusion :

Learned about NLTK techniques

Challenges :

Everything looks good