python icp7 - koushikskr/python GitHub Wiki

Introduction:

This ICP covers basic NLP techniques such as unigrams, bigrams, trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.

Question 1:

a) Use SVM and see how the accuracy changes

Procedure: Created train and test data from the fetch_20newsgroups dataset, imported from sklearn.datasets. Created an SVM model. Fit the model on the training data. Predicted labels for the test data. Calculated the accuracy score.
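The steps in part a) can be sketched as follows. This is a minimal illustration, not the original lab code: to stay runnable offline it uses a tiny made-up corpus where the lab used fetch_20newsgroups(subset='train') and fetch_20newsgroups(subset='test'). LinearSVC and the TfidfVectorizer defaults are assumptions, since the write-up only says "SVM".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the fetch_20newsgroups train/test splits.
train_texts = [
    "the rocket launch reached orbit",
    "nasa plans a new space mission",
    "the team won the hockey game",
    "the goalie stopped every shot on the ice",
]
train_labels = [0, 0, 1, 1]  # 0 = space, 1 = hockey
test_texts = ["a space shuttle orbit", "hockey players on the ice"]
test_labels = [0, 1]


def svm_accuracy(train_x, train_y, test_x, test_y):
    """Vectorize the text, fit a linear SVM, and return test accuracy."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_x)  # learn vocabulary on train only
    x_test = vectorizer.transform(test_x)        # reuse the same vocabulary
    model = LinearSVC()                          # assumed SVM variant
    model.fit(x_train, train_y)
    predicted = model.predict(x_test)
    return accuracy_score(test_y, predicted)


score = svm_accuracy(train_texts, train_labels, test_texts, test_labels)
print("SVM accuracy:", score)
```

With the real dataset, the same function could be called with the .data and .target attributes of the two fetch_20newsgroups splits.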

b) Change the TfidfVectorizer to use bigrams

Procedure: Vectorized the training data with ngram_range=(1, 2). Created a MultinomialNB model. Fit the model on the training data. Predicted labels for the test data. Calculated the accuracy score.

c) Set the argument stop_words='english'

Procedure: Vectorized the training data with English stop words removed. Created a MultinomialNB model. Fit the model on the training data. Predicted labels for the test data. Calculated the accuracy score.
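Parts b) and c) differ only in how the TfidfVectorizer is configured, which the sketch below tries to make visible. The toy corpus is hypothetical and stands in for the fetch_20newsgroups splits, and the helper name nb_accuracy is introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the fetch_20newsgroups train/test splits.
train_texts = [
    "the rocket launch reached orbit",
    "nasa plans a new space mission",
    "the team won the hockey game",
    "the goalie stopped every shot on the ice",
]
train_labels = [0, 0, 1, 1]  # 0 = space, 1 = hockey
test_texts = ["a space shuttle orbit", "hockey players on the ice"]
test_labels = [0, 1]


def nb_accuracy(vectorizer):
    """Fit MultinomialNB on features from the given vectorizer; return accuracy."""
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = MultinomialNB()
    model.fit(x_train, train_labels)
    return accuracy_score(test_labels, model.predict(x_test))


# b) unigrams plus bigrams
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bigram_score = nb_accuracy(bigram_vectorizer)

# c) drop common English stop words such as "the"
stopword_vectorizer = TfidfVectorizer(stop_words="english")
stopword_score = nb_accuracy(stopword_vectorizer)

print("bigram accuracy:", bigram_score)
print("stop-word accuracy:", stopword_score)
print("has bigram feature:", "space mission" in bigram_vectorizer.vocabulary_)
print("'the' kept after stop-word removal:", "the" in stopword_vectorizer.vocabulary_)
```

Inspecting vocabulary_ after fitting is a quick way to confirm that the bigram features were added and that the stop words were dropped.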

Question 2:

Extract the Google wiki page content to a text file using BeautifulSoup

Procedure: Requested the Google wiki URL with the urllib.request.urlopen() method. Parsed the returned page source with the BeautifulSoup library using the HTML parser.
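A minimal sketch of this step is shown below. The URL is an assumption (the write-up only says "the google wiki url"), and the helper names fetch_html and html_to_text are hypothetical. The demo at the end parses an inline HTML string so the example runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumed URL; the original write-up does not give the exact address.
WIKI_URL = "https://en.wikipedia.org/wiki/Google"


def fetch_html(url):
    """Download the raw page source (needs network access; not called below)."""
    with urlopen(url) as response:
        return response.read()


def html_to_text(html):
    """Parse HTML with the built-in parser and return its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Offline demo on a small hypothetical page.
sample_html = (
    "<html><head><title>Google</title><style>p{}</style></head>"
    "<body><h1>Google</h1><p>Google is a <b>technology</b> company.</p></body></html>"
)
sample_text = html_to_text(sample_html)
print(sample_text)
```

For the real page, html_to_text(fetch_html(WIKI_URL)) would return the text that the next question writes to a file.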

Question 3:

Save the wiki page content into a text file

Created a file called 'input.txt' and wrote the wiki page content obtained in Question 2 to it.
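Writing the extracted text to disk could look like the sketch below. The helper name save_text and the sample string are hypothetical. The demo writes into a temporary directory to avoid side effects, whereas the lab wrote 'input.txt' in the working directory.

```python
import tempfile
from pathlib import Path


def save_text(text, path="input.txt"):
    """Write text to the given path as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical stand-in for the page text extracted in Question 2.
page_text = "Google is a technology company."

out_dir = Path(tempfile.mkdtemp())
saved_path = save_text(page_text, out_dir / "input.txt")
print(saved_path.read_text(encoding="utf-8"))
```

Specifying encoding="utf-8" explicitly avoids platform-dependent defaults, which matters for wiki text containing non-ASCII characters.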

Question 4:

Apply different word-level algorithms to the created file input.txt

Procedure: Imported the nltk library and the required packages. Applied word tokenization and sentence tokenization to the file content. Applied POS tagging with the nltk.pos_tag() method after tokenization. Applied the PorterStemmer, LancasterStemmer, and SnowballStemmer algorithms. Applied lemmatization to the tokenized content. Generated n-grams with n = 3 (trigrams) from the tokenized content. Applied Named Entity Recognition.
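The pipeline above can be sketched roughly as follows. Stemming and n-gram generation need no extra downloads, so they run in the demo on a hypothetical sample sentence split on whitespace. Tokenization, POS tagging, lemmatization, and NER are collected in analyze_with_nltk_data, which is not called here because it assumes the matching NLTK data packages have already been fetched with nltk.download() (package names vary by NLTK version). All helper names are illustrative.

```python
import nltk
from nltk.stem import (
    LancasterStemmer,
    PorterStemmer,
    SnowballStemmer,
    WordNetLemmatizer,
)
from nltk.util import ngrams


def stem_all(tokens):
    """Stem every token with each of the three stemmers."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }
    return {name: [s.stem(t) for t in tokens] for name, s in stemmers.items()}


def trigrams(tokens):
    """Return all runs of three consecutive tokens."""
    return list(ngrams(tokens, 3))


def analyze_with_nltk_data(text):
    """Steps that assume NLTK data (tokenizer, tagger, WordNet, NE chunker)
    is already installed; not executed in this sketch."""
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(words)
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(w) for w in words]
    entities = nltk.ne_chunk(tagged)  # tree with named-entity subtrees
    return sentences, words, tagged, lemmas, entities


# Hypothetical sample text; the lab read the contents of input.txt instead.
sample_tokens = "Google was founded by Larry Page and Sergey Brin".split()
stems = stem_all(sample_tokens)
grams = trigrams(sample_tokens)

for name, stemmed in stems.items():
    print(name, stemmed)
print(grams[:2])
```

Comparing the three stemmer outputs side by side shows how aggressively each one truncates words, which is the main reason to run all three on the same tokens.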