python icp7 - koushikskr/python GitHub Wiki
Introduction:
This ICP covers basic NLP techniques: unigrams, bigrams and trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language models.
Question 1:
a) Use SVM and see how the accuracy changes
procedure:
Created train and test data from the fetch_20newsgroups dataset, imported from sklearn.datasets.
Created an SVM model.
Passed the training data to fit the model.
Predicted labels for the test data.
Calculated the accuracy score.
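The SVM steps can be sketched as follows. This is a minimal sketch, not the code used in the ICP: the helper name `fit_and_score`, the choice of `SVC(kernel="linear")`, the MultinomialNB baseline, and the tiny in-memory corpus are all assumptions made so the example runs without downloading anything. The commented lines show how the 20 newsgroups data could be loaded instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


def fit_and_score(model, train_texts, train_labels, test_texts, test_labels,
                  **vectorizer_kwargs):
    """Hypothetical helper: vectorize with TF-IDF, fit the model, return accuracy."""
    vectorizer = TfidfVectorizer(**vectorizer_kwargs)
    x_train = vectorizer.fit_transform(train_texts)  # learn vocabulary on train only
    x_test = vectorizer.transform(test_texts)        # reuse the same vocabulary
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted)


# Toy stand-in corpus (assumption). With the real dataset one would use:
#   from sklearn.datasets import fetch_20newsgroups
#   train = fetch_20newsgroups(subset="train", shuffle=True)
#   test = fetch_20newsgroups(subset="test", shuffle=True)
# and pass train.data, train.target, test.data, test.target.
train_texts = [
    "the rocket reached orbit",
    "nasa launched a satellite",
    "the goalie blocked the puck",
    "the team won the hockey game",
]
train_labels = [0, 0, 1, 1]
test_texts = ["a satellite in orbit", "a hockey puck"]
test_labels = [0, 1]

svm_accuracy = fit_and_score(SVC(kernel="linear"), train_texts, train_labels,
                             test_texts, test_labels)
nb_accuracy = fit_and_score(MultinomialNB(), train_texts, train_labels,
                            test_texts, test_labels)
print("SVM accuracy:", svm_accuracy)
print("MultinomialNB accuracy:", nb_accuracy)
```

Running both models through the same helper keeps the vectorizer settings identical, so any difference in the printed scores comes from the classifier alone.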
b) Change the tfidfvectorizer to use bigram
procedure:
Vectorized the training data with ngram_range=(1, 2), so that bigrams are used in addition to unigrams.
Created a MultinomialNB model.
Passed the training data to fit the created model.
Predicted labels for the test data.
Calculated the accuracy score.
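To illustrate what the bigram setting changes, the sketch below compares the vocabulary that TfidfVectorizer learns with its default settings against the one learned with `ngram_range=(1, 2)`. The two example sentences are made up for illustration and are not part of the ICP data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, used only to show the effect on the vocabulary.
docs = ["machine learning is fun", "deep learning is powerful"]

# Default ngram_range is (1, 1): single words only.
unigram_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)

# ngram_range=(1, 2): single words plus pairs of adjacent words.
bigram_vocab = set(TfidfVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

new_features = sorted(bigram_vocab - unigram_vocab)
print(new_features)
# ['deep learning', 'is fun', 'is powerful', 'learning is', 'machine learning']
```

The unigram features are all still present; the bigram range only adds features, which is why the feature matrix grows and training can take longer on the full dataset.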
c) Set the argument stop_words='english'
procedure:
Vectorized the training data with stop_words='english', so that common English stop words are dropped.
Created a MultinomialNB model.
Passed the training data to fit the created model.
Predicted labels for the test data.
Calculated the accuracy score.
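The effect of the stop word argument can be seen on a small example. This is a sketch with made-up sentences; it only shows which words the built-in English stop list removes from the vocabulary, not the accuracy obtained in the ICP.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, used only to show which words get filtered out.
docs = ["the cat sat on the mat", "a dog and the cat"]

default_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)
filtered_vocab = set(TfidfVectorizer(stop_words="english").fit(docs).vocabulary_)

removed = sorted(default_vocab - filtered_vocab)
print(removed)                 # ['and', 'on', 'the']
print(sorted(filtered_vocab))  # ['cat', 'dog', 'mat', 'sat']
```

The word "a" does not appear in either vocabulary because the default token pattern already ignores single-character tokens.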
Question 2:
Extract the Google wiki page content using BeautifulSoup
procedure: Fetched the Google wiki URL with the urllib.request.urlopen() method. Parsed the returned page source with BeautifulSoup's HTML parser.
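A minimal sketch of the fetch-and-parse step is shown below. The helper names `fetch_html` and `extract_text` are hypothetical, the exact URL used in the ICP is not given (the one in the comment is an assumption), and removing script and style tags is an extra cleanup step that the procedure does not mention. The example parses an inline HTML string so that it runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Hypothetical helper: download the raw page source from a URL."""
    with urlopen(url) as response:
        return response.read()


def extract_text(html):
    """Hypothetical helper: parse HTML and return its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: drop script/style content so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


# Offline demonstration on a small inline page.
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Google</h1><p>Search engine.</p></body></html>"
)
page_text = extract_text(sample_html)
print(page_text)

# For the real page (URL is an assumption), one could call:
#   page_text = extract_text(fetch_html("https://en.wikipedia.org/wiki/Google"))
```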
Question 3:
Save the wiki page content into a text file
Created a file called 'input.txt' and wrote the wiki page content obtained in Question 2 to it.
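Writing the text out can be sketched as below. The helper name `save_text` is hypothetical, and the UTF-8 encoding is an assumption chosen because wiki pages often contain non-ASCII characters. The demonstration writes into a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path


def save_text(text, path="input.txt"):
    """Hypothetical helper: write text to a file as UTF-8 and return its path."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target


# Demonstration in a temporary directory so no stray file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_text("Google\nSearch engine.", Path(tmp_dir) / "input.txt")
    round_trip = saved.read_text(encoding="utf-8")

print(round_trip)
```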
Question 4:
Apply different word-level algorithms to the created file input.txt
procedure:
Imported the nltk library and the required packages.
Applied word tokenization and sentence tokenization to the file content.
Applied POS tagging using the nltk.pos_tag() method after tokenization.
Applied the PorterStemmer, LancasterStemmer and SnowballStemmer algorithms.
Applied lemmatization to the tokenized content.
Generated n-grams of size 3 (trigrams) from the tokenized content.
Applied Named Entity Recognition.
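The steps above can be sketched with NLTK as follows. The helper names `stem_all`, `trigrams` and `analyze_file` are hypothetical. Tokenization, POS tagging, lemmatization and named entity chunking need NLTK data packages fetched with nltk.download(), and the resource names differ between NLTK versions, so `analyze_file` is only defined here and not called. The stemmers and the n-gram helper need no extra data and are demonstrated on a whitespace-split sample sentence, which is a simplification of real tokenization.

```python
from nltk import ne_chunk, pos_tag, sent_tokenize, word_tokenize
from nltk.stem import (LancasterStemmer, PorterStemmer, SnowballStemmer,
                       WordNetLemmatizer)
from nltk.util import ngrams


def stem_all(tokens):
    """Stem every token with the three stemmers named in the procedure."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }
    return {name: [stemmer.stem(token) for token in tokens]
            for name, stemmer in stemmers.items()}


def trigrams(tokens):
    """Return all runs of three consecutive tokens."""
    return list(ngrams(tokens, 3))


def analyze_file(path="input.txt"):
    """Run the full pipeline on a file. Requires downloaded NLTK data
    (tokenizer, tagger, WordNet and NE chunker resources)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    return {
        "sentences": sent_tokenize(text),
        "tokens": tokens,
        "pos_tags": tagged,
        "stems": stem_all(tokens),
        "lemmas": [lemmatizer.lemmatize(token) for token in tokens],
        "trigrams": trigrams(tokens),
        "entities": ne_chunk(tagged),  # tree with named-entity subtrees
    }


# Offline demonstration: whitespace split stands in for word_tokenize here.
sample_tokens = "Google was founded by Larry Page".split()
stems = stem_all(sample_tokens)
sample_trigrams = trigrams(sample_tokens)
print(stems["porter"])
print(sample_trigrams[0])  # ('Google', 'was', 'founded')
```

Once the NLTK data is installed, calling `analyze_file("input.txt")` would return all intermediate results in one dictionary, which makes it easy to print or inspect each step separately.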