ICP7 - PallaviArikatla/Python GitHub Wiki
OBJECTIVE
To understand and implement NLP and NLTK features.
SOFTWARE REQUIRED
PyCharm, Python 3 / Python 2.
IMPLEMENTATION:
QUESTION 1: Change the classifier in the given code to
a. Calculate accuracy score using SVC model
- Create train and test data from the fetch_20newsgroups dataset.
- Select a subset of categories to reduce processing time.
- Create an SVC model.
- Fit the model on the training data.
- Predict on the test data and calculate the accuracy score.
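The steps in part (a) could be sketched roughly as below. This is a minimal sketch, not the code used for this ICP: the function names `svc_accuracy` and `newsgroups_svc_accuracy`, the linear kernel, and the TF-IDF features are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


def svc_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF features plus an SVC and return accuracy on the test texts."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)
    # Reuse the vocabulary learned from the training data.
    x_test = vectorizer.transform(test_texts)

    clf = SVC(kernel="linear")  # kernel choice is an assumption
    clf.fit(x_train, train_labels)
    predicted = clf.predict(x_test)
    return accuracy_score(test_labels, predicted)


def newsgroups_svc_accuracy(categories):
    """Run the same steps on 20 Newsgroups (downloads the data on first use)."""
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train", categories=categories, shuffle=True)
    test = fetch_20newsgroups(subset="test", categories=categories, shuffle=True)
    return svc_accuracy(train.data, train.target, test.data, test.target)
```

Calling `newsgroups_svc_accuracy` with a short list of category names (for example `["alt.atheism", "sci.space"]`) keeps the run time manageable, since fewer documents are loaded and vectorized.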
b. Change the TfidfVectorizer to use bigrams by setting ngram_range to (1,2).
- The vectorizer will then extract both unigrams and bigrams.
- Create a MultinomialNB model and fit it on this data.
- Predict on the test data, then calculate the accuracy score.
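As a rough illustration of part (b), the sketch below builds a TfidfVectorizer with `ngram_range=(1, 2)` and scores a MultinomialNB model. The function name `bigram_nb_accuracy` is hypothetical, and the function takes plain lists of texts and labels so it can be tried on any small corpus before running it on the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def bigram_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Return (accuracy, fitted vectorizer) for unigram + bigram TF-IDF features."""
    # ngram_range=(1, 2) keeps single words and adds adjacent word pairs.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)

    clf = MultinomialNB()
    clf.fit(x_train, train_labels)
    predicted = clf.predict(x_test)
    return accuracy_score(test_labels, predicted), vectorizer
```

Inspecting `vectorizer.vocabulary_` after fitting shows the extra two-word features, which is a quick way to confirm that the bigram setting took effect.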
c. Set the argument stop_words='english' and calculate the score
- Pass stop_words='english' to the vectorizer.
- Create a MultinomialNB model and fit the data into it.
- Calculate the accuracy score.
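Part (c) might look like the sketch below, assuming the stop-word argument is passed to a TfidfVectorizer as in the earlier parts. The function name `stopword_nb_accuracy` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def stopword_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Return (accuracy, fitted vectorizer) with English stop words removed."""
    # stop_words='english' drops common words such as "the" and "is"
    # using scikit-learn's built-in English list.
    vectorizer = TfidfVectorizer(stop_words="english")
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)

    clf = MultinomialNB()
    clf.fit(x_train, train_labels)
    predicted = clf.predict(x_test)
    return accuracy_score(test_labels, predicted), vectorizer
```

Removing stop words shrinks the vocabulary, so the score may move in either direction compared with the unfiltered vectorizer.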
Output scores are as follows:
QUESTION 2: Extract the content of the given URL to a text file using BeautifulSoup
- Install BeautifulSoup and import it from bs4.
- Request the URL and extract the content of that page.
- Parse the content using the HTML parser.
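A minimal sketch of these steps is shown below. The helper names `fetch_html` and `extract_text` are hypothetical, the download uses the standard library's `urlopen` (the original may have used a different HTTP client), and removing script and style tags is an extra assumption to keep only readable text.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw HTML of a page (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_text(html):
    """Parse HTML with Python's built-in parser and return its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)
```

With these helpers, `extract_text(fetch_html(url))` returns the page text for whichever URL the exercise specifies.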
QUESTION 3: Save the data extracted above to a new file input.txt.
- Save this content into an input.txt file.
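Writing the extracted text to disk could be done as in the sketch below. The function name `save_text` is hypothetical, and UTF-8 is an assumed encoding chosen so that non-ASCII characters from the page are preserved.

```python
from pathlib import Path


def save_text(text, path="input.txt"):
    """Write the extracted text to a file and return the file's path."""
    target = Path(path)
    # An explicit encoding avoids platform-dependent defaults.
    target.write_text(text, encoding="utf-8")
    return target
```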
The text file will be created as follows:
QUESTION 4: Apply different NLP features to the input.txt file created above.
- Import all the NLP and NLTK packages required.
- Tokenization:
Apply word and sentence tokenization to the input.txt file.
- Stemming:
Apply three types of stemming: PorterStemmer, LancasterStemmer, SnowballStemmer.
The output will be as follows:
- Lemmatization:
Apply POS tagging after tokenization, then lemmatize the tokens using WordNet.
- Trigram:
Apply n-grams with n = 3 to the tokens.
- Named Entity Recognition:
Apply named entity recognition to locate and classify the named entities in the text.
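The NLP steps listed above could be combined roughly as in the following sketch. All function names are hypothetical. Tokenization, POS tagging, lemmatization, and named entity recognition need NLTK data packages to be downloaded first with `nltk.download` (for example `punkt`, `averaged_perceptron_tagger`, `wordnet`, `maxent_ne_chunker`, and `words`; the exact resource names vary between NLTK versions). Stemming and trigrams work without extra data.

```python
import nltk
from nltk.stem import (LancasterStemmer, PorterStemmer, SnowballStemmer,
                       WordNetLemmatizer)
from nltk.util import ngrams


def tokenize(text):
    """Return (sentences, words) for the text; needs the punkt data."""
    return nltk.sent_tokenize(text), nltk.word_tokenize(text)


def stem_all(tokens):
    """Return (token, porter, lancaster, snowball) for each token."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    snowball = SnowballStemmer("english")
    return [(t, porter.stem(t), lancaster.stem(t), snowball.stem(t))
            for t in tokens]


def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (default noun)."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[:1], "n")


def lemmatize(tokens):
    """POS-tag the tokens, then lemmatize each one; needs tagger and wordnet data."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in nltk.pos_tag(tokens)]


def trigrams(tokens):
    """Return every run of three consecutive tokens."""
    return list(ngrams(tokens, 3))


def named_entities(tokens):
    """Return an NLTK tree with named-entity chunks; needs chunker data."""
    return nltk.ne_chunk(nltk.pos_tag(tokens))


def analyze_file(path="input.txt"):
    """Apply every step to a text file and return the results in a dict."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    sentences, words = tokenize(text)
    return {
        "sentences": sentences,
        "words": words,
        "stems": stem_all(words),
        "lemmas": lemmatize(words),
        "trigrams": trigrams(words),
        "entities": named_entities(words),
    }
```

Passing the POS tag to the lemmatizer matters because WordNet treats every word as a noun by default, so verbs such as "running" would otherwise be left unchanged.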