ICP7 - PallaviArikatla/Python GitHub Wiki
To understand and implement NLP and NLTK features.
PyCharm, Python 3 / Python 2.
QUESTION 1: Change the classifier in the given code to
a. Calculate accuracy score using SVC model
Create the training and test data from the fetch_20newsgroups dataset.
Select a subset of categories to keep the processing time short.
Create an SVC (support vector classifier) model.
Fit the model on the training data.
Predict the labels for the test data and calculate the accuracy score.
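The steps for part (a) can be sketched as follows. This is a minimal sketch, not the code given in the exercise: the category names, the linear kernel, and the helper names `load_newsgroups` and `svc_accuracy` are assumptions made for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Assumed categories; the page does not say which ones were selected.
CATEGORIES = ["alt.atheism", "sci.space"]


def load_newsgroups(subset, categories=CATEGORIES):
    """Return one split ("train" or "test") of 20 newsgroups (downloads on first use)."""
    return fetch_20newsgroups(
        subset=subset, categories=categories, shuffle=True, random_state=42
    )


def svc_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit an SVC on TF-IDF features and return its accuracy on the test texts."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on train only
    x_test = vectorizer.transform(test_texts)        # reuse that vocabulary for test
    model = SVC(kernel="linear")                     # kernel choice is an assumption
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted)
```

With the real data, `train = load_newsgroups("train")` and `test = load_newsgroups("test")` expose the texts as `.data` and the labels as `.target`, which can be passed directly to `svc_accuracy`.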
b. Change the TfidfVectorizer to use bigrams by setting ngram_range to (1,2).
With ngram_range=(1,2), the vectorizer extracts both unigrams and bigrams.
Create a MultinomialNB model and fit the vectorized training data to it.
Now predict the labels for the test data and calculate the accuracy score.
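A possible shape for part (b), assuming the texts and labels are already loaded as lists. The function name `bigram_nb_accuracy` is hypothetical; the fitted vectorizer is returned only so that the learned vocabulary can be inspected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def bigram_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Return (accuracy, fitted vectorizer) for MultinomialNB on 1- and 2-gram TF-IDF."""
    # ngram_range=(1, 2) keeps single words and adds every pair of adjacent words.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = MultinomialNB()
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted), vectorizer
```

Checking `vectorizer.vocabulary_` after fitting is a quick way to confirm that two-word features are present alongside the single words.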
c. Set the argument stop_words='english' and calculate the score
Pass stop_words='english' to the TfidfVectorizer so that common English words are ignored.
Create a MultinomialNB model and fit the data into it.
Calculate the score.
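Part (c) differs from the previous variant only in the vectorizer argument. The sketch below assumes the same list-based inputs; `stopword_nb_accuracy` is an illustrative name, not one taken from the exercise code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def stopword_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Return (accuracy, fitted vectorizer) for MultinomialNB with English stop words removed."""
    # stop_words="english" drops scikit-learn's built-in list of common English words.
    vectorizer = TfidfVectorizer(stop_words="english")
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = MultinomialNB()
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted), vectorizer
```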
Output scores are as follows:
QUESTION 2: Extract the content of the given URL to a text file using BeautifulSoup
Install BeautifulSoup and import it from bs4.
Request the URL and extract the content of that page.
Parse the content using the HTML parser.
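One way to sketch the download and parsing steps is shown below. The page does not say how the URL was requested, so using `urllib.request` from the standard library is an assumption, as are the helper names `fetch_html` and `html_to_text`. The URL itself is left as a parameter because it is not given here.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the page at `url` and return its HTML as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(html):
    """Parse HTML with Python's built-in parser and return the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop content that is not visible text
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```

Calling `html_to_text(fetch_html(url))` then yields the plain text of the page.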
QUESTION 3: Save the data extracted above to a new file input.txt.
- Save this content into the input.txt file.
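Saving the extracted text needs only the standard library. In this small sketch, the helper names `save_text` and `load_text` are illustrative, and UTF-8 is an assumed encoding.

```python
from pathlib import Path


def save_text(text, path="input.txt"):
    """Write `text` to `path` as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")


def load_text(path="input.txt"):
    """Read the saved file back as a single string."""
    return Path(path).read_text(encoding="utf-8")
```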
The text file will be created as follows:
QUESTION 4: Apply different NLP features to the input.txt file created above.
- Import all the NLP and NLTK packages required.
- Tokenization:
Apply word and sentence tokenization to the input.txt file.
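The page does not say which tokenizers were used. `nltk.word_tokenize` and `nltk.sent_tokenize` are the usual choice, but they need a Punkt model downloaded with `nltk.download`. The sketch below swaps in two NLTK classes that work without any download, so its output may differ slightly from the original run; the function names are illustrative.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer


def sentence_tokens(text):
    """Split text into sentences with an untrained Punkt tokenizer (default parameters)."""
    return PunktSentenceTokenizer().tokenize(text)


def word_tokens(text):
    """Split text into word tokens, one sentence at a time."""
    word_tokenizer = TreebankWordTokenizer()
    tokens = []
    for sentence in sentence_tokens(text):
        # Tokenizing per sentence lets the final period of each sentence split off.
        tokens.extend(word_tokenizer.tokenize(sentence))
    return tokens
```

The text to tokenize would be read from input.txt, for example with `open("input.txt", encoding="utf-8").read()`.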
- Stemming:
Apply three types of stemming: PorterStemmer, LancasterStemmer, SnowballStemmer.
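The three stemmers can be compared side by side on the same tokens. This is a minimal sketch: the helper name `stem_all` is hypothetical, and choosing "english" for the Snowball stemmer is an assumption.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def stem_all(tokens):
    """Return a dict mapping each stemmer name to the list of stems it produces."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),  # language is an assumption
    }
    return {name: [stemmer.stem(token) for token in tokens]
            for name, stemmer in stemmers.items()}
```

None of these stemmers needs extra NLTK data, so the sketch runs without a call to `nltk.download`.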
The output will be as follows:
- Lemmatization:
Apply POS tagging after tokenization, then lemmatize the tokens using WordNet.
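The tagger produces Penn Treebank tags, while the WordNet lemmatizer expects one-letter POS codes, so a small mapping step sits between the two. The sketch below assumes that the WordNet corpus and a POS tagger model have already been fetched with `nltk.download` (the exact resource names depend on the NLTK version). The helper names are illustrative.

```python
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the one-letter WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun


def lemmatize_tokens(tokens):
    """POS-tag the tokens, then lemmatize each one using its mapped POS code."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            for word, tag in pos_tag(tokens)]
```

Passing the POS code matters because the lemmatizer treats every word as a noun by default, which leaves most verb forms unchanged.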
- Trigram:
Apply n-grams with n = 3 to the tokens.
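A trigram is every run of three consecutive tokens. A minimal sketch using `nltk.util.ngrams` follows; the wrapper name `trigrams_of` is hypothetical.

```python
from nltk.util import ngrams


def trigrams_of(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(ngrams(tokens, 3))
```

For example, the tokens `["a", "b", "c", "d"]` give `[("a", "b", "c"), ("b", "c", "d")]`.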
- Named Entity Recognition:
Named entity recognition locates the entities in the text and classifies them into categories.
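With NLTK this is typically done by POS-tagging the tokens and passing the result to `ne_chunk`, which returns a tree whose subtrees are the recognized entities. The sketch below assumes the tagger and chunker models have been fetched with `nltk.download` (resource names vary by NLTK version). The helper names are illustrative.

```python
from nltk import ne_chunk, pos_tag
from nltk.tree import Tree


def chunk_entities(tokens):
    """POS-tag the tokens and run NLTK's default named entity chunker."""
    return ne_chunk(pos_tag(tokens))


def extract_entities(tree):
    """Collect (label, text) pairs from the entity subtrees of a chunk tree."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities
```

Calling `extract_entities(chunk_entities(tokens))` gives a flat list of labeled entities, such as PERSON or GPE, that is easier to print than the raw tree.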