ICP7 - PallaviArikatla/Python GitHub Wiki

OBJECTIVE

To understand and implement NLP and NLTK features.


SOFTWARE REQUIRED

PyCharm, Python 3 / Python 2.


IMPLEMENTATION:

QUESTION 1: Change the classifier in the given code as follows:

a. Calculate the accuracy score using an SVC model

  • Create the train and test data from the fetch_20newsgroups dataset.

  • Select a subset of categories to keep the processing time short.

  • Create the SVC (support vector classifier) model.

  • Fit the model on the vectorized training data.

  • Predict the labels for the test data and calculate the accuracy score.
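The steps for part a can be sketched as below. This is a minimal sketch, not the code used for this ICP: the function names `svc_accuracy` and `load_newsgroups`, the two-category subset, and the linear kernel are assumptions made for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


def svc_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF features plus an SVC on the train split; return test accuracy."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on train only
    x_test = vectorizer.transform(test_texts)        # reuse that vocabulary for test
    model = SVC(kernel="linear")                     # assumption: linear kernel
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted)


def load_newsgroups(categories=("alt.atheism", "sci.space")):
    """Download the train and test splits (hypothetical category subset)."""
    train = fetch_20newsgroups(subset="train", categories=list(categories), shuffle=True)
    test = fetch_20newsgroups(subset="test", categories=list(categories), shuffle=True)
    return train, test
```

To run it on the real data, call `train, test = load_newsgroups()` and then `svc_accuracy(train.data, train.target, test.data, test.target)`. The first call downloads the dataset.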

b. Change the TfidfVectorizer to include bigrams by setting ngram_range=(1, 2).

  • With ngram_range=(1, 2), the vectorizer builds features from both single words and pairs of adjacent words.

  • Create a MultinomialNB model and fit it on the transformed training data.

  • Predict the labels for the test data, then calculate the accuracy score.
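A minimal sketch of part b follows. The function name `bigram_nb_accuracy` is hypothetical, and returning the vocabulary alongside the score is only there to make the bigram features visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def bigram_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit unigram+bigram TF-IDF features with MultinomialNB.

    Returns (accuracy on the test split, sorted list of learned features).
    """
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # single words and adjacent pairs
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = MultinomialNB()
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted), sorted(vectorizer.vocabulary_)
```

The function takes the raw texts and labels of each split, for example the `.data` and `.target` attributes of the objects returned by `fetch_20newsgroups`. The returned vocabulary contains entries made of two words, which is the effect of the (1, 2) range.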

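c. Set the argument stop_words='english' and calculate the score

  • Pass stop_words='english' to the TfidfVectorizer so that common English words are dropped from the features.

  • Create a MultinomialNB model and fit it on the training data.

  • Predict the labels for the test data and calculate the accuracy score.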
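Part c can be sketched the same way. The function name `stop_word_nb_accuracy` is hypothetical; the only change from a default vectorizer is the `stop_words` argument.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def stop_word_nb_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF features without English stop words, then MultinomialNB.

    Returns (accuracy on the test split, sorted list of learned features).
    """
    vectorizer = TfidfVectorizer(stop_words="english")  # drops words such as "the", "and"
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    model = MultinomialNB()
    model.fit(x_train, train_labels)
    predicted = model.predict(x_test)
    return accuracy_score(test_labels, predicted), sorted(vectorizer.vocabulary_)
```

Inspecting the returned vocabulary is a quick way to confirm that words like "the" no longer appear as features.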

Output scores are as follows:

QUESTION 2: Extract the content of the given URL to a text file using BeautifulSoup

  • Install the beautifulsoup4 package and import BeautifulSoup from bs4.

  • Request the URL and read the content of that page.

  • Parse the content with the HTML parser to extract its text.
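A minimal sketch of these steps is shown below. The function names `extract_text` and `fetch_page_text` are hypothetical, and `urllib.request` from the standard library is an assumption; the original code may have used a different HTTP client.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_text(html):
    """Parse HTML with the built-in parser and return only the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text fragment and skips the empty ones
    return soup.get_text(separator="\n", strip=True)


def fetch_page_text(url):
    """Request the page at `url` and return its text content."""
    with urlopen(url) as response:
        html = response.read()
    return extract_text(html)
```

Calling `fetch_page_text(url)` with the URL given in the assignment returns the page text as a single string, one fragment per line.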

QUESTION 3: Save the data extracted above to a new file input.txt.

  • Write the extracted content to a file named input.txt.
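Saving the text can be sketched as follows. The helper name `save_text` is hypothetical, and UTF-8 is an assumed encoding.

```python
from pathlib import Path


def save_text(text, path="input.txt"):
    """Write `text` to `path` as UTF-8 and return the path that was written."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target
```

Writing with an explicit encoding avoids platform-dependent defaults when the page contains non-ASCII characters.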

The text file will be created as follows:

QUESTION 4: Apply different NLP features to the input.txt file created above.

  • Import all the NLP and NLTK packages required.

  • Tokenization:

Apply word and sentence tokenization to the contents of the input.txt file.

  • Stemming:

Apply three types of stemming: PorterStemmer, LancasterStemmer, SnowballStemmer.
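The three stemmers can be compared side by side on the same tokens. This is a sketch; the helper name `stem_tokens` is hypothetical, and "english" is the assumed language for the Snowball stemmer.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def stem_tokens(tokens):
    """Stem every token with each of the three stemmers.

    Returns a dict mapping the stemmer name to its list of stems.
    """
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }
    return {name: [stemmer.stem(token) for token in tokens]
            for name, stemmer in stemmers.items()}
```

The stemmers need no extra NLTK data. Lancaster is generally the most aggressive of the three, so its stems often differ from the Porter and Snowball results.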

The output will be as follows:

  • Lemmatization:

Apply part-of-speech (POS) tagging to the tokens, then lemmatize the tokens using the WordNet lemmatizer.

  • Trigram:

Apply ngrams with n = 3 to the tokens to produce trigrams.
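A minimal sketch using `nltk.util.ngrams`; the helper name `trigrams` is hypothetical.

```python
from nltk.util import ngrams


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(ngrams(tokens, 3))


print(trigrams(["a", "b", "c", "d"]))
# [('a', 'b', 'c'), ('b', 'c', 'd')]
```

A list of n tokens yields n - 2 trigrams, and a list with fewer than three tokens yields none.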

  • Named Entity Recognition:

Named entity recognition locates the named entities in the text and classifies them into categories such as person, organization, and location.