Module 1_ICP 7: Natural Language Processing in Python using NLTK - acikgozmehmet/PythonDeepLearning GitHub Wiki

# Natural Language Processing in Python using NLTK

Objectives:

The following topics are covered.

  1. NLP (Natural Language Processing)
  2. NLTK (Natural Language Toolkit)

Overview

NLP (Natural Language Processing)

  • Computer-aided analysis of text written in human language
  • The goal is to enable machines to understand human language and extract meaning from text
  • The Natural Language Toolkit (NLTK) is a Python module that provides a variety of functionality to aid in processing text

NLTK (Natural Language Toolkit)

  • An open-source library that simplifies the implementation of Natural Language Processing (NLP) in Python.
  • Supports text-processing tasks such as building unigrams, bigrams, and trigrams, tokenization, POS tagging, lemmatization, normalization, entity extraction, and language modeling.
  • Learning these features helps with more substantial projects such as document classification, spelling correction, and document summarization.
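As a minimal sketch of the unigram, bigram, and trigram features listed above, the snippet below tokenizes one sentence and builds n-grams with `nltk.util.ngrams`. The sample sentence is made up for illustration, and `wordpunct_tokenize` is chosen here only because it is regex-based and needs no NLTK data download; it is not necessarily the tokenizer used in the course code.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

# Hypothetical sample sentence, used only for illustration.
sentence = "NLTK simplifies natural language processing in Python."

# Split the sentence into word and punctuation tokens.
tokens = wordpunct_tokenize(sentence)

# ngrams() returns a generator of tuples; list() materializes it.
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(tokens)
print(bigrams[:3])
print(trigrams[:2])
```

A sentence with n tokens yields n - 1 bigrams and n - 2 trigrams, which is a quick sanity check when building n-gram features.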

In Class Programming

1. Change the classifier in the given code as follows:

  • a. Use an SVM and see how the accuracy changes
  • b. Change the TF-IDF vectorizer to use bigrams, TfidfVectorizer(ngram_range=(1,2)), and see how the accuracy changes
  • c. Set the argument stop_words='english' and see how the accuracy changes
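The three variations in exercise 1 could be compared along the following lines. This is a sketch under stated assumptions: the given code and its dataset are not shown on this page, so the tiny corpus below is invented, the `MultinomialNB` baseline is a guess at what the original classifier might be, and `LinearSVC` is just one of scikit-learn's SVM implementations. The `evaluate` helper is hypothetical as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration; swap in the exercise's real dataset.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "fans cheered as the team lifted the trophy",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the software update fixes a security bug",
    "engineers released a new programming language",
]
train_labels = ["sports"] * 4 + ["tech"] * 4
test_texts = ["the players scored in the final game",
              "a new processor for the laptop"]
test_labels = ["sports", "tech"]


def evaluate(vectorizer, classifier):
    """Fit vectorizer + classifier on the training set, return test accuracy."""
    model = make_pipeline(vectorizer, classifier)
    model.fit(train_texts, train_labels)
    return model.score(test_texts, test_labels)


results = {
    # Assumed baseline; the classifier in the given code may differ.
    "baseline (MultinomialNB)": evaluate(TfidfVectorizer(), MultinomialNB()),
    "a. SVM": evaluate(TfidfVectorizer(), LinearSVC()),
    "b. SVM + bigrams": evaluate(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
    "c. SVM + stop words": evaluate(TfidfVectorizer(stop_words="english"), LinearSVC()),
}

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Accuracy figures from a corpus this small say nothing about the real task; the point is only that each variation changes one argument while the rest of the pipeline stays fixed, which keeps the comparison fair.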

Click here to get the source code

2. Extract the text of the following web page using BeautifulSoup:

  https://en.wikipedia.org/wiki/Google

3. Save it in input.txt
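One possible way to approach steps 2 and 3 is sketched below. The helper names `html_to_text` and `save_page_text`, the `User-Agent` string, and the inline sample HTML are all assumptions made for this example; the linked source code may be organized differently. The download itself is left as a function so that the snippet runs without network access.

```python
import urllib.request

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements, whose contents are not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def save_page_text(url, path="input.txt"):
    """Download url, extract its text, and write it to path (hypothetical helper)."""
    # The User-Agent value is an arbitrary placeholder for this exercise.
    request = urllib.request.Request(url, headers={"User-Agent": "icp7-exercise/0.1"})
    with urllib.request.urlopen(request) as response:
        html = response.read()
    text = html_to_text(html)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Offline demonstration on a small inline document.
sample_html = (
    "<html><body><h1>Google</h1>"
    "<p>Google is a <b>technology</b> company.</p>"
    "<script>var x = 1;</script></body></html>"
)
sample_text = html_to_text(sample_html)
print(sample_text)
```

Calling `save_page_text("https://en.wikipedia.org/wiki/Google")` would then fetch the page named in the exercise and write its text to input.txt.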

Click here to get the source code

4. Apply the following to the text and show the output:

  • a. Tokenization
  • b. POS tagging
  • c. Stemming
  • d. Lemmatization
  • e. Trigrams
  • f. Named Entity Recognition
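The six steps above could be chained roughly as follows. This is a sketch rather than the linked solution: the `download_resources` and `analyze` helpers are hypothetical, the choice of `PorterStemmer` and `WordNetLemmatizer` is an assumption (NLTK offers other stemmers), and the NLTK data package names differ between NLTK versions, so both the older and newer identifiers are requested.

```python
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams


def download_resources():
    """Fetch the NLTK data used by analyze(); run once before the first call."""
    # Older and newer NLTK releases use different names for the same resources.
    for name in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
                 "wordnet", "omw-1.4",
                 "maxent_ne_chunker", "maxent_ne_chunker_tab", "words"):
        nltk.download(name, quiet=True)


def analyze(text):
    """Run steps a to f on text and return the results in a dict."""
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    tokens = word_tokenize(text)                        # a. tokenization
    tagged = pos_tag(tokens)                            # b. POS tagging
    stems = [stemmer.stem(t) for t in tokens]           # c. stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # d. lemmatization
    trigrams = list(ngrams(tokens, 3))                  # e. trigrams
    tree = ne_chunk(tagged)                             # f. named entity recognition

    # Named entities appear as labeled subtrees; plain tokens are (word, tag) tuples.
    entities = [
        (" ".join(word for word, _ in subtree.leaves()), subtree.label())
        for subtree in tree
        if hasattr(subtree, "label")
    ]
    return {"tokens": tokens, "pos": tagged, "stems": stems,
            "lemmas": lemmas, "trigrams": trigrams, "entities": entities}
```

After calling `download_resources()` once, the text saved earlier can be processed with `analyze(open("input.txt", encoding="utf-8").read())`, and each entry of the returned dict printed to show the output for the corresponding step. Note that `lemmatize` treats every token as a noun unless a part-of-speech argument is passed, so verbs are mostly left unchanged in this sketch.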

Click here to get the source code
