ICP 7 - awais546/Python-and-Deep-Learning GitHub Wiki

Python and Deep Learning

Introduction

In this lab we learned about one of the most important areas of AI, natural language processing (NLP). We covered basic NLP topics such as tokenization, stemming, and lemmatization. The Python library used for these tasks is NLTK.

Tasks

The tasks performed in this lab are as follows.

  • Perform tokenization and apply SVM
  • Extract the text from the URL and save it in the text file
  • Perform NLTK processes on the text

Tokenization and SVM

The dataset used for this task is 20 Newsgroups, loaded with scikit-learn's fetch_20newsgroups function. The SVM model can be imported with the following statement:

from sklearn.svm import SVC

Comparing the SVM model with the Naive Bayes model shows that the SVM performed better. The accuracies are shown in the following screenshot.
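The comparison can be sketched roughly as follows. This is a minimal sketch and not the lab's exact code: the helper name compare_models, the linear kernel, and the default TfidfVectorizer settings are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Return test accuracy for Naive Bayes and an SVM on TF-IDF features."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary on train only
    X_test = vectorizer.transform(test_texts)

    models = {
        "naive_bayes": MultinomialNB(),
        "svm": SVC(kernel="linear"),  # kernel choice is an assumption
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, train_labels)
        scores[name] = accuracy_score(test_labels, model.predict(X_test))
    return scores


# Usage on the lab's dataset (downloads the data on first call, and the
# SVM can take several minutes to train on the full training set):
# from sklearn.datasets import fetch_20newsgroups
# train = fetch_20newsgroups(subset="train", shuffle=True)
# test = fetch_20newsgroups(subset="test", shuffle=True)
# print(compare_models(train.data, train.target, test.data, test.target))
```

Fitting the vectorizer on the training split only keeps test-set vocabulary from leaking into the features.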

Changing the TF-IDF vectorizer to use bigrams decreased the accuracy of both models, as shown in the following screenshot.

Setting the vectorizer's stop-words argument to English improved the accuracy.
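The effect of the stop-words argument on the vocabulary can be seen in a small sketch like the one below. The toy sentences are made up, and it is an assumption that the lab passed stop_words="english" to TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the rocket is on the launchpad", "the game is in the arena"]

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english").fit(docs)

plain_vocab = sorted(plain.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)

print(plain_vocab)     # includes function words such as "the" and "is"
print(filtered_vocab)  # only the content words remain
```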

Extract text from URL

The text of a web page can be extracted using the BeautifulSoup4 library. The following code shows how to extract the text and save it to a text file.
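A minimal sketch of this step is shown here. The function names, the output file name input.txt, and the example URL are placeholders, not the ones used in the lab.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip the tags from an HTML string and return its text content."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def save_url_text(url, path="input.txt"):
    """Download a page, extract its text, and write it to a text file."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_to_text(html))


# Example call (placeholder URL):
# save_url_text("https://example.com", "input.txt")
```

Note that get_text also returns the contents of script and style tags, so a real page may need those elements removed first.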

NLTK Operations

Tokenization

Tokenization can be performed using the following code.
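As a sketch, sentence and word tokenization with NLTK might look like this. It assumes the Punkt tokenizer data can be downloaded; the resource is named punkt or punkt_tab depending on the NLTK release, so both are requested. The helper name tokenize is hypothetical.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of the tokenizer data (name differs between NLTK releases).
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)


def tokenize(text):
    """Return (sentences, words) for a piece of text."""
    return sent_tokenize(text), word_tokenize(text)


# Example:
# sentences, words = tokenize("NLTK splits text into sentences. It also splits them into words.")
# print(sentences)
# print(words)
```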

POS Tagging

To perform POS tagging use the following code.
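A possible sketch of POS tagging with nltk.pos_tag is given below. It assumes the perceptron tagger data can be downloaded (the resource name differs between NLTK releases) and that the input is already tokenized; tag_tokens is a hypothetical helper name.

```python
import nltk

# One-time download of the tagger model (name differs between NLTK releases).
for resource in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)


def tag_tokens(tokens):
    """Return a list of (token, part-of-speech tag) pairs."""
    return nltk.pos_tag(tokens)


# Example:
# print(tag_tokens(["The", "cat", "sat", "on", "the", "mat"]))
```

The tags follow the Penn Treebank tag set, for example NN for a singular noun and VBD for a past-tense verb.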

Stemming

Three types of stemming are performed. The code is shown below.
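The page does not name the three stemmers. The sketch below assumes they are the Porter, Lancaster, and Snowball stemmers that ship with NLTK, and stem_word is a hypothetical helper.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Assumed choice of stemmers; none of them needs extra data downloads.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def stem_word(word):
    """Return the stem produced by each stemmer for one word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in ["running", "studies", "generously"]:
    print(word, stem_word(word))
```

Printing the three results side by side makes it easy to see where the stemmers disagree; Lancaster is generally the most aggressive of the three.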

Lemmatization

Lemmatization can be performed using the following code.
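A sketch using NLTK's WordNetLemmatizer follows. It assumes the WordNet data can be downloaded, and lemmatize_words is a hypothetical helper name.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data used by the lemmatizer.
for resource in ("wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)


def lemmatize_words(words, pos="n"):
    """Lemmatize each word, treating it as the given part of speech."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]


# Example:
# print(lemmatize_words(["geese", "cars"]))      # nouns
# print(lemmatize_words(["running"], pos="v"))   # verbs
```

Unlike a stemmer, the lemmatizer returns dictionary forms, and the result depends on the part of speech passed in (the default is noun).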

Trigram

Trigrams can be generated using the following code. They are computed for each sentence produced by sentence tokenization.
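One way to sketch the per-sentence trigram step is shown below. It assumes the sentences have already been produced by sentence tokenization, and it uses wordpunct_tokenize for the words only because that tokenizer needs no data download; sentence_trigrams is a hypothetical helper.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams


def sentence_trigrams(sentences):
    """Return one list of word trigrams per sentence."""
    result = []
    for sentence in sentences:
        tokens = wordpunct_tokenize(sentence)
        result.append(list(ngrams(tokens, 3)))  # n=3 gives trigrams
    return result


print(sentence_trigrams(["the quick brown fox", "it jumps"]))
```

A sentence with fewer than three tokens yields an empty list, since no trigram can be formed from it.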

Named Entity Recognition

Use the following code to apply NER to each line of the text.
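A sketch of per-line NER with nltk.ne_chunk might look like this. It assumes the tokenizer, tagger, and chunker data can all be downloaded (resource names differ between NLTK releases, so both variants are requested); named_entities is a hypothetical helper.

```python
import nltk

# One-time downloads; names differ between NLTK releases.
for resource in (
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker", "maxent_ne_chunker_tab",
    "words",
):
    nltk.download(resource, quiet=True)


def named_entities(line):
    """Return (entity text, entity label) pairs found in one line of text."""
    tagged = nltk.pos_tag(nltk.word_tokenize(line))
    tree = nltk.ne_chunk(tagged)
    entities = []
    for node in tree:
        # Named entities are subtrees; ordinary tokens are plain tuples.
        if isinstance(node, nltk.Tree):
            name = " ".join(token for token, _tag in node.leaves())
            entities.append((name, node.label()))
    return entities


# Example, applied line by line to a file (placeholder file name):
# with open("input.txt", encoding="utf-8") as f:
#     for line in f:
#         print(named_entities(line))
```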