ICP 7 - awais546/Python-and-Deep-Learning GitHub Wiki

Python and Deep Learning

Introduction

In this lab we learned about one of the most important module of AI known as natural language processing. We learned the basic topics of natural language processing like tokenization, stemming, lemmatization etc. The library used for natural language processing in python is NLTK.

Tasks

The tasks performed in this lab are as follows.

Perform tokenization and apply SVM
Extract the text from the URL and save it in the text file
Perform NLTK processes on the text

Tokenization and SVM

The dataset used for this is fetch_20newsgroups. In order to import the SVM model you can apply the following code. from sklearn.svm import SVC

By comparing the SVM model with naive bayes model we see that SVM performed better. The accuracy is shown in the following screenshot.

By changing the tdif vectorizer to bigram the accuracy of both the models have decreased as shown in the following screenshot.

By adding the stop words argument as English we can see the the accuracy has increased more.

Extract text from URL

The texts from the url can be extracted using the Beautifulsoup4 library. The following code shows the method to extract and save it in the text file.