LAB ASSIGNMENT 5 - Sreelakshmi-N/CS5560SreelakshmiLabAssignment GitHub Wiki

CS5560 Knowledge Discovery Management

Name: Sreelakshmi Nandanamudi

Classid: 17

1.In Class Question

Write a simple spark program to conduct clustering and classification techniques with the datasets (questions and answers) as part of your question answering system.

a.Dataset:Your own dataset

I used a dataset from the news domain.

b.Conduct and compare K-Means vs LDA

  1. Both K-means and Latent Dirichlet Allocation (LDA) are unsupervised learning algorithms.
  2. In both cases the user must choose the parameter K in advance: the number of clusters for K-means and the number of topics for LDA.
  3. When both are used to assign K topics to a set of N documents, the most evident difference is that K-means partitions the N documents into K disjoint clusters, so each document belongs to exactly one cluster.
  4. LDA, on the other hand, assigns each document a mixture of topics, so a document can be characterized by one or more topics.
  5. Hence, LDA can give more realistic results than K-means for topic assignment.
  6. K-means takes considerably less time to train than LDA, whereas its processing time is longer than LDA's.
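The hard partitioning described in point 3 can be sketched with a minimal K-means (Lloyd's algorithm) in plain Python. This is only an illustration, not the Spark MLlib implementation used in the lab: the function name `kmeans`, the toy two-term count vectors, and the choice to seed the centroids with the first K points are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: every point goes to exactly one nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical term-count vectors for four documents (two terms each).
docs = [(5.0, 0.0), (4.0, 1.0), (0.0, 5.0), (1.0, 4.0)]
labels, centers = kmeans(docs, k=2)
# labels == [0, 0, 1, 1]: each document receives a single cluster id.
```

Each document ends up with one cluster id and nothing else, which is the contrast with LDA's per-document topic mixture.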

The news dataset was given as input and passed to the LDA algorithm.

c.Conduct and compare Different Feature Vectors

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.
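The Data => NLP => TF steps can be sketched in plain Python as follows. The source does not say which NLP operations were applied, so lowercasing, alphabetic tokenization and stop-word removal are assumed here; the stop-word list and the function names `preprocess` and `term_frequencies` are hypothetical. A Spark pipeline would typically hash terms into a fixed-size vector, whereas this sketch keeps a term-to-count mapping for readability.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to", "and"}

def preprocess(text):
    """Assumed NLP step: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(text):
    """Raw term counts for one document (the TF feature vector, keyed by term)."""
    return Counter(preprocess(text))

tf = term_frequencies("The market rises and the market falls in the news.")
# tf maps market -> 2, rises -> 1, falls -> 1, news -> 1
```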

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.
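To make the TFIDF step concrete, here is a minimal plain-Python sketch. It assumes the smoothed inverse document frequency idf(t) = ln((N + 1) / (df(t) + 1)), which is the form documented for Spark MLlib, and multiplies it by the raw term count. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: tf * idf} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF (assumed form): ln((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]

# Hypothetical, already-tokenized documents.
docs = [["market", "news"], ["sports", "news"], ["market", "market", "stocks"]]
weights = tf_idf(docs)
```

With this formula a term that occurs in every document gets ln(1) = 0, so it contributes nothing to the feature vector, while rarer terms such as `sports` here are weighted more heavily than terms shared by several documents.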

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Naive Bayes algorithm.

d.Conduct and compare the following Classification Algorithms using different feature vectors (FV1, FV2, FV3)

1)Random Forest

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Random Forest algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Random Forest algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Random Forest algorithm.

2)Naïve Bayes

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Naive Bayes algorithm.
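As an illustration of how Naive Bayes uses term-count features, the following is a minimal multinomial Naive Bayes sketch in plain Python with additive (Laplace) smoothing. It is not the Spark MLlib implementation used in the lab; the function names `train_nb` and `predict_nb`, the smoothing value, and the tiny labeled sample are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """samples: list of (tokens, label). Returns (log priors, log likelihoods)."""
    class_docs = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    for tokens, label in samples:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Additive smoothing keeps unseen (term, class) pairs from getting probability 0.
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(model, tokens):
    """Pick the class with the highest log posterior; out-of-vocabulary terms are skipped."""
    log_prior, log_like = model
    def score(c):
        return log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
    return max(log_prior, key=score)

# Hypothetical labeled training documents (already tokenized).
train = [
    (["stocks", "market", "shares"], "business"),
    (["market", "profit"], "business"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "match"], "sports"),
]
model = train_nb(train)
prediction = predict_nb(model, ["market", "shares", "team"])
# prediction == "business"
```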

3)Decision Tree

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Decision Tree algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Decision Tree algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Decision Tree algorithm.
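A decision tree classifier grows by choosing, at each node, the feature threshold that leaves the purest child nodes. The sketch below shows that scoring step with Gini impurity in plain Python. It is a simplified illustration only: the function names `gini` and `split_impurity` and the toy term-count rows are hypothetical, and the source does not state which impurity measure was configured in the lab.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

# Hypothetical term-count features for four labeled documents.
rows = [
    {"market": 2, "team": 0},
    {"market": 1, "team": 0},
    {"market": 0, "team": 2},
    {"market": 0, "team": 1},
]
labels = ["business", "business", "sports", "sports"]

before = gini(labels)                               # 0.5 for two balanced classes
after = split_impurity(rows, labels, "market", 0)   # 0.0: this split separates the classes
```

A tree builder would evaluate many such candidate splits and keep the one with the lowest weighted impurity. A random forest repeats this over many trees trained on resampled data and combines their votes.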

Report your insights on each of the tasks

From the screenshots above, we can say that the classification algorithms reach higher accuracy when the dataset goes through NLP operations and TF calculations, whereas accuracy is lower when TFIDF is used.
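For reference, the comparison above rests on accuracy, taken here to mean the fraction of test documents whose predicted label matches the true label. That definition is an assumption, since the report does not state how the metric was computed; the function name `accuracy` and the labels below are made up purely to show the calculation and are not results from the lab.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, not taken from the lab's output.
actual = ["business", "sports", "sports", "business"]
predicted = ["business", "sports", "business", "business"]
score = accuracy(predicted, actual)  # 3 of 4 correct -> 0.75
```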

2.Take home Question

Enhance your question answering system (from lab 4) using the classification/clustering techniques with your datasets (questions and answers) by following steps in task(1).

a.Using two datasets

1)Your own data

2)Yahoo! Answers data (the training data contains 2,698 questions, already labeled with one of the following 7 categories; the test data contains 1,874 questions that are unlabeled). Questions are classified into one of the following 7 categories:

1)Business&Finance

2)Computers&Internet

3)Entertainment&Music

4)Family&Relationships

5)Education&Reference

6)Health

7)Science&Mathematics

I placed my own science dataset in one folder and the Yahoo science dataset in another folder, and ran the clustering and classification algorithms on them.

b.Report on the results on clustering your questions/answers datasets: K-Means vs LDA

K-Means output:

LDA output:

c.Report on the results on the different Feature Vectors used in classification on your questions/answers datasets

1)FV1: Data => NLP => TF => Feature Vector

2)FV2: Data => NLP => TFIDF => Feature Vector

3)FV3: Data => TFIDF => Feature Vector

d.Report on the results and comparative evaluation on the following Classification algorithms using different feature vectors (FV1, FV2, FV3)

Using Yahoo dataset

1)Random Forest

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Random Forest algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Random Forest algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Random Forest algorithm.

2)Naïve Bayes

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Naive Bayes algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Naive Bayes algorithm.

3)Decision Tree

1)FV1: Data => NLP => TF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TF values were calculated, and the resulting feature vectors were passed to the Decision Tree algorithm.

2)FV2: Data => NLP => TFIDF => Feature Vector

The news dataset was given as input, NLP operations were performed on it, TFIDF values were calculated, and the resulting feature vectors were passed to the Decision Tree algorithm.

3)FV3: Data => TFIDF => Feature Vector

The news dataset was given as input, TFIDF values were calculated directly on it (without the NLP step), and the resulting feature vectors were passed to the Decision Tree algorithm.

Question and Answering

Question1

Answer1

Question2

Answer2

Question3

Answer3