Tutorial 5 - Nagumkc/CS5560_KDM_Lab-assignments GitHub Wiki

NAME: Nageswara rao Nandigam


ClassID: 18


Lab Assignment #5


1. In-Class Question

a. Write a simple Spark program that applies clustering and classification techniques to the datasets (questions and answers) as part of your question answering system


b. Conduct and compare K-Means vs. LDA


c. Conduct and compare different feature vectors


1) FV1: Data => NLP => TF => Feature Vector


2) FV2: Data => NLP => TFIDF => Feature Vector


3) FV3: Data => TFIDF => Feature Vector


d. Conduct and compare the following classification algorithms using different feature vectors (FV1, FV2, FV3)


1) Random Forest


2) Naïve Bayes


3) Decision Tree
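
The three feature-vector pipelines in item c differ only in whether an NLP step runs before vectorization and whether raw term counts are reweighted by inverse document frequency. The sketch below illustrates that difference in plain Python rather than Spark, so it is not the program the assignment asks for. The stop-word list, the crude plural stripping that stands in for the NLP step, the sample questions, and every function name are assumptions made for illustration. The IDF used is the smoothed form log((N + 1) / (df + 1)).

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in",
              "what", "how", "do", "i"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def nlp(tokens):
    """Stand-in for the NLP step: drop stop words, strip a trailing plural 's'."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]


def tf_vectors(docs, vocab):
    """Raw term-frequency vector for each document over a fixed vocabulary."""
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vectors


def tfidf_vectors(docs, vocab):
    """TF-IDF vectors using the smoothed IDF log((N + 1) / (df + 1))."""
    n = len(docs)
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    idf = [math.log((n + 1) / (df[term] + 1)) for term in vocab]
    return [[tf * w for tf, w in zip(vec, idf)]
            for vec in tf_vectors(docs, vocab)]


def build(questions, use_nlp, use_idf):
    """Run one pipeline and return (vocabulary, feature vectors)."""
    docs = [tokenize(q) for q in questions]
    if use_nlp:
        docs = [nlp(d) for d in docs]
    vocab = sorted({t for d in docs for t in d})
    vectors = tfidf_vectors(docs, vocab) if use_idf else tf_vectors(docs, vocab)
    return vocab, vectors


# Hypothetical sample questions.
questions = [
    "How do I invest in stocks?",
    "What is the best laptop for programming?",
    "How are stocks taxed?",
]

fv1 = build(questions, use_nlp=True, use_idf=False)   # Data => NLP => TF
fv2 = build(questions, use_nlp=True, use_idf=True)    # Data => NLP => TFIDF
fv3 = build(questions, use_nlp=False, use_idf=True)   # Data => TFIDF
```

In this toy example FV3 keeps stop words such as "how" and "the" in its vocabulary, so its vectors are wider than those of FV1 and FV2, while FV2 down-weights terms that appear in several questions.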


2. Take-Home Question

a. Using two datasets:

1) Your own data

2) Yahoo! Answer data (the training data contains 2,698 questions, already labeled with one of the following 7 categories; the test data contains 1,874 questions that are unlabeled)

Classify questions into one of the following 7 categories:

1) Business&Finance

2) Computers&Internet

3) Entertainment&Music

4) Family&Relationships

5) Education&Reference

6) Health

7) Science&Mathematics


b. Report the results of clustering your questions/answers datasets: K-Means vs. LDA

K-Means:

LDA:


c. Report the results for the different feature vectors used in classification on your questions/answers datasets


1) FV1: Data => NLP => TF => Feature Vector


2) FV2: Data => NLP => TFIDF => Feature Vector


3) FV3: Data => TFIDF => Feature Vector


d. Report the results and a comparative evaluation of the following classification algorithms using different feature vectors (FV1, FV2, FV3)


1) Random Forest

FV1:

FV2:

FV3:


2) Naïve Bayes

FV1:

FV2:

FV3:


3) Decision Tree

FV1:

FV2:

FV3:
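
A comparison across classifiers and feature vectors needs one metric computed the same way for every combination, for example accuracy on held-out labeled questions. The sketch below shows that evaluation for a single classifier, multinomial Naïve Bayes with Laplace smoothing over term counts, written in plain Python rather than Spark. The tiny training and test sets and all function names are hypothetical, only two of the seven categories are used, and Random Forest and Decision Tree are not covered.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenized documents (Laplace smoothing)."""
    vocab = {t for d in docs for t in d}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(term_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / denom)
                for t in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the label with the highest log-posterior; unseen terms are skipped."""
    def score(label):
        params = model[label]
        likelihood = params["log_likelihood"]
        return params["log_prior"] + sum(likelihood[t] for t in doc if t in likelihood)
    return max(model, key=score)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(1 for d, y in zip(docs, labels) if predict(model, d) == y)
    return correct / len(docs)


# Hypothetical, already tokenized questions with category labels.
train_docs = [
    ["stock", "invest", "tax"],
    ["loan", "bank", "invest"],
    ["laptop", "software", "install"],
    ["software", "internet", "browser"],
]
train_labels = [
    "Business&Finance", "Business&Finance",
    "Computers&Internet", "Computers&Internet",
]
test_docs = [["invest", "bank"], ["install", "browser"]]
test_labels = ["Business&Finance", "Computers&Internet"]

model = train_naive_bayes(train_docs, train_labels)
acc = accuracy(model, test_docs, test_labels)
print(f"Naive Bayes accuracy: {acc:.2f}")
```

Running the same `accuracy` function on the output of each classifier for FV1, FV2, and FV3 would give a 3 x 3 grid of numbers that can be compared directly.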


Question and answer system

Question 1:

Output:

Question 2:

Output:

Question 3:

Output: