Tutorial 5 - Nagumkc/CS5560_KDM_Lab-assignments GitHub Wiki

NAME: Nageswara rao Nandigam

ClassID: 18

Lab Assignment #5

1. In-Class Question

a. Write a simple Spark program that applies clustering and classification techniques to the datasets (questions and answers) used in your question answering system.

b. Conduct and compare K-Means vs. LDA.

c. Conduct and compare different feature vectors:

1) FV1: Data => NLP => TF => Feature Vector

2) FV2: Data => NLP => TF-IDF => Feature Vector

3) FV3: Data => TF-IDF => Feature Vector

d. Conduct and compare the following classification algorithms using the different feature vectors (FV1, FV2, FV3):

1) Random Forest

2) Naïve Bayes

3) Decision Tree
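
The three feature-vector pipelines in item c can be illustrated outside Spark with a minimal plain-Python sketch. Everything here is an assumption made for illustration: the toy questions, the tiny stop-word list, and the `nlp_tokens` helper standing in for the NLP step (a real pipeline would typically use a proper tokenizer and lemmatizer). The sketch builds an explicit vocabulary, whereas Spark's `HashingTF` hashes terms into a fixed number of buckets. The smoothed IDF, log((N + 1) / (df + 1)), is the same form Spark MLlib's `IDF` documents.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and toy questions, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "for",
              "what", "how", "do", "i"}
DOCS = [
    "What is the best laptop for programming",
    "How do I invest in the stock market",
    "The best way to learn programming",
]

def nlp_tokens(text):
    """Stand-in for the NLP step: lowercase, keep letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def raw_tokens(text):
    """No NLP step: plain whitespace split, case preserved."""
    return text.split()

def tf_vectors(docs_tokens):
    """Term-frequency vectors over a sorted vocabulary."""
    vocab = sorted({t for toks in docs_tokens for t in toks})
    vectors = []
    for toks in docs_tokens:
        counts = Counter(toks)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

def tfidf_vectors(docs_tokens):
    """TF-IDF vectors using the smoothed IDF log((N + 1) / (df + 1))."""
    vocab, tf = tf_vectors(docs_tokens)
    n_docs = len(docs_tokens)
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((n_docs + 1) / (d + 1)) for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in tf]

# FV1: Data => NLP => TF
vocab1, fv1 = tf_vectors([nlp_tokens(d) for d in DOCS])
# FV2: Data => NLP => TF-IDF
vocab2, fv2 = tfidf_vectors([nlp_tokens(d) for d in DOCS])
# FV3: Data => TF-IDF (no NLP step)
vocab3, fv3 = tfidf_vectors([raw_tokens(d) for d in DOCS])

print(len(vocab1), len(vocab3))
```

Skipping the NLP step (FV3) leaves stop words and case variants such as "The" and "the" as separate dimensions, so the vocabulary grows. In FV2, a term that appears in only one question ("laptop") receives a higher weight than one shared across questions ("best"). Either kind of vector can then be passed to the classifiers in item d.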

2. Take-Home Question

a. Using two datasets, classify questions into one of the 7 categories listed below.

Datasets:

1) Your own data

2) Yahoo! Answers data (the training data contains 2,698 questions, each already labeled with one of the 7 categories; the test data contains 1,874 unlabeled questions)

Categories:

1) Business & Finance

2) Computers & Internet

3) Entertainment & Music

4) Family & Relationships

5) Education & Reference

6) Health

7) Science & Mathematics

b. Report the results of clustering your question/answer datasets: K-Means vs. LDA
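
As background for this comparison, the K-Means side can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's algorithm, not the Spark MLlib `KMeans` implementation: the `kmeans` and `sq_dist` helpers, the toy 2-D points, and the deterministic "first k points" initialization are all assumptions chosen to keep the example reproducible. In practice the points would be the TF or TF-IDF feature vectors of the questions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignments, centroids

# Toy data: two well-separated groups (hypothetical values).
POINTS = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignments, centroids = kmeans(POINTS, k=2)
print(assignments)
```

K-Means assigns each document to exactly one cluster, whereas LDA describes each document as a mixture over topics. That difference is worth keeping in mind when the two sets of results are placed side by side.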

K-Means: