Tutorial 5 - Nagumkc/CS5560_KDM_Lab-assignments GitHub Wiki
NAME: Nageswara rao Nandigam
ClassID: 18
Lab Assignment #5
1. In-Class Question
a. Write a simple Spark program to apply clustering and classification techniques to the datasets (questions and answers) used in your question answering system
b. Conduct and compare K-Means vs. LDA
c. Conduct and compare different feature vectors
1) FV1: Data => NLP => TF => Feature Vector
2) FV2: Data => NLP => TFIDF => Feature Vector
3) FV3: Data => TFIDF => Feature Vector
d. Conduct and compare the following classification algorithms using different feature vectors (FV1, FV2, FV3)
1) Random Forest
2) Naïve Bayes
3) Decision Tree
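The three feature-vector pipelines listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Spark program the assignment asks for; in Spark MLlib the corresponding steps would typically use transformers such as `HashingTF` and `IDF`. The stop-word filter standing in for the "NLP" stage, the helper names (`tokenize`, `nlp`, `build_fv`), and the three toy questions are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list standing in for the "NLP" stage (assumption).
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "what", "how", "do", "i"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def nlp(tokens):
    """Stand-in for the NLP stage: drop stop words (assumption)."""
    return [t for t in tokens if t not in STOP_WORDS]

def tf(tokens, vocab):
    """Raw term counts, ordered by the vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(docs, vocab):
    """Smoothed IDF, log((N + 1) / (df + 1))."""
    n = len(docs)
    return [math.log((n + 1) / (sum(1 for d in docs if w in d) + 1))
            for w in vocab]

def build_fv(questions, use_nlp, use_idf):
    """Return (vocabulary, one feature vector per question)."""
    docs = [tokenize(q) for q in questions]
    if use_nlp:
        docs = [nlp(d) for d in docs]
    vocab = sorted({t for d in docs for t in d})
    vectors = [tf(d, vocab) for d in docs]
    if use_idf:
        weights = idf(docs, vocab)
        vectors = [[c * w for c, w in zip(v, weights)] for v in vectors]
    return vocab, vectors

# Toy questions, made up for illustration only.
questions = [
    "What is the interest rate of a loan",
    "How do I install the Linux kernel",
    "What is the best way to learn calculus",
]

vocab1, fv1 = build_fv(questions, use_nlp=True, use_idf=False)   # FV1
vocab2, fv2 = build_fv(questions, use_nlp=True, use_idf=True)    # FV2
vocab3, fv3 = build_fv(questions, use_nlp=False, use_idf=True)   # FV3
```

Under these assumptions, FV3 keeps stop words such as "the" in its vocabulary, but a term that occurs in every question receives an IDF weight of zero, so it contributes nothing to the vector. FV1 and FV2 share a vocabulary and differ only in whether the counts are reweighted.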
2. Take-Home Question
a. Using two datasets:
1) Your own data
2) Yahoo! Answers data (the training data contains 2,698 questions, already labeled with one of the following 7 categories; the test data contains 1,874 questions that are unlabeled)
Classify questions into one of the following 7 categories:
1) Business & Finance
2) Computers & Internet
3) Entertainment & Music
4) Family & Relationships
5) Education & Reference
6) Health
7) Science & Mathematics
b. Report on the results of clustering your question/answer datasets: K-Means vs. LDA
K-Means:
LDA:
c. Report on the results of the different feature vectors used in classification on your question/answer datasets
1) FV1: Data => NLP => TF => Feature Vector
2) FV2: Data => NLP => TFIDF => Feature Vector
3) FV3: Data => TFIDF => Feature Vector
d. Report on the results and comparative evaluation of the following classification algorithms using different feature vectors (FV1, FV2, FV3)
1) Random Forest
FV1:
FV2:
FV3:
2) Naïve Bayes
FV1:
FV2:
FV3:
3) Decision Tree
FV1:
FV2:
FV3:
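For the comparative evaluation in part d as a whole, each classifier and feature-vector combination can be scored with the same two measures: accuracy and a confusion matrix. The sketch below is plain Python rather than Spark (Spark MLlib offers `MulticlassClassificationEvaluator` for this purpose). The helper names, the run names, and the label lists are made-up placeholders, not results from this lab.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Placeholder data for illustration only (not measured results).
labels = ["Business&Finance", "Health", "Science&Mathematics"]
y_true = ["Health", "Health", "Business&Finance", "Science&Mathematics"]
runs = {
    "run_1": ["Health", "Business&Finance", "Business&Finance", "Science&Mathematics"],
    "run_2": ["Health", "Health", "Health", "Science&Mathematics"],
}

# Score every run against the same held-out labels.
for name, y_pred in runs.items():
    print(name, accuracy(y_true, y_pred))
    for row in confusion_matrix(y_true, y_pred, labels):
        print("   ", row)
```

Scoring every combination against the same held-out labels keeps the nine results (three classifiers times three feature vectors) directly comparable.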
Question and answer system
Question 1:
Output:
Question 2:
Output:
Question 3:
Output: