Lab Assignment 5 - SaratM34/KDM-Lab-Assignments GitHub Wiki
Name: Mudunuri Sri Sai Sarat Chandra Varma
Class Id: 14
Objective: The objective of this lab is to enhance the question answering system from Lab 4 by applying classification and clustering techniques to our datasets (questions and answers). Following the steps in task (1), we use our own data and the Yahoo question answering dataset, report the results, and enrich the question answering system using machine learning approaches (clustering and classification).
Write a simple Spark program that applies clustering and classification techniques to the datasets (questions and answers) as part of your question answering system.
- Dataset: Your own dataset
- Conduct and compare K-Means vs LDA: When K-Means and LDA were run on the same dataset, training took far less time with K-Means than with LDA. When pre-processing time is taken into account, however, pre-processing for LDA was faster than for K-Means.
- Conduct and compare Different Feature Vectors:
1) FV1: Data => NLP => TF => Feature Vector
2) FV2: Data => NLP => TFIDF => Feature Vector
3) FV3: Data => TFIDF => Feature Vector
Comparisons: Across the different feature vectors, Random Forest gave higher accuracy than both Naive Bayes and Decision Tree, and Decision Tree gave higher accuracy than Naive Bayes. The confusion matrices obtained after training the data also differed.
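To make the difference between the TF and TFIDF pipelines concrete, here is a minimal sketch in plain Python. It is not the Spark program used in the lab: the `tokenize` helper is a hypothetical stand-in for the NLP step (it only lowercases and splits on whitespace), the function names are illustrative, and the smoothed IDF formula `log((N + 1) / (df + 1))` is an assumption chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for the NLP step: lowercase and split on whitespace.
    return text.lower().split()


def term_frequency(tokens, vocabulary):
    # Raw count of each vocabulary term in one document (the TF vector).
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf_vectors(documents):
    # Returns the vocabulary, the TF vectors, and the TFIDF vectors.
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Smoothed inverse document frequency (an assumption for this sketch).
    idf = []
    for term in vocabulary:
        doc_freq = sum(1 for doc in tokenized if term in doc)
        idf.append(math.log((n_docs + 1) / (doc_freq + 1)))

    tf = [term_frequency(doc, vocabulary) for doc in tokenized]
    tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]
    return vocabulary, tf, tfidf


if __name__ == "__main__":
    # Made-up example sentences, not taken from the lab datasets.
    vocab, tf, tfidf = tf_idf_vectors(["what is spark", "what is lda"])
    print(vocab)
    print(tf)
    print(tfidf)
```

In this sketch, terms that occur in every document receive a TFIDF weight of zero, while TF keeps their raw counts. That is one reason the same classifier can behave differently on FV1 and FV2.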
Conduct and compare the following Classification Algorithms using different feature vectors (FV1, FV2, FV3)
1) Random Forest
2) Naïve Bayes
3) Decision Tree
Comparison: All three algorithms are classifiers trained on the provided training data. Of the three, Random Forest had the best accuracy, with Decision Tree second. The processing speed of Random Forest was also higher than that of the other algorithms. On some other datasets, however, Random Forest gave lower accuracy, while Naive Bayes and Decision Tree gave much better results. We conclude that Random Forest works well on large datasets, whereas Naive Bayes and Decision Tree give better accuracy on smaller datasets.
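The accuracy figures and confusion matrices compared above can be derived from a list of actual and predicted labels. The following is a minimal plain-Python sketch of that calculation, not the lab's Spark code; the function names and the example labels are hypothetical and are not results from the lab.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    # Rows are actual classes, columns are predicted classes.
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    # Fraction of predictions on the diagonal (correctly classified).
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Made-up labels for illustration only.
    labels = ["sport", "tech"]
    actual = ["sport", "sport", "tech", "tech"]
    predicted = ["sport", "tech", "tech", "tech"]
    matrix = confusion_matrix(actual, predicted, labels)
    print(matrix)
    print(accuracy(matrix))
```

Running the same two functions on the predictions of each classifier, for each feature vector, gives a like-for-like basis for the comparison.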
Enhance your question answering system (from lab 4) using the classification/clustering techniques with your datasets (questions and answers) by following steps in task(1).
- Report on the results of clustering your questions/answers datasets: K-Means vs LDA
- Report on the results of the different feature vectors used in classification on your questions/answers datasets
  - FV1: Data => NLP => TF => Feature Vector
  - FV2: Data => NLP => TFIDF => Feature Vector
  - FV3: Data => TFIDF => Feature Vector
- Report on the results and comparative evaluation of the following classification algorithms using different feature vectors (FV1, FV2, FV3)
  - Random Forest
  - Naïve Bayes
  - Decision Tree
- Report your insights on each of the tasks.
The question answering system should be able to enrich the questions and answers using machine learning approaches (classification of questions and answers).
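As a rough illustration of classifying questions into categories, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is an assumption-laden example rather than the lab's Spark implementation: the function names, the whitespace tokenization, and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    # examples: list of (question_text, category) pairs.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in examples:
        class_counts[category] += 1
        for token in text.lower().split():
            word_counts[category][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def predict_category(model, question):
    # Pick the category with the highest log posterior score.
    class_counts, word_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(class_counts):
        score = math.log(class_counts[category] / total_examples)
        total_words = sum(word_counts[category].values())
        for token in question.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[category][token] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


if __name__ == "__main__":
    # Made-up training questions, not taken from the lab datasets.
    training = [
        ("who won the football match", "sport"),
        ("which team scored the goal", "sport"),
        ("what is the new phone price", "tech"),
        ("which laptop has the best processor", "tech"),
    ]
    model = train_naive_bayes(training)
    print(predict_category(model, "who scored the goal"))
    print(predict_category(model, "best laptop price"))
```

Once a question has been assigned a category, the answer lookup can be restricted to documents in that category, which is the general idea behind category-based answering.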
The question answering system has been extended with machine learning approaches, namely clustering and classification. First, the given dataset is preprocessed using NLP, OpenIE, ConceptNet, WordNet, LDA, and TFIDF, and the processed dataset is passed to the machine learning algorithms to enrich it further and produce more accurate answers. The Random Forest algorithm is trained on the input dataset from BBC and on the Yahoo question answering dataset, and it reports the accuracy and the feature vector. Using the feature vector, the dataset is classified into different classes; when a question is asked, it is categorized and answered based on its category. The system therefore predicts more accurate answers than before. With these approaches, the system can also predict approximate answers to new questions that it was not trained on. The following are some questions and answers that have been enriched using machine learning approaches.