Lab Assignment 3 - SaratM34/KDM-Lab-Assignments GitHub Wiki

Name: Mudunuri Sri Sai Sarat Chandra Varma
ClassId: 14

Objective: The objective of this lab assignment is to create a simple question answering system as an extension of the dataset and tasks done in (2), continuing from Tutorial 2. The question answering system should be able to enrich the questions and answers based on TF-IDF, W2V and N-gram approaches.

1. In-Class Questions

Question 1:

Generate the output (changes or transformations in the data) manually when the following tasks are applied to the input text. Show your output in detail.

Input:
Doc#1: The dog saw John in the park
Doc#2: The little bear saw the fine fat trout in the rocky brook.
Doc#3: The dog started chasing John.
Doc#4: The little bear caught a fish in the rocky brook.

Tasks:
a. Find out the top TF-IDF words for the above input.
b. Find out the top TF-IDF words for the lemmatized input.
c. Find out the top TF-IDF words for the n-gram based input.

a. Find out the top TF-IDF words for the above input.

Answer: Below are the top TF-IDF words for the given input; a brief worked example of the scoring follows the list.

  • Fat
  • Fine
  • Park
  • Caught
  • Chasing
  • Started
  • Trout
  • Fish
  • Saw
  • Rocky
  • Little
  • Brook
  • Bear
  • Dog
  • John
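
For reference, a short worked example of how these rankings arise, assuming the standard weighting tf-idf(t, d) = tf(t, d) · log(N / df(t)) with N = 4 documents, lowercased tokens, and common words such as "the", "in" and "a" treated as stop words:

```latex
% Assumed weighting: tfidf(t,d) = tf(t,d) * log(N / df(t)), with N = 4 documents.
\mathrm{tfidf}(\text{fat}, \mathrm{Doc2}) = 1 \cdot \ln\tfrac{4}{1} \approx 1.39
  % "fat" occurs in only one document, so it ranks highest
\mathrm{tfidf}(\text{dog}, \mathrm{Doc1}) = 1 \cdot \ln\tfrac{4}{2} \approx 0.69
  % "dog" occurs in two documents, so it ranks lower
\mathrm{tfidf}(\text{the}, \mathrm{Doc2}) = 3 \cdot \ln\tfrac{4}{4} = 0
  % "the" occurs in every document, so it carries no weight
```

Document-specific terms such as "fat", "fine" and "park" therefore rank above shared terms such as "dog" or "bear", while words present in every document drop to zero.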

b. Find out the top TF-IDF words for the lemmatized input.

Answer: Below are the top TF-IDF words for the lemmatized input.

  • fat
  • fine
  • park
  • catch
  • chase
  • start
  • trout
  • fish
  • see
  • rocky
  • little
  • brook
  • bear
  • dog
  • John

c. Find out the top TF-IDF words for the n-gram based input.

Answer: Below are the top TF-IDF words for the n-gram based input.

  • chase
  • fat
  • park
  • a
  • chase
  • fine
  • fish
  • catch
  • start
  • fine
  • park
  • fish
  • trout
  • fat
  • brook
  • trout
  • catch
  • start
  • rocky
  • John

Draw the Map-Reduce diagram for TF-IDF and W2V (similar to Lab 2).

  • Map-Reduce Diagram for the TF-IDF

  • Map-Reduce Diagram for the W2V

Question 2:

Write a simple Spark program to read a dataset and find the W2V synonyms for the top TF-IDF words.
a. Try without NLP.
b. Try with Lemmatization.
c. Try with NGrams.
Compare the results from (a), (b) and (c).

a. Try without NLP: Below are the Word2Vec (W2V) synonyms for the top TF-IDF words.

Answer:

  • Reading the dataset:

  • Top TF-IDF Words:

  • Finding W2V synonyms for the top TF-IDF words without CoreNLP (a minimal sketch of this pipeline is shown below):
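
A minimal sketch of the part (a) pipeline in Scala with Spark ML, assuming a hypothetical plain-text input at data/docs.txt with one document per line (the actual dataset and paths used in the lab differ):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Word2Vec}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TFIDF-W2V").master("local[*]").getOrCreate()
import spark.implicits._

// a. Read the dataset and do a plain whitespace/punctuation split (no NLP).
val docs = spark.read.textFile("data/docs.txt")
  .map(line => line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
  .toDF("words")

// Term frequency with an explicit vocabulary, so indices map back to words.
val cvModel: CountVectorizerModel =
  new CountVectorizer().setInputCol("words").setOutputCol("tf").fit(docs)
val tf = cvModel.transform(docs)

// Inverse document frequency on top of the raw counts.
val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf)
val tfidf = idfModel.transform(tf)

// Rank vocabulary terms by their IDF weight as a rough "top TF-IDF" list.
val topWords = cvModel.vocabulary.zip(idfModel.idf.toArray)
  .sortBy(-_._2).take(10).map(_._1)

// Train Word2Vec on the same tokenized documents.
val w2vModel = new Word2Vec().setInputCol("words").setOutputCol("vec")
  .setVectorSize(100).setMinCount(0).fit(docs)

// Print synonyms for each top TF-IDF word; findSynonyms throws if the word
// never made it into the W2V vocabulary, so guard with Try.
topWords.foreach { w =>
  println(s"--- $w ---")
  scala.util.Try(w2vModel.findSynonyms(w, 5).show(false))
}
```

Ranking the vocabulary by IDF weight is only a rough proxy for the per-document TF-IDF scores, but it is enough here to surface the document-specific terms.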

b. Try with Lemmatization:

Answer:

  • Top TF-IDF Words:

  • Finding W2V synonyms for the top TF-IDF words with lemmatization (see the sketch below):
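
A sketch of the lemmatization step, assuming the Stanford CoreNLP jars and models are on the classpath; the SparkSession, implicits and the TF-IDF/W2V stages from the part (a) sketch are reused on the resulting "words" column:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}

// Turn one raw line of text into lowercased lemmas ("saw" -> "see", "caught" -> "catch").
def lemmatize(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = for {
    sentence <- doc.get(classOf[SentencesAnnotation]).asScala
    token    <- sentence.get(classOf[TokensAnnotation]).asScala
  } yield token.get(classOf[LemmaAnnotation]).toLowerCase
  lemmas.toSeq.filter(_.forall(_.isLetter)) // drop punctuation tokens
}

// Build one CoreNLP pipeline per partition so it is not serialized across the cluster.
val lemmatized = spark.read.textFile("data/docs.txt").mapPartitions { lines =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  lines.map(line => lemmatize(line, pipeline))
}.toDF("words")
```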

c. Try with NGrams:

Answer:

  • Top TF-IDF Words:

  • Finding W2V synonyms for the top TF-IDF words with N-grams (see the sketch below):
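
A sketch of the N-gram variant using Spark ML's NGram transformer, assuming the tokenized docs DataFrame from the part (a) sketch:

```scala
import org.apache.spark.ml.feature.NGram

// Turn the token column into bigrams.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
val ngramDocs = ngram.transform(docs)

ngramDocs.select("ngrams").show(false)
// e.g. Doc#1 -> ["the dog", "dog saw", "saw john", "john in", "in the", "the park"]

// Downstream, point CountVectorizer / IDF / Word2Vec at the "ngrams" column
// instead of "words" to rank and expand bigrams rather than single terms.
```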

Compare the results from (a), (b) and (c): Comparing the W2V outputs without NLP, with lemmatization, and with N-grams, we can observe that the words are filtered by using the top TF-IDF words as input; after applying these steps the outputs for the questions are noticeably more accurate than before, i.e. without TF-IDF or N-grams.

3. Take-Home Question

Create a simple question answering system as an extension of the dataset and tasks done in (2), continuing from Tutorial 2. The question answering system should be able to enrich the questions and answers based on TF-IDF, W2V and N-gram approaches. Use the diagram to guide you.

Question answering with enriched questions and answers using TF-IDF and W2V:

  • Using TF-IDF, W2V and N-gram to enrich the questions and answers:

The documents are processed for term frequency and inverse document frequency, the top TF-IDF words are filtered out, and both the questions and the answers are enriched with them. The filtered top TF-IDF words are then fed to W2V and the N-gram pipeline to obtain synonyms. When a question is asked, it is matched against the enriched set of top TF-IDF words and the corresponding answer is output. An illustrative sketch of this enrichment step follows.
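
As an illustration of that flow (the helper names enrich and answerQuestion are hypothetical, not taken from the lab code), the sketch below expands a question's top TF-IDF terms with their W2V synonyms and picks the candidate answer with the largest overlap, reusing w2vModel and topWords from the Question 2 sketch:

```scala
import scala.util.Try

// Expand a set of terms with their nearest Word2Vec neighbours.
def enrich(terms: Seq[String], numSynonyms: Int = 3): Set[String] =
  terms.flatMap { t =>
    val syns = Try(
      w2vModel.findSynonyms(t, numSynonyms).collect().map(_.getString(0)).toSeq
    ).getOrElse(Seq.empty) // word may be missing from the W2V vocabulary
    t +: syns
  }.toSet

// Score each candidate answer sentence by its overlap with the enriched question.
def answerQuestion(question: String, candidates: Seq[String]): String = {
  val qTerms = enrich(
    question.toLowerCase.split("\\W+").filter(w => topWords.contains(w)).toSeq)
  candidates.maxBy { c =>
    val cTerms = enrich(c.toLowerCase.split("\\W+").toSeq)
    (qTerms intersect cTerms).size
  }
}
```

Matching on the enriched term sets rather than the raw question words is what lets the system answer questions that use synonyms of the document vocabulary.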

The following are the code snippets for TF-IDF, W2V and N-gram:

Question 1:

Question 2:

Question 3:

Question 4:

Question 5:
