Tutorial 3 - Nagumkc/CS5560_KDM_Lab-assignments GitHub Wiki
NAME: Nageswara rao Nandigam
ClassID: 18
StudentID: 16244177
Lab Assignment #3
1. In Class Question
Manually generate the output (the changes or transformations in the data) when the following tasks are applied to the input text. Show your output in detail.
Input:
Doc#1: The dog saw John in the park
Doc#2: The little bear saw the fine fat trout in the rocky brook.
Doc#3: The dog started chasing John.
Doc#4: The little bear caught a fish in the rocky brook.
Tasks:
a. Find out the top TF-IDF words for the above input.
(fine,0.9162907318741551)
(started,0.9162907318741551)
(park,0.9162907318741551)
(fat,0.9162907318741551)
(caught,0.9162907318741551)
(chasing,0.9162907318741551)
(fish,0.9162907318741551)
(a,0.9162907318741551)
(trout,0.9162907318741551)
(rocky,0.5108256237659907)
(little,0.5108256237659907)
(brook,0.5108256237659907)
(saw,0.5108256237659907)
(John,0.5108256237659907)
(bear,0.5108256237659907)
(dog,0.5108256237659907)
(in,0.22314355131420976)
(the,0.22314355131420976)
(The,0.0)
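The scores listed above are consistent with the smoothed inverse document frequency ln((N + 1) / (df + 1)), which is the formula Spark MLlib's IDF uses, with N = 4 documents and a term frequency of 1. The following is a minimal Python sketch under that assumption. The tokenization (whitespace split, trailing periods stripped, case preserved) is also an assumption chosen to match the listed words, and the function names are illustrative rather than taken from the lab code.

```python
import math

docs = [
    "The dog saw John in the park",
    "The little bear saw the fine fat trout in the rocky brook.",
    "The dog started chasing John.",
    "The little bear caught a fish in the rocky brook.",
]

def tokenize(text):
    # Assumption: split on whitespace and strip periods; case is kept,
    # so "The" and "the" count as different terms.
    return [w.strip(".") for w in text.split()]

def smoothed_idf(token_lists):
    # Document frequency: number of documents that contain each term.
    n_docs = len(token_lists)
    doc_freq = {}
    for tokens in token_lists:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Smoothed IDF: ln((N + 1) / (df + 1)).
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

idf = smoothed_idf([tokenize(d) for d in docs])
for term, score in sorted(idf.items(), key=lambda kv: -kv[1]):
    print((term, score))
```

With this weighting a term found in one document scores ln(5/2), a term found in two scores ln(5/3), a term found in three scores ln(5/4), and a term found in all four scores 0.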
b. Find out the top TF-IDF words for the lemmatized input.
(fine,0.9162907318741551)
(park,0.9162907318741551)
(fat,0.9162907318741551)
(catch,0.9162907318741551)
(fish,0.9162907318741551)
(chase,0.9162907318741551)
(start,0.9162907318741551)
(a,0.9162907318741551)
(trout,0.9162907318741551)
(see,0.5108256237659907)
(rocky,0.5108256237659907)
(little,0.5108256237659907)
(brook,0.5108256237659907)
(John,0.5108256237659907)
(bear,0.5108256237659907)
(dog,0.5108256237659907)
(in,0.22314355131420976)
(the,0.0)
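The effect of lemmatization on these scores can be sketched in Python as follows. The lemma table is hypothetical and covers only the inflected forms in this input; a real NLP lemmatizer would replace the lookup. The weighting assumes the smoothed form ln((N + 1) / (df + 1)), which matches the values listed above.

```python
import math

docs = [
    "The dog saw John in the park",
    "The little bear saw the fine fat trout in the rocky brook.",
    "The dog started chasing John.",
    "The little bear caught a fish in the rocky brook.",
]

# Hypothetical lemma table for illustration only.
LEMMAS = {
    "The": "the",
    "saw": "see",
    "started": "start",
    "chasing": "chase",
    "caught": "catch",
}

def lemmatize(text):
    # Strip periods, then map each token to its lemma when one is known.
    tokens = [w.strip(".") for w in text.split()]
    return [LEMMAS.get(t, t) for t in tokens]

lemmatized = [lemmatize(d) for d in docs]
n_docs = len(lemmatized)

# Count the number of documents each lemma appears in.
doc_freq = {}
for tokens in lemmatized:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

lemma_idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
for term, score in sorted(lemma_idf.items(), key=lambda kv: -kv[1]):
    print((term, score))
```

Because "The" and "the" collapse into one lemma that occurs in every document, its score drops to 0, and "saw" is replaced by "see" with the same document frequency.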
c. Find out the top TF-IDF words for the n-gram based input.
(fat trout,0.9162907318741551)
(caught a,0.9162907318741551)
(bear caught,0.9162907318741551)
(fine fat,0.9162907318741551)
(the park,0.9162907318741551)
(the fine,0.9162907318741551)
(dog saw,0.9162907318741551)
(trout in,0.9162907318741551)
(saw the,0.9162907318741551)
(bear saw,0.9162907318741551)
(started chasing,0.9162907318741551)
(a fish,0.9162907318741551)
(chasing John,0.9162907318741551)
(John in,0.9162907318741551)
(fish in,0.9162907318741551)
(saw John,0.9162907318741551)
(dog started,0.9162907318741551)
(The dog,0.5108256237659907)
(the rocky,0.5108256237659907)
(rocky brook,0.5108256237659907)
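The n-gram case can be sketched the same way by treating each pair of adjacent words as one term. This Python sketch assumes bigrams (n = 2), the same period-stripping tokenization, and the smoothed weighting ln((N + 1) / (df + 1)); the helper name is illustrative.

```python
import math

docs = [
    "The dog saw John in the park",
    "The little bear saw the fine fat trout in the rocky brook.",
    "The dog started chasing John.",
    "The little bear caught a fish in the rocky brook.",
]

def bigrams(text):
    # Pair each token with its successor to form two-word terms.
    tokens = [w.strip(".") for w in text.split()]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigram_docs = [bigrams(d) for d in docs]
n_docs = len(bigram_docs)

# Count the number of documents each bigram appears in.
doc_freq = {}
for grams in bigram_docs:
    for gram in set(grams):
        doc_freq[gram] = doc_freq.get(gram, 0) + 1

bigram_idf = {g: math.log((n_docs + 1) / (df + 1)) for g, df in doc_freq.items()}
for gram, score in sorted(bigram_idf.items(), key=lambda kv: -kv[1]):
    print((gram, score))
```

Most bigrams occur in a single document, which is why so many of them share the top score.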
2. In Class Question
Write a simple Spark program to read a dataset and find the W2V synonyms for the top TF-IDF words.
a. Try without NLP
Synonyms for dog:
caught 6.275484480613382E-4
John. 6.159217595860986E-4
Synonyms for John:
dog 6.380080398033057E-4
park 5.172858663396528E-4
in 5.087652541175444E-4
The 4.692270897133099E-4
rocky 3.437118454735725E-4
Synonyms for caught:
bear 6.798945591665026E-4
dog 6.185775640231738E-4
The 4.299925528788974E-4
code:
output:
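Word2Vec synonym lookups typically rank the other vocabulary words by the cosine similarity between their embedding vectors and the query word's vector. The Python sketch below illustrates that ranking step only. The three-dimensional vectors are made up for illustration and are not the trained model from this lab, and `nearest_words` is a hypothetical helper, not a Spark API.

```python
import math

# Hypothetical toy embeddings; a trained Word2Vec model would supply these.
vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "bear":  [0.8, 0.2, 0.1],
    "park":  [0.1, 0.9, 0.2],
    "brook": [0.0, 0.8, 0.3],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(word, k=2):
    # Score every other word against the query and keep the k best.
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: -pair[1])[:k]

print(nearest_words("dog"))
```

On a corpus of only four sentences the learned vectors carry very little signal, which may explain why the similarity values reported here are so close to zero.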
b. Try with Lemmatization
Synonyms for Fine:
The 8.713233708040252E-4
John 5.456893775785099E-4
started 3.3536475137531396E-4
fish 3.013059033037166E-4
John. 2.1825895982814292E-4
Synonyms for Park:
The 4.6677654687335E-4
John 4.3175463488619956E-4
caught 2.928778318369419E-4
brook. 2.6030819920509193E-4
Code:
Output:
c. Try with NGrams
Synonyms for caught a:
fish in : 2.674634886195956E-4
ball hear: 2.5775519920509193E-4
Synonyms for The dog:
Saw john : 4.456778318369419E-4
started chasing: 4.416619920509193E-4
code:
output:
3. Take home Question
Create a simple question answering system as an extension of the dataset and tasks done in (2), continuing from Tutorial 2. The question answering system should be able to enrich the questions and answers based on the TF-IDF, W2V and N-Gram approaches. Use the diagram to guide you.
Stop words are removed from the dataset before the answer set is processed with the help of TF-IDF.
Output file: After removing stop words
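The stop-word removal step can be sketched in Python as follows. The stop-word list is an assumption made for illustration, since the list used in the lab is not shown, and the function name is hypothetical.

```python
# Assumed stop-word list for illustration; the lab's actual list is not shown.
STOP_WORDS = {"the", "a", "in"}

def remove_stop_words(text):
    # Strip periods, then drop tokens whose lowercase form is a stop word.
    tokens = [w.strip(".") for w in text.split()]
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The little bear caught a fish in the rocky brook."))
```

Filtering before scoring matters on a corpus this small: in the earlier listings "a" receives the same top score as content words such as "trout" simply because it occurs in only one document.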
Question 1: Who
Output:
Question 2: where
Output:
Question 3: when
Output: