LAB ASSIGNMENT 4
CS5560 Knowledge Discovery Management
Name: Sreelakshmi Nandanamudi
Class ID: 17
Triplet Extraction
- OpenIE (CoreNLP): Open information extraction refers to the extraction of structured relation triples from plain text.
Sentence: Barack Obama was born in Hawaii
Triple: (Barack Obama; was born in; Hawaii), corresponding to the open-domain relation "was born in".
- ConceptNet: Concept extraction using ConceptNet 5. Its REST API is used to retrieve concepts and the relationships between them, and the online service provides a list of possible relations.
WordNet: WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
LDA: Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
- MLlib implementation of LDA
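To make the phrase "generative statistical model" concrete, here is a minimal sketch of how LDA assumes a document is produced: topic proportions are drawn for the document, then every word position gets an unobserved topic and a word drawn from that topic. The vocabulary, the two topic-word distributions and the function names are made up for illustration. Full LDA also draws each topic from a Dirichlet prior, which is skipped here by fixing the topics. This is not the MLlib implementation, which works in the opposite direction and infers the topics from observed documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Pick an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document: theta ~ Dir(alpha), then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)      # document-topic proportions
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)          # unobserved topic of this word
        w = sample_index(topics[z], rng)      # observed word drawn from that topic
        words.append(vocab[w])
    return theta, words

# Made-up vocabulary and topic-word distributions, for illustration only
VOCAB = ["election", "vote", "goal", "match"]
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],   # a "politics" topic
    [0.0, 0.0, 0.5, 0.5],   # a "sports" topic
]

if __name__ == "__main__":
    rng = random.Random(0)
    theta, words = generate_document(TOPICS, VOCAB, [0.5, 0.5], 10, rng)
    print(theta)
    print(words)
```

Documents whose proportions lean towards the same topic end up sharing vocabulary, which is the sense in which the unobserved groups explain why parts of the data are similar.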
1. In Class Question
Write a simple Spark program to read a dataset and perform the following tasks:
a. Extract triplets using OpenIE
The screenshot below shows the output of extracting triplets using OpenIE.
b. Extract semantic meaning using ConceptNet
The screenshot below shows the output of extracting semantic meaning using ConceptNet.
c. Extract synonyms using WordNet
The screenshot below shows the output of extracting synonyms using WordNet.
d. Group the data with LDA using each of the pipelines given below and compare the results
i. Data => LDA
The screenshot below shows the output after applying LDA directly to the data.
ii. Data => NLP => LDA
The screenshot below shows the output after processing the data with NLP operations and passing the result to LDA.
iii. Data => NLP => StopWord => LDA
The screenshot below shows the output after processing the data with NLP operations, removing stop words, and passing the result to LDA.
iv. Data => NLP => StopWord => TFIDF => LDA
The screenshot below shows the output after processing the data with NLP operations, removing stop words, calculating TF-IDF values, and passing the result to LDA.
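The four pipelines differ only in what is handed to LDA, so the comparison can be sketched without Spark. The plain-Python sketch below builds the per-document term weights for each variant. Several things are assumptions and not taken from the lab code: the variant names, the tiny stop-word list, the use of a lowercasing regex tokenizer as a stand-in for the NLP step, and the smoothed IDF form log((N + 1) / (df + 1)).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}
VARIANTS = ("data", "nlp", "stopword", "tfidf")   # hypothetical names for i. to iv.

def tokenize(text):
    """Stand-in for the NLP step: lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(counts):
    """Weight raw counts by idf = log((N + 1) / (df + 1)) over N documents."""
    n_docs = len(counts)
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * math.log((n_docs + 1) / (df[t] + 1)) for t, tf in c.items()}
        for c in counts
    ]

def pipeline(raw_docs, variant):
    """Return the per-document term weights that would be handed to LDA."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if variant == "data":
        docs = [d.split() for d in raw_docs]           # i. whitespace split only
    else:
        docs = [tokenize(d) for d in raw_docs]         # ii. NLP step
    if variant in ("stopword", "tfidf"):
        docs = [remove_stop_words(d) for d in docs]    # iii. stop words removed
    counts = [Counter(d) for d in docs]
    if variant == "tfidf":
        return tf_idf(counts)                          # iv. TF-IDF weights
    return [dict(c) for c in counts]

if __name__ == "__main__":
    docs = ["The cat sat.", "The dog sat in the sun."]
    for v in VARIANTS:
        print(v, pipeline(docs, v))
```

Running it on two short sentences shows the effect of each stage: punctuation and case variants disappear after the NLP step, function words disappear after stop-word removal, and a term that occurs in every document gets weight zero under TF-IDF.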
Report your insights on each of the tasks
In the first task I used open information extraction (Open IE), which refers to the extraction of relation tuples, typically binary relations, from plain text. The central difference from other information extraction approaches is that the schema for these relations does not need to be specified in advance; typically the relation name is just the text linking two arguments. Here I took a dataset from the news domain, and the corresponding triplets were extracted based on the relations between text elements.
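The idea that the relation name is just the text linking two arguments can be shown with a toy function. This is only an illustration and not how OpenIE works internally: a real extractor finds the arguments itself by parsing the sentence, whereas here the caller supplies them. The `Triple` type and the `linking_relation` name are hypothetical.

```python
from typing import NamedTuple, Optional

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str

def linking_relation(sentence: str, arg1: str, arg2: str) -> Optional[Triple]:
    """Return (arg1; text in between; arg2) if both arguments occur in that order."""
    start = sentence.find(arg1)
    if start == -1:
        return None
    after_arg1 = start + len(arg1)
    end = sentence.find(arg2, after_arg1)
    if end == -1:
        return None
    relation = sentence[after_arg1:end].strip()
    return Triple(arg1, relation, arg2) if relation else None

if __name__ == "__main__":
    print(linking_relation("Barack Obama was born in Hawaii", "Barack Obama", "Hawaii"))
```

No relation vocabulary is declared anywhere in the code, which is the schema-free property described above.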
In the second task I used ConceptNet, which focuses on semantic relationships between compound concepts. For each word in the given dataset, the corresponding semantic meaning is extracted.
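A lookup of this kind goes through ConceptNet's REST API. The sketch below builds the lookup URL for a term and flattens a response into (start, relation, end, weight) tuples. It assumes the public endpoint `http://api.conceptnet.io` and the `edges` / `start` / `rel` / `end` / `weight` response fields. The sample payload is hand-written in that shape and is not real ConceptNet output, and no network call is made.

```python
import json
from urllib.parse import quote

API_ROOT = "http://api.conceptnet.io"   # assumed public ConceptNet 5 endpoint

def concept_url(term, lang="en"):
    """Build the lookup URL for a term, e.g. 'ice cream' -> .../c/en/ice_cream."""
    slug = quote(term.strip().lower().replace(" ", "_"))
    return f"{API_ROOT}/c/{lang}/{slug}"

def edges_to_relations(payload):
    """Turn a response dict into (start, relation, end, weight) tuples, strongest first."""
    relations = []
    for edge in payload.get("edges", []):
        relations.append((
            edge["start"]["label"],
            edge["rel"]["label"],
            edge["end"]["label"],
            edge.get("weight", 1.0),
        ))
    return sorted(relations, key=lambda r: r[3], reverse=True)

# Hand-written sample in the assumed response shape (not real API output)
SAMPLE = json.loads("""
{"edges": [
  {"start": {"label": "dog"}, "rel": {"label": "IsA"},
   "end": {"label": "animal"}, "weight": 2.0},
  {"start": {"label": "dog"}, "rel": {"label": "CapableOf"},
   "end": {"label": "bark"}, "weight": 4.0}
]}
""")

if __name__ == "__main__":
    print(concept_url("ice cream"))
    for relation in edges_to_relations(SAMPLE):
        print(relation)
```

In a live run, the JSON fetched from `concept_url(word)` would take the place of `SAMPLE`.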
In the third task I used WordNet, which is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Here the corresponding synonyms are extracted for the words in the given dataset, which is very useful in a question answering system.
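The synset structure can be pictured with a miniature, hand-written table. The entries below are not read from WordNet; the identifiers merely imitate a common `word.pos.nn` naming style, and a real run would query the WordNet database through a library. The point of the sketch is that a word with several senses belongs to several synsets, and its synonyms are the other members of those sets.

```python
# Hand-written miniature synset table; real WordNet entries are far richer
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile", "motorcar"},
    "car.n.02": {"car", "railcar", "railway car"},
    "buy.v.01": {"buy", "purchase"},
}

def synsets_of(word):
    """Ids of every synset containing the word (one per sense)."""
    return sorted(sid for sid, lemmas in SYNSETS.items() if word in lemmas)

def synonyms(word):
    """All other lemmas sharing at least one synset with the word."""
    found = set()
    for sid in synsets_of(word):
        found |= SYNSETS[sid]
    found.discard(word)
    return sorted(found)

if __name__ == "__main__":
    print(synsets_of("car"))
    print(synonyms("car"))
    print(synonyms("purchase"))
```

For question answering, such a lookup lets a question phrased with "purchase" match text that says "buy".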
In the fourth task I used LDA on my dataset in four ways. First, I applied LDA directly to the dataset. Second, I applied NLP operations to the dataset before LDA. Third, I additionally removed stop words. Fourth, I also calculated TF-IDF values before passing the data to LDA.
2. Take-Home Question
Create a simple question answering system as an extension of the dataset and tasks done in (1).
Continuation from Tutorial 3.
a. Use OpenIE for triplet extraction
b. Use ConceptNet to enhance the semantic meaning of entities
c. Use WordNet and LDA to enhance the answers and reformulate or group questions
The question answering system has been extended using OpenIE, ConceptNet, WordNet, and LDA.
Question1:
Answer1
Question2:
Answer2
Question3:
Answer3
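A question answering step over extracted triples could be sketched as follows. This is a minimal illustration under stated assumptions and not the system behind the answers above: the second triple is added for illustration, the one-entry synonym map is a hypothetical stand-in for a WordNet lookup, and the scoring rule (subject must be mentioned, then count words shared with the relation phrase) is a deliberately simple choice.

```python
import re

# Triples in (subject; relation; object) form; the second one is added for illustration
TRIPLES = [
    ("Barack Obama", "was born in", "Hawaii"),
    ("Barack Obama", "studied at", "Harvard"),
]

# Hypothetical synonym map standing in for a WordNet lookup
SYNONYMS = {"attended": "studied"}

def words(text):
    """Lowercased word set with synonyms mapped to a canonical form."""
    return {SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text.lower())}

def answer(question, triples=TRIPLES):
    """Return the object of the best-matching triple, or None if nothing matches."""
    q = words(question)
    best, best_score = None, 0
    for subj, rel, obj in triples:
        if not words(subj) <= q:        # the subject must be mentioned in the question
            continue
        score = len(words(rel) & q)     # overlap with the relation phrase
        if score > best_score:
            best, best_score = obj, score
    return best

if __name__ == "__main__":
    print(answer("Where was Barack Obama born?"))
    print(answer("Which school has Barack Obama attended?"))
```

The second question only matches because "attended" is mapped to "studied", which is the role synonym expansion plays in reformulating questions.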