LAB ASSIGNMENT 3 - Sreelakshmi-N/CS5560SreelakshmiLabAssignment GitHub Wiki

CS5560 Knowledge Discovery Management

Name: Sreelakshmi Nandanamudi

Class ID: 17

Tf-idf

Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is a statistical measure, widely used in information retrieval and text mining, of how important a word is to a document in a collection or corpus. The importance increases in proportion to the number of times the word appears in the document but is offset by how frequently the word occurs across the corpus. Search engines often use variations of the tf-idf weighting scheme as a central tool for scoring and ranking a document's relevance to a user query.
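The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab's Spark code: it assumes the common variant tf = count / document length and idf = ln(N / df), and the function name `tf_idf` and the two-document toy corpus are made up for the example. Libraries such as Spark MLlib use a smoothed idf, so their absolute values may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    "the dog saw the cat".split(),
    "the cat sat".split(),
]
scores = tf_idf(corpus)

# Rank the terms of the first document from most to least important.
ranked = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Under this variant a word that occurs in every document (here "the" and "cat") gets weight 0 however often it appears, which is the "offset by the corpus frequency" effect described above.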

N-Gram

An n-gram is a sequence of n tokens (typically words) for some integer n.

• The NGram class can be used to transform input features into n-grams.

• NGram takes as input a sequence of strings (e.g. the output of a Tokenizer).

• The output consists of a sequence of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words.

Example:

[input, output] for n = 2

[(go, sample, text),(go sample, sample text)]

[(run, good, health),(run good, good health)]

[(computer, science, study),(computer science, science study)]
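The transformation in the example can be reproduced with a small plain-Python function. This is only a sketch of the idea, not Spark's NGram implementation; the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one space-delimited string."""
    # When the input has fewer than n tokens the range is empty,
    # so the result is an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

examples = [
    ["go", "sample", "text"],
    ["run", "good", "health"],
    ["computer", "science", "study"],
]
bigrams = [ngrams(tokens, 2) for tokens in examples]
for tokens, grams in zip(examples, bigrams):
    print(tokens, grams)
```

A sequence of k tokens yields k - n + 1 n-grams, so each three-word input above produces two bigrams.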

In Class Question

Generate the output (changes or transformations in the data) manually when the following tasks are applied to the input text. Show your output in detail.

Input:

Doc#1: The dog saw John in the park

Doc#2: The little bear saw the fine fat trout in the rocky brook.

Doc#3: The dog started chasing John.

Doc#4: The little bear caught a fish in the rocky brook.

Tasks:

a. Find out the top TF-IDF words for the above input.

b. Find out the top TF-IDF words for the lemmatized input.

c. Find out the top TF-IDF words for the n-gram based input.

Draw a Map-Reduce diagram for TF-IDF and W2V (similar to Lab 2).

2. In Class Question

Write a simple Spark program to read a dataset and find the W2V synonyms for the top TF-IDF words.

a. Try without NLP

b. Try with Lemmatization

c. Try with NGrams

Compare the results from (a), (b), and (c).

Take home Question

Create a simple question answering system as an extension of the dataset and tasks done in (2). This is a continuation of Tutorial 2.

The question answering system should be able to enrich the questions and answers based on the TF-IDF, W2V, and N-Gram approaches. Use the diagram to guide you.

a. Use the Question Type

b. Q&A System

I took the output text produced by the NLP operations and used it as the input for the question answering part.