LAB ASSIGNMENT 2 - Sreelakshmi-N/CS5560SreelakshmiLabAssignment GitHub Wiki

CS5560 Knowledge Discovery Management

Name: Sreelakshmi Nandanamudi

Classid: 17

1. In Class Question

Generate the output (changes or transformations in the data) manually when the following Spark tasks are applied to the input text. Show your output in detail.

Input:

The dog saw John in the park

The little bear saw the fine fat trout in the rocky brook.

Spark Tasks:

a. Map vs FlatMap

a) map

It returns a new RDD by applying a function to all elements of input RDD.

b) flatMap

It returns a new RDD by first applying a function to all elements of input RDD, then flattening the results.
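The difference can be sketched on the two input sentences. This is a minimal illustration that uses plain Java streams as a stand-in for an RDD so that it runs without a Spark installation; the class name `MapVsFlatMap` and its method names are hypothetical, and splitting on a single space is an assumption about how the lines are tokenized.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVsFlatMap {
    // map: exactly one output element per input element,
    // so each line becomes one list of words (the nesting is kept).
    static List<List<String>> mapLines(List<String> lines) {
        return lines.stream()
                .map(line -> Arrays.asList(line.split(" ")))
                .collect(Collectors.toList());
    }

    // flatMap: the same split is applied, then the per-line results
    // are flattened into a single sequence of words.
    static List<String> flatMapLines(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The dog saw John in the park",
                "The little bear saw the fine fat trout in the rocky brook.");
        System.out.println(mapLines(lines));     // two nested lists, one per line
        System.out.println(flatMapLines(lines)); // one flat list of all words
    }
}
```

With map the result still has two elements (one per input line), whereas flatMap yields one element per word.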

b. MapReduce

MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster.

"Map" step:

Each worker node applies the "map()" function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.

"Shuffle" step:

Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.

"Reduce" step:

Worker nodes now process each group of output data, per key, in parallel.
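The three steps can be traced on the input text with a word count. The sketch below runs on a single machine in plain Java, so it only imitates what a cluster would do in parallel; the class `WordCountSteps` and its method names are hypothetical, and lower-casing the words and splitting on non-word characters are assumptions made for the illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSteps {
    // "Map" step: emit a (word, 1) pair for every word of every line.
    static List<Map.Entry<String, Integer>> mapStep(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Shuffle" step: collect all values that belong to the same key.
    static Map<String, List<Integer>> shuffleStep(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values of each key independently.
    static Map<String, Integer> reduceStep(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The dog saw John in the park",
                "The little bear saw the fine fat trout in the rocky brook.");
        System.out.println(reduceStep(shuffleStep(mapStep(lines))));
    }
}
```

After the map step every word carries the value 1; the shuffle step turns that into entries such as the = [1, 1, 1, 1, 1], and the reduce step sums each list into a count.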

c. Group by Starting Letter (draw a diagram showing how the Spark methods used change the data, similar to the word count diagram shown below)

2. In Class Question

Write a simple Spark program to read a dataset and group each word by the starting letter of its lemmatized word (in this exercise, we assume case-insensitive matching).

a. Write a function F in Java using CoreNLP to extract lemmatized words

b. Call the function F from a Spark transformation function

c. Use the diagram from 1 (c) to guide the rest of the task.

An example of this task is shown below:

Input: This is a question answering system. The question is from Quora.

Output:

T => This, The

A => a, answer

I => is

Q => question, Quora

S => system

F => from

Note that the lemmatized word of answering is answer.
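The grouping in this example can be sketched as follows. This is plain Java rather than the Spark program itself, and the lemmatizer is passed in as a function so that the grouping logic runs without CoreNLP; the class `GroupByFirstLetter` and the one-word `toyLemmatizer` are hypothetical stand-ins for the function F, not a real lemmatizer.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class GroupByFirstLetter {
    // Groups the distinct lemmas of a text under their upper-cased first letter,
    // so that "question" and "Quora" land in the same group (case-insensitive key).
    static Map<Character, Set<String>> group(String text, UnaryOperator<String> lemmatize) {
        Map<Character, Set<String>> groups = new LinkedHashMap<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue; // skip empty pieces produced by leading punctuation
            }
            String lemma = lemmatize.apply(token);
            if (lemma.isEmpty()) {
                continue;
            }
            char key = Character.toUpperCase(lemma.charAt(0));
            groups.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(lemma);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumption: a toy lemmatizer that only knows the one case named in the example.
        UnaryOperator<String> toyLemmatizer = w -> w.equals("answering") ? "answer" : w;
        String text = "This is a question answering system. The question is from Quora.";
        group(text, toyLemmatizer)
                .forEach((letter, words) ->
                        System.out.println(letter + " => " + String.join(", ", words)));
    }
}
```

Using a set per letter keeps repeated words such as "question" from appearing twice in a group. A real lemmatizer may map more words than this stand-in does, which would change the groups accordingly.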

The screenshot below shows the use of Spark's groupBy method to group the words by their starting letter.

3. Take Home Question

Create a simple question answering system as an extension of the dataset and tasks done in (2). Continuation from Tutorial 1B. Make sure to use at least two Spark Transformations and two Spark Actions.

Using Spark methods to read the dataset:

I took a dataset from the news domain and read it using the Spark method shown below. I wrote the code in Java, loaded the file into an RDD, and read each line of the RDD.

Using Spark’s transformations and actions to call the CoreNLP function for processing the dataset:

I used the Spark transformations map and flatMap: map to map the words to their starting letters, and flatMap to flatten the items into a single collection. I used the Spark actions collect, count, and take to count the number of entities and to take particular elements from the processed dataset.
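The split between transformations and actions can be illustrated with a small keyword lookup. The sketch below is not the submitted program: it uses Java streams, whose intermediate operations (filter, map) play a role loosely comparable to Spark transformations and whose terminal operations (count, collect after limit) are loosely comparable to the count and take actions. The class `KeywordAnswers`, its method names, and the idea of answering by keyword match are all assumptions made for illustration, and the sample lines are the input sentences from question 1 rather than the news dataset.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class KeywordAnswers {
    // Two transformation-like steps: keep lines that mention the keyword
    // (case-insensitive), then trim surrounding whitespace.
    static Stream<String> candidates(List<String> lines, String keyword) {
        String needle = keyword.toLowerCase();
        return lines.stream()
                .filter(line -> line.toLowerCase().contains(needle))
                .map(String::trim);
    }

    // Action-like step, comparable to count: how many lines match.
    static long countAnswers(List<String> lines, String keyword) {
        return candidates(lines, keyword).count();
    }

    // Action-like step, comparable to take(n): the first n matching lines.
    static List<String> takeAnswers(List<String> lines, String keyword, int n) {
        return candidates(lines, keyword).limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The dog saw John in the park",
                "The little bear saw the fine fat trout in the rocky brook.");
        System.out.println(countAnswers(lines, "saw"));
        System.out.println(takeAnswers(lines, "bear", 1));
    }
}
```

Nothing is computed until one of the action-like methods is called, which mirrors how Spark defers work until an action runs.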

Simple question answering system using at least two Spark Transformations and two Spark Actions:

Draw a process diagram for the Spark methods used (similar to the word count diagram) to guide you through the coding process. Include your diagram in your report.