LAB 2 WIKI - navyagonug/CS5590-BIG-DATA-PROGRAMMING-USING-HADOOP-AND-SPARK GitHub Wiki

QUESTION 1

OBJECTIVE

Finding Facebook common friends using Apache Spark. In this approach we find the common friends between pairs of people. Take a basic friend list of 4 persons:

Ritika          Jaya, Lohitha, Aishu

Jaya            Lohitha, Aishu

Lohitha         Ritika, Aishu

Aishu           Jaya, Ritika

Here we can see that Ritika has Jaya, Lohitha, and Aishu in her friend list, and Jaya has Lohitha and Aishu in hers. The common friends between Ritika and Jaya are therefore Lohitha and Aishu. As explained, we compute the common friends for the rest of the pairs in the same manner.

The computation is done in two phases: a mapper phase and a reducer phase. In the mapper function, each line is first split on " ". The mapper then emits a key and value for each friend. The slice method is used on the words: it takes a start and end index (here 1 to the size of the words) and returns a new collection with the elements in that range, i.e. the person's friend list. From there the mapping starts, and the key/value pairs are passed on to the reducer phase.

Next comes the reducer phase, where the data is grouped by key. The grouped lists are then intersected to produce the mutual friends; an accumulator is used to perform the intersection.
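As a hypothetical sketch (not the exact lab code), the two phases can be traced on plain Scala collections, assuming a symmetric version of the friend lists above:

```scala
// Symmetric version of the sample friend lists (illustration only).
val friends = Map(
  "Ritika"  -> Set("Jaya", "Lohitha", "Aishu"),
  "Jaya"    -> Set("Ritika", "Lohitha", "Aishu"),
  "Lohitha" -> Set("Ritika", "Jaya", "Aishu"),
  "Aishu"   -> Set("Ritika", "Jaya", "Lohitha")
)

// Mapper: for every (person, friend) edge, emit the sorted pair as the key
// and the person's full friend list as the value.
val mapped = for {
  (person, list) <- friends.toSeq
  friend         <- list.toSeq
  key = if (person < friend) (person, friend) else (friend, person)
} yield key -> list

// Reducer: group by the pair key and intersect the grouped lists; the
// intersection is exactly the set of mutual friends of that pair.
val mutual = mapped
  .groupBy(_._1)
  .map { case (pair, entries) => pair -> entries.map(_._2).reduce(_ intersect _) }
```

Because the pair key is sorted, both persons' emissions meet under the same key, so the reducer sees exactly the two friend lists whose intersection is the mutual-friend set.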

Finally, the loaded dataset is read from the input file, the mapper and reducer functions are run, and the output data is stored in another file.

DATASETS USED

No specific dataset is used for this problem. An input file was created on our own, and all the work was done on it.

WORKFLOW

MAPPER PHASE

REDUCER PHASE

MAIN CLASS

PARAMETERS

INPUT

OUTPUT

CONFIGURATIONS

We have made changes to the build.sbt file, as shown in the screenshot below.

EVALUATION

Apache Spark in Scala needs fewer lines of code than an equivalent MapReduce program and runs faster.

CONTRIBUTIONS

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Niteesha was responsible for loading the dataset and implementing the code. However, each of us has a thorough understanding of every problem.

CONCLUSION

There is a clear time difference: Spark programs in Scala save time and run faster than traditional MapReduce programs.

QUESTION 2

Objective-

There are three datasets given. Using one of them,

  1. Create a Spark DataFrame and apply the appropriate StructType schema.

  2. Write 10 queries to show different patterns or operations on the dataset.

  3. Perform any 5 of the queries in both Spark RDDs and Spark DataFrames.

Datasets used-

The following dataset is used for solving this problem.

1. FIFA World Cup: https://www.kaggle.com/abecklas/fifa-world-cup#WorldCupMatches.csv

Workflow-

Part a -

The basic code for creating a Spark DataFrame from a dataset involves importing the dataset from its location and specifying the file format. The dataset considered here is the FIFA World Cup dataset.

The following snippet shows the code for creating a Spark DataFrame and the StructType schema for the dataset.

Here the different fields are defined using StructField within a StructType, which supports the different queries below.
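As an illustration, a schema of this kind can be declared as follows; the column names and types below are assumptions for this sketch, not necessarily the exact fields used in the lab:

```scala
import org.apache.spark.sql.types._

// Hypothetical subset of a WorldCupMatches.csv schema (illustration only).
val matchSchema = StructType(Seq(
  StructField("Year", IntegerType, nullable = true),
  StructField("Stadium", StringType, nullable = true),
  StructField("HomeTeamName", StringType, nullable = true),
  StructField("AwayTeamName", StringType, nullable = true),
  StructField("Attendance", LongType, nullable = true)
))

// The schema would then be supplied when reading the CSV, e.g.:
// val matches = spark.read.option("header", "true")
//   .schema(matchSchema).csv("WorldCupMatches.csv")
```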

Part b-

QUERY 1: The following snippet shows a query which is used to display the top countries with the maximum attendance.

The following snippet shows the output of the above query
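The aggregation behind this query can be sketched on plain Scala collections; the attendance figures below are made up purely for illustration (the lab runs the real query on the DataFrame):

```scala
// (country, attendance) pairs; the numbers are invented for illustration.
val attendance = Seq(
  ("Brazil", 90000L), ("Brazil", 74000L),
  ("Italy", 60000L), ("Uruguay", 93000L)
)

// Group by country, sum the attendance, and sort in descending order --
// the same aggregate the DataFrame query expresses.
val topCountries = attendance
  .groupBy(_._1)
  .map { case (country, rows) => country -> rows.map(_._2).sum }
  .toSeq
  .sortBy(-_._2)
```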

QUERY 2: The following snippet shows a query which is used to show the number of times Italy was the winner.

The following snippet shows the output of the above query

QUERY 3: The following snippet shows a query which is used to display the names of the countries with their goals in descending order, along with the year in which they occurred.

The following snippet shows the output of the above query

QUERY 4: The following snippet shows a query which is used to display the list of countries that scored more than 50 goals.

The following snippet shows the output of the above query

QUERY 5: The following snippet shows a query which is used to show the number of matches played in each year.

The following snippet shows the output of the above query

QUERY 6: The following snippet shows a query which performs a join operation on the players and matches datasets, combining a few columns from each.

The following snippet shows the output of the above query

QUERY 7: The following snippet shows a query that is used to count how many times a particular person was the coach of a particular country. The country selected here is the USA.

The following snippet shows the output of the above query

QUERY 8: The following snippet shows a query that is used to show the number of matches played in each stadium, along with its name where listed.

The following snippet shows the output of the above query

QUERY 9: The following snippet shows a query that is used to list the names of the captains of all countries.

The following snippet shows the output of the above query

QUERY 10: The following snippet shows a query that is used to list all data for the games played by Brazil in the year 1930.

The following snippet shows the output of the above query

Part c-

QUERY 1: The following snippet shows a query that is used to display the data for the year 1998 from the entire dataset. It consists of the code using a DataFrame and using a DataFrame SQL query.

The following snippet shows the output of the above query

QUERY 2: The following snippet shows a query that is used to show the number of times each country won the World Cup, in descending order.

The following snippet shows the output of the above query
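The shape of this count-and-sort can be sketched on plain Scala collections (the winner values below are invented for illustration):

```scala
// Winner column values, made up for illustration.
val winners = Seq("Brazil", "Italy", "Brazil", "Uruguay", "Brazil", "Italy")

// Count the wins per country and sort in descending order, mirroring
// the GROUP BY ... ORDER BY count DESC query on the DataFrame.
val winCounts = winners
  .groupBy(identity)
  .map { case (country, ws) => country -> ws.size }
  .toSeq
  .sortBy(-_._2)
```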

QUERY 3: The following snippet shows a query that is used to display the data of the countries with a goal score of 89.

The following snippet shows the output of the above query

QUERY 4: The following snippet shows a query that is used to display the maximum number of qualified teams in the World Cup.

The following snippet shows the output of the above query

QUERY 5: The following snippet shows a query that is used to show the data of the countries that finished in third place.

The following snippet shows the output of the above query

CONFIGURATIONS

The following modifications have been made to the build.sbt file.

EVALUATION

It is evident from the query executions above that DataFrame performance is far superior to that of RDDs. Some key comparisons are as follows.

Optimization: DataFrames have the built-in Catalyst optimizer, whereas RDDs do not.

Type Safety: DataFrames report errors only at runtime, whereas RDDs provide compile-time type safety.

Aggregation: Aggregation is faster on DataFrames than on RDDs.

Garbage Collection: DataFrames largely avoid garbage-collection overhead by storing data off-heap, whereas RDDs are subject to it.

CONCLUSION

Thus the use of StructType and the execution of various queries have been completed successfully.

CONTRIBUTIONS

We are a team of three and worked collaboratively throughout the lab assignment. Since this question consists of various queries, we divided them among us. As there are about 10 queries in part b, each of us discussed and understood every type of query we wrote. Niteesha was responsible for loading the dataset and implementing the code; Navya executed about half of the queries and Divya executed the remaining half. Each of us understands every query.


PROBLEM 3

OBJECTIVE

The main objective of this question is to perform word count on Twitter data using Spark.

APPROACH

In order to collect Twitter data and perform any actions on it, one must create a Twitter developer account. After creating the account and obtaining the required permissions from Twitter, one is granted access to the consumer key and access tokens. The following screenshot shows the tokens and keys generated and used to access the Twitter data.

Now that access to tweets is obtained, the Spark Streaming API is used to consume and stream the Twitter data. The Spark streaming context connects to Twitter and streams the data over a specified batch duration. Map and reduce operations are then performed on the streamed data to obtain the word count. In our code, we specifically collected hashtag data from tweets by using a filter; the word count is then performed on this filtered hashtag data.

WORKFLOW

Initially, a Spark configuration and Spark context are set. Then the consumer key, consumer secret, access token, and access-token secret are passed as an array through the program arguments.

The Twitter4j library is used to integrate the application with the Twitter services. With the help of its OAuth support, the system properties are set for the keys and access tokens.
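A minimal sketch of that wiring, assuming the standard twitter4j.oauth.* property names that Twitter4j reads (the credential strings below are placeholders, not real keys):

```scala
// Hypothetical helper that sets the JVM system properties Twitter4j
// looks up for OAuth; the argument values are placeholders.
def setTwitterKeys(consumerKey: String, consumerSecret: String,
                   accessToken: String, accessTokenSecret: String): Unit = {
  System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
  System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
  System.setProperty("twitter4j.oauth.accessToken", accessToken)
  System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
}

// In the lab the four values arrive through the program arguments.
setTwitterKeys("<consumerKey>", "<consumerSecret>",
               "<accessToken>", "<accessTokenSecret>")
```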

A streaming context is then created, and TwitterUtils makes use of Twitter4j to obtain the stream of Twitter data. The extracted stream has a filter that keeps only the tokens which begin with a hashtag (hashtags are quite popular among Twitter users).

Finally, map and reduceByKey operations are performed in order to apply the word-count logic to the streamed data.
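The word-count logic applied to each batch can be traced on plain Scala collections; the tweet texts below are made up, and on the real stream the same steps run as DStream transformations:

```scala
// One micro-batch of tweet texts, invented for illustration.
val batch = Seq(
  "loving #spark streaming with #scala",
  "more #spark today"
)

// Keep only hashtag tokens, then apply the classic word-count pattern:
// split into words, filter hashtags, and count occurrences per tag.
val hashtagCounts = batch
  .flatMap(_.split("\\s+"))
  .filter(_.startsWith("#"))
  .groupBy(identity)
  .map { case (tag, occurrences) => tag -> occurrences.size }
```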

DATASETS USED

No particular dataset is used here. Instead, permission to collect data from Twitter was obtained, and the operations were performed on the live stream.

CONFIGURATIONS

A few library dependencies have been added to the build.sbt file in order to enable Twitter data streaming.
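As a hedged sketch, the additions might look like the following; the exact artifacts and versions in the lab's build.sbt may differ (for Spark 2.x the Twitter connector lives in Apache Bahir):

```scala
// Hypothetical build.sbt additions; versions are placeholders.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "2.3.0",
  "org.apache.spark" %% "spark-streaming"         % "2.3.0",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.0",
  "org.twitter4j"    %  "twitter4j-core"          % "4.0.6"
)
```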

PARAMETERS

INPUT

The input data is Streamed Twitter data.

OUTPUT

To obtain the output of this program, the sbt shell is used and the keys and access tokens are passed as shown in the following screenshot. Once these keys are passed, the program starts running, the hashtags are streamed, and the word-count output is generated as follows.

EVALUATION

From the screenshot below, it is evident that the code streams the data and generates the word count smoothly. Additionally, the time for the entire process is quite low, which makes the code efficient.

CONCLUSION

The streaming of Twitter data (hashtags) and the subsequent word count were performed successfully.

CONTRIBUTION

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Navya was responsible for collecting the Twitter access tokens and implementing the code. However, each of us has a thorough understanding of every problem.

REFERENCES

  1. https://gist.github.com/varadharajan/ac4d30e415ea050b7407102778891bba
  2. https://www.coursera.org/lecture/big-data-analysis/counting-common-friends-part-i-r5EmL

PROBLEM 4

OBJECTIVE

The main objective of this problem is to build a graph and perform PageRank on it.

APPROACH

In this problem, GraphFrames is used to create the graph. The vertices and edges are derived from the given dataset.

WORKFLOW

Initially, a Spark context and its configuration are set as shown below.

For the generation of the graph, we considered two different CSV files from the provided datasets. The following screenshot depicts loading the two datasets and creating DataFrames.

In order to generate a graph, one requires vertices and edges. We took the simple approach of creating the graph by using groupid (unique values) as the vertices and the group1 and group2 columns from the group-edges.csv file as the edges. The following code snippet depicts the creation of the vertices and edges.

As can be seen in the snippet above, groupid is renamed to id, and the edge columns (group1, group2) are renamed to src and dst, the column names GraphFrames expects.

After the vertices and edges are created, the graph is built using GraphFrame as follows.

Finally, PageRank is run on the graph as shown below.
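The iteration that PageRank performs can be sketched on a tiny in-memory graph; this is plain Scala rather than GraphFrames, and the node ids, edges, and damping factor below are made up for illustration:

```scala
// Tiny directed graph (node ids stand in for group ids).
val edges = Seq(1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1)
val nodes = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
val outDeg = edges.groupBy(_._1).map { case (n, es) => n -> es.size }
val damping = 0.85

// Start from a uniform distribution and iterate the rank update.
var ranks = nodes.map(_ -> 1.0 / nodes.size).toMap
for (_ <- 1 to 20) {
  // Each node sends its current rank, split evenly, along its out-edges.
  val contribs = edges
    .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
    .groupBy(_._1)
    .map { case (n, cs) => n -> cs.map(_._2).sum }
  // Standard PageRank update with a damping factor.
  ranks = nodes
    .map(n => n -> ((1 - damping) / nodes.size + damping * contribs.getOrElse(n, 0.0)))
    .toMap
}
```

GraphFrames performs this same update at scale, attaching the resulting rank to each vertex.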

DATASETS USED

The following links point to the datasets we used for graph creation. For the graph vertices, this CSV file is used: https://www.kaggle.com/stkbailey/nashville-meetup#meta-groups.csv

For the graph edges, this CSV file is used: https://www.kaggle.com/stkbailey/nashville-meetup#group-edges.csv

PARAMETERS

INPUT

Two CSV files are given as input for forming the graph.

OUTPUT

A graph is generated and the PageRank algorithm is run on it.

EVALUATION

A graph was successfully created with the help of GraphFrame, and the PageRank results were generated for both the vertices and the edges.

CONCLUSION

Thus, the creation of a graph from the given datasets and the execution of the specified algorithm were successful.

CONTRIBUTION

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Divya was responsible for creating the graph and running the PageRank algorithm on it. However, each of us has a thorough understanding of every problem.

REFERENCES

  1. https://docs.databricks.com/spark/latest/graph-analysis/graphframes/graph-analysis-tutorial.html
  2. https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html

QUESTION 4 (PART II)

GraphX is Apache Spark's API for creating graphs. It is mainly used for distributed processing of graphs. For instance, if a very large graph, with a large number of vertices and edges, is difficult to process on a single machine, GraphX can be used to parallelize the computation. In our case, groupid is used to create the vertices, and group1 and group2 from the group-edges.csv file are used to create the edges. Finally, the graph is formed as seen above. This approach also provides high speed.