Module 2: Lab #1 - VidyullathaKaza/BigData_Programming_Spring2020 GitHub Wiki

PART-1: Find Facebook mutual friends using MapReduce on Spark.

Algorithm Implementation:

Map and reduce functions:

Map function:

For each user's friend list, a key-value pair is emitted per friend: the key is the (user, friend) pair sorted alphabetically, and the value is that user's friend list (the remaining friends).

Reduce function:

Pairs are grouped by key before the reduce task runs.

The two friend lists that share a key are intersected, and the intersection is the set of mutual friends for that pair.

Dataset Input:

Dataset Output:

Spark code runs faster than Hadoop MapReduce code because intermediate results are kept in memory rather than written to disk between stages.

Part-2: Spark Dataframes

The FIFA World Cup dataset is used to implement this part.

a.

A DataFrame is created by loading the dataset with an explicit schema built from various struct field types.

b.

Queries:

1. Number of teams in various stages

2. Goals scored in the range [2, 5] by home teams

3. Finding the attendance for the current day and the maximum attendance for that stadium

4. Left outer join

5. Pattern matching: playing in finals and homeTeamInitials like 'ITA'

6. Average goals scored by each team

7. Performing UNION ALL and displaying the row count

8. FULL JOIN

9. Number of times a country scored more than 6 goals

10. Goals scored by countries in a particular period

11. Total goals scored by each country
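A couple of these queries can be sketched in plain Python over toy rows (the column names `HomeTeamName` and `HomeTeamGoals` are assumptions mirroring the FIFA dataset; in Spark these would be `filter` and `groupBy().avg()` calls on the DataFrame):

```python
from collections import defaultdict

# Toy rows (illustrative; column names are assumptions).
matches = [
    {"HomeTeamName": "Italy",  "HomeTeamGoals": 2},
    {"HomeTeamName": "Italy",  "HomeTeamGoals": 4},
    {"HomeTeamName": "Brazil", "HomeTeamGoals": 3},
    {"HomeTeamName": "Brazil", "HomeTeamGoals": 1},
]

# Query 2: matches where the home team scored in the range [2, 5].
in_range = [m for m in matches if 2 <= m["HomeTeamGoals"] <= 5]

# Query 6: average goals scored by each team.
totals = defaultdict(lambda: [0, 0])  # team -> [goal sum, match count]
for m in matches:
    t = totals[m["HomeTeamName"]]
    t[0] += m["HomeTeamGoals"]
    t[1] += 1
averages = {team: s / n for team, (s, n) in totals.items()}
```

With the toy rows above, `in_range` keeps three of the four matches and `averages` maps Italy to 3.0 and Brazil to 2.0.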

c.

Queries on Spark RDDs and Spark DataFrames

An RDD is created and displayed.

A DataFrame is then created from the RDD by specifying the underlying schema.

1. Displaying total goals by each nation in descending order

2. Displaying the years in which the host country won

3. Displaying the row for a particular year

4. Selecting the countries that have played the most matches
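The host-wins query above can be sketched in plain Python (the field names are assumptions; in Spark this would be a `filter` comparing the host and winner columns). The three rows are real early tournaments: Uruguay and Italy won at home, while France hosted in 1938 but Italy won:

```python
# Toy rows (illustrative; field names are assumptions).
cups = [
    {"Year": 1930, "Host": "Uruguay", "Winner": "Uruguay"},
    {"Year": 1934, "Host": "Italy",   "Winner": "Italy"},
    {"Year": 1938, "Host": "France",  "Winner": "Italy"},
]

# Years in which the host country won the tournament.
host_wins = [c["Year"] for c in cups if c["Host"] == c["Winner"]]
```

For these rows `host_wins` is `[1930, 1934]`.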

Comparison of results:

The output is the same for both RDDs and DataFrames.

For an RDD, the schema must be supplied explicitly, whereas a DataFrame infers the schema from the given data.

Part-3: Perform word count on Twitter streaming data

To use the Twitter Streaming API, a developer account is required; it provides the credentials needed to download real-time streaming data.

A stream is created and the incoming data is filtered.

The filtered data is split into words.
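The split-and-count step can be sketched in plain Python over toy micro-batches (standing in for a DStream's `flatMap` over words followed by a running count; the tweet text is made up for illustration):

```python
from collections import Counter

# Toy micro-batches of filtered tweet text (illustrative only).
batches = [
    ["spark streaming is fast", "spark is fun"],
    ["streaming word count with spark"],
]

# Split each tweet into words and keep a running count across
# batches, as flatMap over words plus a stateful count would.
counts = Counter()
for batch in batches:
    for tweet in batch:
        counts.update(tweet.lower().split())
```

With these batches, "spark" is counted three times and "streaming" twice.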

Output:

Part-4: Spark GraphX

The Wordgame dataset is used for this part.

Input:

a. PageRank: used to find the influential vertices in a graph.

The dataset is loaded and the vertices and edges are identified.

A GraphFrame is created from these vertices and edges.

Schema and vertices output:

Edges output:

PageRank is computed on the GraphFrame constructed above.
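The idea behind the computation can be sketched with a simplified PageRank iteration in plain Python (a toy three-vertex graph; the damping factor 0.85 matches the GraphX default, while the real run would call the GraphFrame's `pageRank` method):

```python
# Toy directed graph (illustrative): vertex -> list of out-neighbours.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
vertices = list(edges)

# Simplified PageRank iteration with damping factor d = 0.85.
# Each vertex shares its rank equally among its out-neighbours,
# then ranks are recombined with the teleport term (1 - d).
d = 0.85
rank = {v: 1.0 for v in vertices}
for _ in range(30):
    contrib = {v: 0.0 for v in vertices}
    for v, outs in edges.items():
        share = rank[v] / len(outs)
        for u in outs:
            contrib[u] += share
    rank = {v: (1 - d) + d * contrib[v] for v in vertices}
```

Vertex "c" ends up ranked above "b" because it receives contributions from both "a" and "b", which is exactly the "influential vertex" signal PageRank measures.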

b. Importance of using GraphX

GraphX is Spark's API for parallel and iterative graph processing.

A GraphFrame is constructed from the vertices and edges.

References:

https://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank

https://www.toptal.com/apache/apache-spark-streaming-twitter

https://www.programcreek.com/scala/org.apache.spark.streaming.twitter.TwitterUtils

https://www.kaggle.com/anneloes/wordgame

https://www.edureka.co/community/12000/what-the-difference-between-rdd-and-dataframes-apache-spark