Module 2: Lab #1 - VidyullathaKaza/BigData_Programming_Spring2020 GitHub Wiki
PART-1: Find Facebook mutual friends using MapReduce on Spark.
Algorithm Implementation:
Map and reduce functions:
Map function:
For each person, a key-value pair is emitted per friend: the key is the pair (person, friend), sorted so that both friends produce the same key.
The person's remaining friends are present in the value.
Reduce function:
Grouping of keys is done before the reduce task, so each pair key receives both friend lists.
The intersection of the lists arriving at the same key gives the mutual friends, which is the final result.
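The map and reduce steps above can be sketched in plain Python (the sample friend lists are hypothetical, and the grouping step stands in for Spark's shuffle):

```python
from itertools import chain

# Hypothetical sample input: person -> list of friends
friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "E": ["B"],
}

def map_phase(person, friend_list):
    # Emit (sorted pair, friend set) for every friend of this person,
    # so both ends of a friendship produce the same key
    for friend in friend_list:
        key = tuple(sorted((person, friend)))
        yield key, set(friend_list)

def reduce_phase(pairs):
    # Group values by key (Spark does this in the shuffle),
    # then intersect the two friend lists arriving at each pair key
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: set.intersection(*values) for key, values in grouped.items()}

mutual = reduce_phase(chain.from_iterable(
    map_phase(p, fl) for p, fl in friends.items()))
print(sorted(mutual[("A", "B")]))  # → ['C', 'D']
```

A's friends {B, C, D} intersected with B's friends {A, C, D, E} give C and D as the mutual friends of the pair (A, B).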
Dataset Input:
Dataset Output:
Spark code runs faster than Hadoop MapReduce code because Spark keeps intermediate data in memory instead of writing it to disk between stages.
Part-2: Spark Dataframes
The FIFA dataset is used for this part.
a.
A DataFrame is created by loading the dataset with a schema containing various struct types.
b.
Queries:
1. Number of teams in various stages
2. Goals scored in the range [2, 5] by home teams
3. Finding the attendance for the current day and the maximum attendance of that stadium
4. Left outer join
5. Pattern matching: playing in finals and homeTeamInitials like 'ITA'
6. Average goals scored by each team
7. Performing UNION ALL and displaying the row count
8. FULL JOIN
9. Number of times a country scored more than 6 goals
10. Goals scored by countries in a particular period
11. Total goals scored by each country
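Query 6 above boils down to a group-by followed by an average; a minimal pure-Python sketch of that logic, using hypothetical sample rows in place of the FIFA data:

```python
from collections import defaultdict

# Hypothetical sample rows: (team, goals scored in one match)
rows = [("Italy", 2), ("Italy", 1), ("Brazil", 3), ("Brazil", 1)]

totals = defaultdict(lambda: [0, 0])  # team -> [goal sum, match count]
for team, goals in rows:
    totals[team][0] += goals
    totals[team][1] += 1

# Average = goal sum / match count per team
averages = {team: total / count for team, (total, count) in totals.items()}
print(averages)  # → {'Italy': 1.5, 'Brazil': 2.0}
```

In Spark the same result comes from grouping the DataFrame by team and applying an average aggregate; the sketch only shows the underlying computation.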
c.
Queries on Spark RDDs and Spark DataFrames
An RDD is created and displayed.
A DataFrame is created from the RDD by specifying the underlying schema.
1. Displaying total goals by each nation in descending order
2. Displaying the years in which the host country won
4. Displaying the row for a particular year
5. Selecting countries that have played more matches
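Query 1 in this list is a group-sum followed by a descending sort; the logic can be sketched in plain Python with hypothetical rows standing in for the dataset:

```python
from collections import Counter

# Hypothetical rows: (nation, goals scored in one match)
rows = [("Brazil", 3), ("Italy", 2), ("Brazil", 2), ("France", 1)]

# Sum goals per nation
totals = Counter()
for nation, goals in rows:
    totals[nation] += goals

# Sort nations by total goals, highest first
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # → [('Brazil', 5), ('Italy', 2), ('France', 1)]
```

With an RDD the same shape appears as a reduceByKey followed by a sort; with a DataFrame it is a groupBy, a sum aggregate, and an orderBy in descending order.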
Comparison of results:
The output is the same for both RDDs and DataFrames.
For an RDD, the schema must be supplied explicitly, whereas a DataFrame infers the schema from the given data.
Part-3: Perform word count on Twitter streaming data
To use the Twitter Streaming API, a developer account is required; it provides the credentials needed to download real-time streaming data.
A stream is created and the incoming data is filtered.
The filtered data is split into individual words.
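The per-batch word-count step is a simple split-and-count; a minimal pure-Python sketch with hypothetical tweet text (in Spark Streaming this runs on each micro-batch of the DStream):

```python
from collections import Counter

# Hypothetical filtered tweets from one micro-batch
tweets = ["spark streaming is fast", "spark makes streaming easy"]

# Split each tweet into lowercase words and count occurrences
words = (word for tweet in tweets for word in tweet.lower().split())
counts = Counter(words)
print(counts.most_common(2))  # → [('spark', 2), ('streaming', 2)]
```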
Output:
Part-4: Spark GraphX
Wordgame dataset is used for this.
Input:
a. PageRank: used to find influential vertices in a graph.
The dataset is loaded and the vertices and edges are identified.
A GraphFrame is created from the vertices and edges.
Schema and vertices output:
Edges output:
PageRank is calculated for the GraphFrame constructed above.
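The iteration PageRank performs can be sketched in plain Python on a hypothetical toy graph (every vertex here has at least one out-edge; 0.85 is the usual damping factor, and the ranks follow the GraphX convention of summing to the number of vertices):

```python
# Hypothetical toy graph: vertex -> list of outgoing neighbours
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
damping = 0.85
ranks = {v: 1.0 for v in edges}  # uniform initial ranks, summing to N

for _ in range(30):  # fixed number of power iterations
    contrib = {v: 0.0 for v in edges}
    for src, dests in edges.items():
        share = ranks[src] / len(dests)  # spread rank over out-edges
        for dst in dests:
            contrib[dst] += share
    ranks = {v: (1 - damping) + damping * contrib[v] for v in edges}

print({v: round(r, 3) for v, r in ranks.items()})
```

Vertex c ends up with the highest rank because both a and b link to it; GraphX performs the same computation in parallel across the partitions of the graph.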
b. Importance of using GraphX
GraphX is Spark's API for parallel and iterative graph processing.
A GraphFrame is constructed from the vertices and edges.
References:
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank
https://www.toptal.com/apache/apache-spark-streaming-twitter
https://www.programcreek.com/scala/org.apache.spark.streaming.twitter.TwitterUtils
https://www.kaggle.com/anneloes/wordgame
https://www.edureka.co/community/12000/what-the-difference-between-rdd-and-dataframes-apache-spark