Module 2: Lab #1 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
YouTube Link explaining the Lab work can be found here
The report for the Lab work is here
The code for this lab work can be found here
Objective
Understanding Hadoop MapReduce Algorithm and Spark Data Frames better and perform complex intuitive queries on the created database.
Features
- Run the MapReduce job on Apache Spark.
- Create a report including your algorithm and result screenshots to find Facebook common friends.
- Create a Spark DataFrame using one of datasets.
- Perform 10 intuitive queries in Database.
- Perform any 5 queries in Spark RDD’s and Spark Data Frames.
- Compare the results.
Steps:
Part 1: Hadoop MapReduce Algorithm Use Case
Execution:
Input:
Output:
Part 2: Spark Data Frames Use Case
here
Another dataset needs to be created to perform some operations such as Join, Union etc on the given Dataset. Created dataset can be foundPart 2.a) Create a Spark DataFrame:
Query: cricket_df = spark.read.option("header", "true").csv("/Users/snehamishra/Downloads/cric.csv")
here
The Cricket DataFrame can be foundhere
The football DataFrame can be foundFootball DataFrame:
Cricket DataFrame:
Part 2.b) Perform 10 intuitive queries:
Query 1: Total count of football World Cups
Query
count = football_df.count()
Output -
Query 2: Maximum number of world cup wins by any team
Query
football_df.groupBy(['Winner']).count().orderBy("count", ascending=False).show(1)
Output -
Query 3: Most un-lucky team(country) in the history of World Cups (Runner up)
Query
football_df.groupBy(['Football Runners-Up']).count().orderBy("count", ascending=False).show()
Output -
Query 4: Most number of times a country has hosted
Query
football_df.groupBy(['Countries']).count().orderBy("count", ascending=False).show(1)
Output -
Query 5: Percentage Increase in the countries playing the soccer
Query -
Output -
Query 6: Union of Football Dataframe with Cricket Dataframe
Query -
Output -
Query 7: Joining of Football Dataframe with Cricket Dataframe on Host Countries
Query -
Output -
Query 8: When the football winner and the cricket winner were same team
Query
football_df.join(cricket_df, football_df['Football Winner'] == cricket_df['Cricket Runner-Up']).show(1)
Output -
Query 9: Difference in years between the 1st football world cup and the first Cricket World Cup
Query -
Output -
Query 10: Number of distinct countries who have won the world cup
Query -
Output -
Part 2.c) Perform any 5 queries in Spark RDD’s and Spark Data Frames:
Query 1: Join of Football Dataframe with Cricket Dataframe on Host Countries
Query - footballrdd.join(cricketrdd, "Countries").rdd.saveAsTextFile("joinop")
Output -
Query 2: Union Football Dataframe with Cricket Dataframe
Query - footballrdd.unionAll(cricketrdd).rdd.saveAsTextFile("unionop")
Output -
Query 3: distinct countries that have won the Football World Cup
Query - footballrdd.unionAll(cricketrdd).rdd.saveAsTextFile("unionop")
Output -
Query 4: Count of Number of World Cups
Query - print(footballrdd.count())
Output -
Query 5: When the football winner and the cricket winner were same team
Query - print(footballrdd.select('Football Winner').distinct().count())
Output -
Output Folder:
References:
FIFA World Cup Dataset:
https://www.kaggle.com/abecklas/fifa-world-cup#WorldCupMatches.csv
Kickstarter Projects Dataset:
https://www.kaggle.com/kemical/kickstarter-projects