CASE2 - RoshiniVarada/BDP_Project2 GitHub Wiki
Team Members and collaboration:
Roshini Varada -- Hadoop MapReduce Algorithm
Sarika Reddy Kota -- Spark DataFrames
Pallavi Arikatla -- Spark Streaming task
Zakari, Abdulmuhaymin -- Spark GraphX task
Idea:
i) Create a DataFrame from the given CSV file and run Spark queries on it for pattern recognition and topic discussion on DataFrames.
ii) Perform any 5 queries using Spark RDDs and Spark DataFrames.
Usage of Project:
i) A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
ii) DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
Software Required:
Python 3
PySpark
Jupyter Notebook
GitHub
Implementation:
- For the initial setup, install pyspark and findspark, then initialize findspark.
a) Create a Spark DataFrame using one of the datasets and try to use all the different StructTypes.
- First, create a DataFrame from the World Cup matches dataset.
- Rename the columns, then print the schema.
- Change the required columns to the integer data type.
b) Perform 10 intuitive queries on the dataset
QUERY 1:
To find the top 5 home teams with the highest number of goals.
QUERY 2:
To find the top 10 matches with the highest number of goals.
QUERY 3:
To find the number of matches held each year.
QUERY 4:
To find the number of matches in each group stage.
QUERY 5:
To find the number of matches won by each team.
QUERY 6:
To find the matches that went to extra time.
QUERY 7:
To find the top 5 stadiums with the highest number of spectators at the matches.
QUERY 8:
To find the matches held on 8 June 1958.
QUERY 9:
To find the maximum of each type of goal scored, year by year.
QUERY 10:
To describe each type of goal, giving the count, mean, min, and max.
QUERY 11:
To find the number of matches played in the year 1954.
c) Perform any 5 queries in Spark RDDs and Spark DataFrames.
- Create an RDD from the World Cup matches dataset and run the queries on the RDD and the DataFrame.
QUERY 1:
To find the total number of goals scored by the winning team across all matches.
QUERY 2:
To find the number of times each country has won the cup.
QUERY 3:
To find the countries with the highest number of matches played across all years.
QUERY 4:
To find the records where the winner and the country are the same.
QUERY 5:
To find the distinct countries in the records.
QUERY 6:
To fetch the records where the country is USA.
Challenges Faced:
It was a bit challenging to write queries that differ meaningfully from one another. I tried to make each of the queries above distinct from the rest.
Milestones and Integration of the Project:
We successfully completed this project, which has four use cases. Each of us chose one use case and worked on it independently, so dividing the work was not difficult.
Links of other Cases:
Use-case1: Hadoop MapReduce Algorithm
Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-1
Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE1
Use-case3: Spark Streaming
Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-3
Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE3
Use-case4: Spark GraphX
Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-4
Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE4