CASE2 - RoshiniVarada/BDP_Project2 GitHub Wiki

Team Members and collaboration:

Roshini Varada -- Hadoop MapReduce Algorithm

Sarika Reddy Kota -- Spark DataFrames

Pallavi Arikatla -- Spark Streaming task

Zakari, Abdulmuhaymin -- Spark GraphX task

Idea:

i) Create a DataFrame from the given CSV file and run Spark queries on it for pattern recognition and topic discussion.

ii) Perform any 5 queries using Spark RDDs and Spark DataFrames.

Usage of Project:

i) A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.

ii) DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.

Software Required:

Python 3

PySpark

Jupyter Notebook

GitHub

Implementation:

  • For the initial setup, install pyspark and findspark, then initialize findspark.

a) Create a Spark DataFrame using one of the datasets and try out the different StructType field types.

  • First, create a DataFrame from the World Cup matches dataset.

  • Rename the columns and print the schema.

  • Then cast the required columns to the integer data type.

b) Answer 10 intuitive questions about the dataset

QUERY 1:

To find the top 5 home teams with the highest number of goals.

QUERY 2:

To find the top 10 matches with the highest number of goals.

QUERY 3:

To find the number of matches held in each year.

QUERY 4:

To find the number of matches in each group stage.

QUERY 5:

To find the teams with the count of winning matches.

QUERY 6:

To find the matches that went to extra time.

QUERY 7:

To find the top 5 stadiums with the highest number of spectators at the matches.

QUERY 8:

To find the matches that were held on 8th June 1958.

QUERY 9:

To find the maximum of each type of goals scored in each year.

QUERY 10:

To describe each type of goals, giving the count, mean, min, and max.

QUERY 11:

To find the number of matches played in the year 1954.

c) Perform any 5 queries in Spark RDDs and Spark DataFrames.

  • Create an RDD from the World Cup dataset and run the queries on both the RDD and the DataFrame.

QUERY 1:

To find the total number of goals scored by the winning teams across all the matches.

QUERY 2:

To find the number of times each country has won the cup.

QUERY 3:

To find the countries with the highest number of matches played across all the years.

QUERY 4:

To find the records where the winner and the country are the same.

QUERY 5:

To find the distinct countries in the records.

QUERY 6:

To fetch records where country is USA.

Challenges Faced:

It was a bit challenging to write queries that were clearly distinct from one another, so I tried to make each of the queries above different from the rest.

Milestones and Integration of the Project:

We successfully completed this project, which has four use cases. Each team member chose one use case and worked on it independently, so integrating the work was not difficult.

Links of other Cases:

Use-case1: Hadoop MapReduce Algorithm

Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-1

Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE1

Use-case3: Spark Streaming

Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-3

Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE3

Use-case4: Spark GraphX

Project-link- https://github.com/RoshiniVarada/BDP_Project2/tree/master/CASE-4

Wiki-link- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE4

VIDEO LINK:

https://www.youtube.com/watch?v=Ii-m_Zb3QCs