LAB ASSIGNMENT 2 - bhargavi1411/BigDataProgramming GitHub Wiki

Team Member 1 : Madhuri Sarode

Class id : 24

Team Member 2 : Bhargavi Saipoojitha Chennupati

Class id : 4

Team Member 3 : Bhavana Deepthi Kota

Class id : 16

Task 1 :

Implement a MapReduce algorithm for the Facebook common-friends problem and run the MapReduce job on Apache Spark.
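The map and reduce steps of the common-friends problem can be sketched in plain Python, without Spark. This is a minimal illustration, not the code submitted for the lab: the function names (`map_phase`, `reduce_phase`, `common_friends`) and the `sample` friend lists are hypothetical. The map step emits each friendship pair as a sorted key together with the emitter's friend set, and the reduce step intersects the two sets that arrive for each pair.

```python
from collections import defaultdict


def map_phase(person, friends):
    """Emit (sorted pair, friend set of `person`) for every friend of `person`."""
    friend_set = set(friends)
    for friend in friends:
        # Sorting the pair makes (A, B) and (B, A) land on the same key.
        yield tuple(sorted((person, friend))), friend_set


def reduce_phase(friend_sets):
    """Intersect the friend sets emitted for one pair."""
    return set.intersection(*friend_sets)


def common_friends(adjacency):
    """Run map, shuffle (group by key), and reduce over an adjacency dict."""
    grouped = defaultdict(list)  # stands in for the shuffle step
    for person, friends in adjacency.items():
        for pair, friend_set in map_phase(person, friends):
            grouped[pair].append(friend_set)
    # A mutual friendship produces exactly two sets per pair.
    return {
        pair: sorted(reduce_phase(sets))
        for pair, sets in grouped.items()
        if len(sets) == 2
    }


# Hypothetical input: each person mapped to their friend list.
sample = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

result = common_friends(sample)
for pair, shared in sorted(result.items()):
    print(pair, shared)
```

On Spark, the same shape would typically map to a `flatMap` for the map step and a `reduceByKey` for the intersection.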

Code Screenshots :

Input :

Output :

Task 2 :

a. Create a Spark DataFrame using one of the datasets and try to use all the different StructType types.

b. Ask 10 intuitive questions of the dataset (e.g., pattern recognition, topic discussion, most important terms, etc.). Use your innovation to think outside the box.

c. Perform any 5 queries on Spark RDDs and Spark DataFrames and compare the results (e.g., selecting the number of times Brazil won the World Cup, selecting Argentina's World Cup statistics, etc.).

Output :

Task 3 :

Perform Word-Count on Twitter Streaming Data using Spark.

Code Screenshots :

The tweets are streamed by a Python program. They are restricted to English and to keywords such as spark and hadoop, and are sent over a socket on localhost port 9999.

The Scala program that reads the tweets from the port is shown below. It listens on port 9999, reads the tweets, splits each line on the space delimiter " ", and counts the resulting words.

Each word is printed with the number of times it appeared in the tweets.
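The split-and-count step described above can be sketched in plain Python. This is only an illustration of the counting logic applied to one batch of lines, not the Scala streaming program from the lab: the `count_words` helper and the `batch` lines are hypothetical, and reading from the socket is left out.

```python
from collections import Counter


def count_words(lines):
    """Split each line on a single space and count every word."""
    counts = Counter()
    for line in lines:
        # split(" ") mirrors the space delimiter used in the lab;
        # empty tokens from repeated spaces are dropped.
        counts.update(word for word in line.split(" ") if word)
    return counts


# Hypothetical batch of tweet texts received from the socket.
batch = [
    "spark makes streaming simple",
    "hadoop and spark",
]

word_counts = count_words(batch)
for word, count in word_counts.most_common():
    print(word, count)
```

In a Spark Streaming job, the same logic is usually expressed per batch as a `flatMap` over the lines followed by a `reduceByKey` on `(word, 1)` pairs.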

Task 4 :

Perform the following tasks

a. Perform Page Rank.

b. State the importance of using GraphX on the chosen dataset.

Code Screenshots :

Output :

Importance of using GraphX on the chosen dataset (group-data, in Scala) :

GraphX is Apache Spark's API for graphs and graph-parallel computation. The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.

GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. The API lets users view the same data both as graphs and as collections (i.e., RDDs) without data movement or duplication, and transform and join graphs with RDDs efficiently.

The library provides graph operators such as subgraph, joinVertices, and aggregateMessages to transform graph data, and several ways of building a graph from a collection of vertices and edges in an RDD or on disk. GraphX also includes a number of graph algorithms and builders for graph analytics tasks.
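The Page Rank computation from task a can be sketched in plain Python over a collection of directed edges. This is a simplified power iteration, not GraphX's implementation and not the code used in the lab: the `pagerank` function, the damping factor of 0.85, the iteration count, and the `edges` list are all assumptions made for illustration. Ranks here are normalized to sum to 1, which may differ from how a given library scales its output.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute normalized PageRank scores from a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {v: base for v in vertices}
        # Each vertex sends an equal share of its rank along every out-edge.
        for src in vertices:
            for dst in out_links[src]:
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        ranks = new_ranks

    return ranks


# Hypothetical directed graph: 1 -> 2 -> 3 -> 1, plus 4 -> 3.
edges = [(1, 2), (2, 3), (3, 1), (4, 3)]

ranks = pagerank(edges)
for vertex, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(vertex, round(score, 4))
```

Vertex 3 ends up ranked highest because it receives rank from two vertices, while vertex 4 has no incoming edges and keeps only the baseline share.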