LAB 2 WIKI - navyagonug/CS5590-BIG-DATA-PROGRAMMING-USING-HADOOP-AND-SPARK GitHub Wiki

QUESTION 1

OBJECTIVE

Finding Facebook common friends using Apache Spark. In this approach we find the common friends between pairs of people. Take a basic friend list of 4 persons:

Ritika          Jaya, Lohitha, Aishu

Jaya            Lohitha, Aishu

Lohitha         Ritika, Aishu

Aishu           Jaya, Ritika

Here we can see that Ritika has Jaya, Lohitha, and Aishu in her friend list, and Jaya has Lohitha and Aishu in hers. The common friends between Ritika and Jaya are therefore Lohitha and Aishu. As explained, we compute the common friends for the rest of the pairs in the same manner.

The computation is done in two phases: a mapper phase and a reducer phase. In the mapper function, each line is first split on " ". The mapper then emits a key and value for each friend. The slice method is used on the words: it takes a start and end index (here 1 to the size of the words) and returns a new collection with the elements in that range, i.e. the person's friend list. From there the mapping starts, and the key/value pairs are passed on to the reducer phase.

Next comes the reducer phase, where the data is grouped by key. The grouped lists are then intersected to produce the mutual friends; an accumulator is used to perform the intersection.
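As a hypothetical sketch (not the exact lab code), the two phases can be traced on plain Scala collections, assuming a symmetric version of the friend lists above:

```scala
// Symmetric version of the sample friend lists (illustration only).
val friends = Map(
  "Ritika"  -> Set("Jaya", "Lohitha", "Aishu"),
  "Jaya"    -> Set("Ritika", "Lohitha", "Aishu"),
  "Lohitha" -> Set("Ritika", "Jaya", "Aishu"),
  "Aishu"   -> Set("Ritika", "Jaya", "Lohitha")
)

// Mapper: for every (person, friend) edge, emit the sorted pair as the key
// and the person's full friend list as the value.
val mapped = for {
  (person, list) <- friends.toSeq
  friend         <- list.toSeq
  key = if (person < friend) (person, friend) else (friend, person)
} yield key -> list

// Reducer: group by the pair key and intersect the grouped lists; the
// intersection is exactly the set of mutual friends of that pair.
val mutual = mapped
  .groupBy(_._1)
  .map { case (pair, entries) => pair -> entries.map(_._2).reduce(_ intersect _) }
```

Because the pair key is sorted, both persons' emissions meet under the same key, so the reducer sees exactly the two friend lists whose intersection is the mutual-friend set.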

Finally, the loaded dataset is read from the input file, the mapper and reducer functions are run, and the output data is stored in another file.

DATASETS USED

No specific dataset is used for this problem. An input file was created on our own, and all the work was done on it.

WORKFLOW

MAPPER PHASE

REDUCER PHASE

MAIN CLASS

PARAMETERS

INPUT

OUTPUT

CONFIGURATIONS

We have made changes to the build.sbt file, as shown in the screenshot below.

EVALUATION

Apache Spark in Scala needs fewer lines of code than an equivalent MapReduce program and runs faster.

CONTRIBUTIONS

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Niteesha was responsible for loading the dataset and implementing the code. However, each of us has a thorough understanding of every problem.

CONCLUSION

There is a clear time difference: Spark programs in Scala save time and run faster than traditional MapReduce programs.

QUESTION 2

Objective-

There are three datasets given. Using one of them,

  1. Create a Spark DataFrame and apply the appropriate StructType schema.

  2. Write 10 queries to show different patterns or operations on the dataset.

  3. Perform any 5 of the queries in both Spark RDDs and Spark DataFrames.

Datasets used-

The following dataset is used for solving this problem.

1. FIFA World Cup: https://www.kaggle.com/abecklas/fifa-world-cup#WorldCupMatches.csv

Workflow-

Part a -

The basic code for creating a Spark DataFrame from a dataset involves importing the dataset from its location and specifying the file format. The dataset considered here is the FIFA World Cup dataset.

The following snippet shows the code for creating a Spark DataFrame and the StructType schema for the dataset.

Here the different fields are defined using StructField within a StructType, which supports the different queries below.
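As an illustration, a schema of this kind can be declared as follows; the column names and types below are assumptions for this sketch, not necessarily the exact fields used in the lab:

```scala
import org.apache.spark.sql.types._

// Hypothetical subset of a WorldCupMatches.csv schema (illustration only).
val matchSchema = StructType(Seq(
  StructField("Year", IntegerType, nullable = true),
  StructField("Stadium", StringType, nullable = true),
  StructField("HomeTeamName", StringType, nullable = true),
  StructField("AwayTeamName", StringType, nullable = true),
  StructField("Attendance", LongType, nullable = true)
))

// The schema would then be supplied when reading the CSV, e.g.:
// val matches = spark.read.option("header", "true")
//   .schema(matchSchema).csv("WorldCupMatches.csv")
```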

Part b-

QUERY 1: The following snippet shows a query which is used to display the top countries with the maximum attendance.

The following snippet shows the output of the above query
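The aggregation behind this query can be sketched on plain Scala collections; the attendance figures below are made up purely for illustration (the lab runs the real query on the DataFrame):

```scala
// (country, attendance) pairs; the numbers are invented for illustration.
val attendance = Seq(
  ("Brazil", 90000L), ("Brazil", 74000L),
  ("Italy", 60000L), ("Uruguay", 93000L)
)

// Group by country, sum the attendance, and sort in descending order --
// the same aggregate the DataFrame query expresses.
val topCountries = attendance
  .groupBy(_._1)
  .map { case (country, rows) => country -> rows.map(_._2).sum }
  .toSeq
  .sortBy(-_._2)
```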

QUERY 2: The following snippet shows a query which is used to show the number of times Italy was the winner.

The following snippet shows the output of the above query

QUERY 3: The following snippet shows a query which is used to display the names of the countries with their goals in descending order, along with the year in which they occurred.

The following snippet shows the output of the above query

QUERY 4: The following snippet shows a query which is used to display the list of countries that scored more than 50 goals.

The following snippet shows the output of the above query

QUERY 5: The following snippet shows a query which is used to show the number of matches played in each year.

The following snippet shows the output of the above query

QUERY 6: The following snippet shows a query which performs a join operation on the players and matches datasets, combining a few columns from each.

The following snippet shows the output of the above query

QUERY 7: The following snippet shows a query that is used to count how many times a particular person was the coach of a particular country. The country selected here is the USA.

The following snippet shows the output of the above query

QUERY 8: The following snippet shows a query that is used to show the number of matches played in each stadium, along with its name where listed.

The following snippet shows the output of the above query

QUERY 9: The following snippet shows a query that is used to list the names of the captains of all countries.

The following snippet shows the output of the above query

QUERY 10: The following snippet shows a query that is used to list all data for the games played by Brazil in the year 1930.

The following snippet shows the output of the above query

Part c-

QUERY 1: The following snippet shows a query that is used to display the data for the year 1998 from the entire dataset. It consists of the code using a DataFrame and using a DataFrame SQL query.

The following snippet shows the output of the above query

QUERY 2: The following snippet shows a query that is used to show the number of times each country won the World Cup, in descending order.

The following snippet shows the output of the above query
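The shape of this count-and-sort can be sketched on plain Scala collections (the winner values below are invented for illustration):

```scala
// Winner column values, made up for illustration.
val winners = Seq("Brazil", "Italy", "Brazil", "Uruguay", "Brazil", "Italy")

// Count the wins per country and sort in descending order, mirroring
// the GROUP BY ... ORDER BY count DESC query on the DataFrame.
val winCounts = winners
  .groupBy(identity)
  .map { case (country, ws) => country -> ws.size }
  .toSeq
  .sortBy(-_._2)
```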

QUERY 3: The following snippet shows a query that is used to display the data of the countries with a goal score of 89.

The following snippet shows the output of the above query

QUERY 4: The following snippet shows a query that is used to display the maximum number of qualified teams in the World Cup.

The following snippet shows the output of the above query

QUERY 5: The following snippet shows a query that is used to show the data of the countries that finished in third place.

The following snippet shows the output of the above query

CONFIGURATIONS

The following modifications have been made to the build.sbt file.

EVALUATION

It is evident from the query executions above that DataFrame performance is far superior to that of RDDs. Some key comparisons are as follows.

Optimization: DataFrames have the built-in Catalyst optimizer, whereas RDDs do not.

Type Safety: DataFrames report errors only at runtime, whereas RDDs provide compile-time type safety.

Aggregation: Aggregation is faster on DataFrames than on RDDs.

Garbage Collection: DataFrames largely avoid garbage-collection overhead by storing data off-heap, whereas RDDs are subject to it.

CONCLUSION

Thus the use of StructType and the execution of various queries have been completed successfully.

CONTRIBUTIONS

We are a team of three and worked collaboratively throughout the lab assignment. Since this question consists of various queries, we divided them among us. As there are about 10 queries in part b, each of us discussed and understood every type of query we wrote. Niteesha was responsible for loading the dataset and implementing the code; Navya executed about half of the queries and Divya executed the remaining half. Each of us understands every query.


PROBLEM 3

OBJECTIVE

The main objective of this question is to perform word count on Twitter data using Spark.

APPROACH

In order to collect Twitter data and perform any actions on it, one must create a Twitter developer account. After creating the account and obtaining the required permissions from Twitter, one is granted access to the consumer key and access tokens. The following screenshot shows the tokens and keys generated and used to access the Twitter data.

Now that access to tweets is obtained, the Spark Streaming API is used to consume and stream the Twitter data. The Spark streaming context connects to Twitter and streams the data over a specified batch duration. Map and reduce operations are then performed on the streamed data to obtain the word count. In our code, we specifically collected hashtag data from tweets by using a filter; the word count is then performed on this filtered hashtag data.

WORKFLOW

Initially, a Spark configuration and Spark context are set. Then the consumer key, consumer secret, access token, and access-token secret are passed as an array through the program arguments.

The Twitter4j library is used to integrate the application with the Twitter services. With the help of its OAuth support, the system properties are set for the keys and access tokens.
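A minimal sketch of that wiring, assuming the standard twitter4j.oauth.* property names that Twitter4j reads (the credential strings below are placeholders, not real keys):

```scala
// Hypothetical helper that sets the JVM system properties Twitter4j
// looks up for OAuth; the argument values are placeholders.
def setTwitterKeys(consumerKey: String, consumerSecret: String,
                   accessToken: String, accessTokenSecret: String): Unit = {
  System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
  System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
  System.setProperty("twitter4j.oauth.accessToken", accessToken)
  System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
}

// In the lab the four values arrive through the program arguments.
setTwitterKeys("<consumerKey>", "<consumerSecret>",
               "<accessToken>", "<accessTokenSecret>")
```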

A streaming context is then created, and TwitterUtils makes use of Twitter4j to obtain the stream of Twitter data. The extracted stream has a filter that keeps only the tokens which begin with a hashtag (hashtags are quite popular among Twitter users).

Finally, map and reduceByKey operations are performed in order to apply the word-count logic to the streamed data.
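The word-count logic applied to each batch can be traced on plain Scala collections; the tweet texts below are made up, and on the real stream the same steps run as DStream transformations:

```scala
// One micro-batch of tweet texts, invented for illustration.
val batch = Seq(
  "loving #spark streaming with #scala",
  "more #spark today"
)

// Keep only hashtag tokens, then apply the classic word-count pattern:
// split into words, filter hashtags, and count occurrences per tag.
val hashtagCounts = batch
  .flatMap(_.split("\\s+"))
  .filter(_.startsWith("#"))
  .groupBy(identity)
  .map { case (tag, occurrences) => tag -> occurrences.size }
```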

DATASETS USED

No particular dataset is used here. Instead, permission to collect data from Twitter was obtained, and the operations were performed on the live stream.

CONFIGURATIONS

A few library dependencies have been added to the build.sbt file in order to enable Twitter data streaming.
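As a hedged sketch, the additions might look like the following; the exact artifacts and versions in the lab's build.sbt may differ (for Spark 2.x the Twitter connector lives in Apache Bahir):

```scala
// Hypothetical build.sbt additions; versions are placeholders.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "2.3.0",
  "org.apache.spark" %% "spark-streaming"         % "2.3.0",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.0",
  "org.twitter4j"    %  "twitter4j-core"          % "4.0.6"
)
```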

PARAMETERS

INPUT

The input data is Streamed Twitter data.

OUTPUT

To obtain the output of this program, the sbt shell is used and the keys and access tokens are passed as shown in the following screenshot. Once these keys are passed, the program starts running, the hashtags are streamed, and the word-count output is generated as follows.

EVALUATION

From the screenshot below, it is evident that the code streams the data and generates the word count smoothly. Additionally, the time for the entire process is quite low, which makes the code efficient.

CONCLUSION

The streaming of Twitter data (hashtags) and the subsequent word count were performed successfully.

CONTRIBUTION

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Navya was responsible for collecting the Twitter access tokens and implementing the code. However, each of us has a thorough understanding of every problem.

REFERENCES

  1. https://gist.github.com/varadharajan/ac4d30e415ea050b7407102778891bba
  2. https://www.coursera.org/lecture/big-data-analysis/counting-common-friends-part-i-r5EmL

PROBLEM 4

OBJECTIVE

The main objective of this problem is to build a graph and perform PageRank on it.

APPROACH

In this problem, GraphFrames is used to create the graph. The vertices and edges are derived from the given dataset.

WORKFLOW

Initially, a Spark context and its configuration are set as shown below.

For the generation of the graph, we considered two different CSV files from the provided datasets. The following screenshot depicts loading the two datasets and creating DataFrames.

In order to generate a graph, one requires vertices and edges. We took the simple approach of creating the graph by using groupid (unique values) as the vertices and the group1 and group2 columns from the group-edges.csv file as the edges. The following code snippet depicts the creation of the vertices and edges.

As can be seen in the snippet above, groupid is renamed to id, and the edge columns (group1, group2) are renamed to src and dst, the column names GraphFrames expects.

After the vertices and edges are created, the graph is built using GraphFrame as follows.

Finally, PageRank is run on the graph as shown below.
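The iteration that PageRank performs can be sketched on a tiny in-memory graph; this is plain Scala rather than GraphFrames, and the node ids, edges, and damping factor below are made up for illustration:

```scala
// Tiny directed graph (node ids stand in for group ids).
val edges = Seq(1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1)
val nodes = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
val outDeg = edges.groupBy(_._1).map { case (n, es) => n -> es.size }
val damping = 0.85

// Start from a uniform distribution and iterate the rank update.
var ranks = nodes.map(_ -> 1.0 / nodes.size).toMap
for (_ <- 1 to 20) {
  // Each node sends its current rank, split evenly, along its out-edges.
  val contribs = edges
    .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
    .groupBy(_._1)
    .map { case (n, cs) => n -> cs.map(_._2).sum }
  // Standard PageRank update with a damping factor.
  ranks = nodes
    .map(n => n -> ((1 - damping) / nodes.size + damping * contribs.getOrElse(n, 0.0)))
    .toMap
}
```

GraphFrames performs this same update at scale, attaching the resulting rank to each vertex.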

DATASETS USED

The following links point to the datasets we used for graph creation. For the graph vertices, this CSV file is used: https://www.kaggle.com/stkbailey/nashville-meetup#meta-groups.csv

For the graph edges, this CSV file is used: https://www.kaggle.com/stkbailey/nashville-meetup#group-edges.csv

PARAMETERS

INPUT

Two CSV files are given as input for forming the graph.

OUTPUT

A graph is generated and the PageRank algorithm is run on it.

EVALUATION

A graph was successfully created with the help of GraphFrame, and the PageRank results were generated for both the vertices and the edges.

CONCLUSION

Thus, the creation of a graph from the given datasets and the execution of the specified algorithm were successful.

CONTRIBUTION

We are a team of three and worked collaboratively throughout the lab assignment. Since these questions are not exhaustive in nature, we divided the problems among us. Divya was responsible for creating the graph and running the PageRank algorithm on it. However, each of us has a thorough understanding of every problem.

REFERENCES

  1. https://docs.databricks.com/spark/latest/graph-analysis/graphframes/graph-analysis-tutorial.html
  2. https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html

QUESTION 4 (PART II)

GraphX is Apache Spark's API for creating graphs. It is mainly used for distributed processing of graphs. For instance, if a very large graph, with a large number of vertices and edges, is difficult to process on a single machine, GraphX can be used to parallelize the computation. In our case, groupid is used to create the vertices, and group1 and group2 from the group-edges.csv file are used to create the edges. Finally, the graph is formed as seen above. This approach also provides high speed.