CASE4 - RoshiniVarada/BDP_Project2 GitHub Wiki
Hello and Welcome!
Greetings, dear UMKC staff members, and welcome to this wiki, which walks through the process of solving and executing case no. 4 of project-based exam 2 in our beloved course, Big Data Programming.
Zakari, Abdulmuhaymin. Class ID: 25
Team Members
- Roshini Varada
- Zakari, Abdulmuhaymin
- Sarika Reddy Kota
- Pallavi Arikatla
Introduction
Big data is the trend these days among major businesses specializing in technology development. In this course, the students were exposed to many kinds of platforms and techniques that help facilitate the process of learning these technologies. In this exam, the students work on different tasks regarding Apache Spark, a platform that works on top of the Hadoop file system. In this wiki, we will walk through the solving process of case 4, which is to perform page ranking on a selected dataset. So without further ado, let's get started!
Idea
The idea of the project is to apply several techniques based on the Spark platform to solve different kinds of problems that relate to big data in general. One of those cases is page ranking, and that is the problem we will be solving in this wiki.
Algorithm
PageRank is one of Google's inventions; it ranks websites based on how many pages point toward a specific page. Suppose we have a YouTube link with many pages pointing toward it: that link will be assigned a higher rank, and thus a higher probability of showing up among the first results of a search engine.
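For reference, the classic PageRank update rule (a textbook formulation, not quoted from the course material) looks like this:

```latex
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

Here `d` is the damping factor (typically 0.85), `N` is the total number of pages, `B_u` is the set of pages linking to `u`, and `L(v)` is the number of outgoing links of `v`. In GraphFrames' implementation, which we use below, the reset probability corresponds to `1 - d`.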
Usage in the world today
PageRank is one of the most well-known algorithms, widely used by search engines to rank every web page available on the internet. This yields many useful outputs, such as deciding who gets placed on the first page of the search results, based on the importance and number of pointers toward one specific page.
Dataset
The instructor gave the students the chance to decide which dataset to work on among several selected datasets. The team decided to work on the following dataset: https://www.kaggle.com/stkbailey/nashville-meetup
This dataset from Kaggle contains the meetup networks from the well-known website meetup.com; it also contains the group IDs, names, and relationships between those groups, which will be used later to build our graph in Apache Spark.
The two files we will be working on are the following:
The first file holds the edge information. The edges are defined between the two columns, Group1 and Group2, and the third column contains the edge weight as an integer value.
The second file contains the meta group information. This file will also be used to look up the groups' names based on their IDs so they are represented well in the results later on!
Design
The GraphFrame library for Spark accepts the previous datasets and builds a frame that holds the data in a graph representation; the result will be similar to the following image:
The vertices will contain the IDs of those groups, and the edges will point from one vertex to another, taking the weights of those edges into consideration as well.
Technically speaking...
The following segment of this wiki will be about the workflow of the source code.
The first segment of the code imports the required libraries and configures a Spark session. These are the usual statements that need to be present, since we're using the IntelliJ IDEA environment to solve this problem. After doing that, we need to read the datasets in order to start grabbing the required columns.
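A minimal sketch of this setup, assuming a local run (the object name, app name, and master setting are illustrative, not taken from the original source):

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object Case4PageRank {
  def main(args: Array[String]): Unit = {
    // Build a Spark session; "local[*]" runs on all local cores inside IntelliJ IDEA.
    val spark = SparkSession.builder()
      .appName("Case4PageRank")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR") // keep the console output readable
  }
}
```

Note that GraphFrames ships as a separate Spark package, so the graphframes dependency has to be added to the build before the import resolves.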
As mentioned, this is where we read those files. Please keep in mind that they are available under the directory input/, and we're using the CSV reading method with the header option on to read these files flawlessly.
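A sketch of the reading step; the file names group-edges.csv and meta-groups.csv are assumed from the Kaggle dataset listing and may differ from the ones used in the original code:

```scala
// Read both CSV files with the header option on, letting Spark infer column types.
val edgesRaw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input/group-edges.csv")   // assumed file name

val groupsRaw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input/meta-groups.csv")   // assumed file name
```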
Next, it's time to initiate our graph! The first two lines create two data frames that hold the required columns; we had to rename the columns, since the predefined method only accepts columns with a specific naming scheme. After that, the graph is stored in a variable called myGraph, and then, most importantly, the pageRank method is called to rank the vertices and edges of this particular graph. The tol() method accepts a floating-point value and specifies the convergence tolerance of the PageRank algorithm; the smaller the value, the more accurate the resulting ranks (at the cost of more iterations). Lastly, the last two variables hold the results from our data frame, to be shown as well as saved later in a CSV file.
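A minimal sketch of this step, assuming the source column names group_id and group_name for the meta file (the exact renames in the original code may differ); GraphFrame expects vertices with an "id" column and edges with "src" and "dst" columns:

```scala
import org.apache.spark.sql.functions.col

// Rename the columns to the scheme GraphFrame expects.
val vertices = groupsRaw.select(
  col("group_id").as("id"),       // assumed source column name
  col("group_name").as("name"))   // assumed source column name
val edges = edgesRaw.select(
  col("Group1").as("src"),
  col("Group2").as("dst"),
  col("weight").as("count"))      // renamed so it won't clash with the "weight" column PageRank emits

val myGraph = GraphFrame(vertices, edges)

// Iterate until the ranks converge within the given tolerance.
val ranks = myGraph.pageRank.resetProbability(0.15).tol(0.01).run()

val rankedVertices = ranks.vertices // gains a "pagerank" column
val rankedEdges    = ranks.edges    // gains a generated "weight" column
```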
These are the last lines of our code, where we print the values of our data frame "results" and save them into CSV files so they are well represented.
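A sketch of this final step (the output paths and the coalesce to a single file are illustrative choices, not taken from the original code):

```scala
// Print the top-ranked groups, then persist both result sets as CSV.
rankedVertices.orderBy(col("pagerank").desc).show(10, truncate = false)

rankedVertices.coalesce(1).write.option("header", "true").mode("overwrite").csv("output/ranked-vertices")
rankedEdges.coalesce(1).write.option("header", "true").mode("overwrite").csv("output/ranked-edges")
```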
Results
The results, as shown, contain the group names, since the IDs were matched against the meta group dataset, alongside the page ranking results, a new column generated by the algorithm we just ran that holds the ranks of the vertices. As you can see, Agile Nashville is the winner with a whopping 18.14 score.
The second resulting file contains the edges with their new weights. This is also a generated column that holds the new weights compared to the previous values.
Challenges
Thankfully, this part of the project did not face any serious challenges. The only issue that appeared on my side is that the virtual image I was using stopped working for some reason, so I had to re-implement the environment on my Windows machine; thankfully, that went flawlessly.
Milestones
This segment of the project had two main milestones. The first one was the research part, where I looked deeply into the dataset we have as well as the GraphX library I would be working with for the next few days. The second one was the implementation, where I started working on this problem directly and began connecting the dots, literally!
Video
Please follow this link to see a demonstration video for this wiki: https://youtu.be/7_oNwUCLTbU
Team members' contributions
Each member of the team handled one problem, as shown in the following list:
- Roshini Varada -- Hadoop MapReduce Algorithm -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE1
- Sarika Reddy Kota -- Spark Data Frames -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE2
- Pallavi Arikatla -- Spark streaming -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE3
- Zakari, Abdulmuhaymin -- Spark Graphx -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE4
Conclusion
This algorithm is pretty useful for many graph-based applications, such as web search engines and bioinformatics studies. The previous solution showed quite impressive results, even though this dataset is arguably humble. Graph-based applications are just one part of Spark out of many that we can leverage to help solve issues in our big data world.