ICP 13 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson Plan13: Graph Frames and GraphX Algorithms

Lesson Plan Description: Distributed Collection of Data

Part – 1:

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

image

1. Import the dataset as a csv file and create data frames directly on import than create graph out of the data frame created.

image

image

image

image

2. Triangle Count

Triangle count do compute the number of triangles passing through each vertex. The algorithm for the triangle cout is as follows:

  • First, Computes set of neighbors for each vertex
  • For each edge it computes intersection of sets and sends the count to both vertices.
  • Then Compute the sum at each vertex and then we divide by two since each triangle is counted twice. In this below code we have used trianglecount.run() to remove if any duplicates or any self edges are there.

image

image

3. Find Shortest Paths w.r.t. Landmarks

Shortest path algorithms are usually greedy in nature. at the first step calculate the distance between current node to next node. move the pointer with weight added to all neighbor nodes. keep doing the same until the result node is arrived. having sum of values between all nodes, we should be able to find out shortest of all paths until the last node is met.

image

image

4. Apply Page Rank algorithm on the dataset.

Page rank algorithm is the algorithm used by google search to rank weights in their search engines and this is named after larry page. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important is the website.In this below code we run the page rank until convergence to tolerance and then displaying the final results.

image

image

image

5. Save graphs generated to a file.

In this below code the created graphs are saved into the csv files

image

image

Bonus:

1. Apply Label Propagation Algorithm

It is a semi-supervised algorithm that assigns labels to previously unlabeled data points. That is what we call as Label Propagation. Initially, a (generally small) subset of the data points has labels. Afterwards, those labels propagate to further unlabeled points throughout the algorithm.

image

image

image

2. Apply BFS algorithm

Breadth-first search is an algorithm for traversing tree or graph data structures and it starts at the tree root, and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level. The below code in which we have executed the bfs tree on the id column.

image

image