icp6m2 - gracesyl/big-data-hadoop GitHub Wiki

Continuing from the previous ICP, this session covers graph algorithms, including path-finding algorithms, PageRank, and triangle counting.

• Breadth-first search (BFS) finds the shortest path(s) from one vertex (or a set of vertices) to another vertex (or a set of vertices). The beginning and end vertices are specified as Spark DataFrame expressions.

• PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page. It works by counting the number and quality of links to a page to produce a rough estimate of how important the website is.

• Triangle Count computes the number of triangles passing through each vertex.

The computations, as listed in the lesson plan, are as follows:

1.Import the dataset as a CSV file, create DataFrames directly on import, then create a graph from those DataFrames.

2.Run Triangle Count:
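To make the import and Triangle Count steps concrete, here is a minimal pure-Python sketch. It does not use Spark; it only illustrates what is being computed. The inline CSV text, the `src`/`dst` column names, and the helper names `load_edges`, `build_undirected_adjacency`, and `triangle_count` are all assumptions made for illustration, not part of the lesson's dataset or code.

```python
import csv
import io
from itertools import combinations

# Hypothetical edge list standing in for the lesson's CSV file.
CSV_TEXT = """src,dst
a,b
b,c
c,a
c,d
"""


def load_edges(text):
    """Parse 'src,dst' rows into a list of (src, dst) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["src"], row["dst"]) for row in reader]


def build_undirected_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adjacency = {}
    for src, dst in edges:
        if src == dst:
            continue
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency


def triangle_count(adjacency):
    """Count, for every vertex, the triangles that pass through it."""
    counts = {vertex: 0 for vertex in adjacency}
    for vertex, neighbours in adjacency.items():
        # A triangle through `vertex` is a pair of its neighbours
        # that are also connected to each other.
        for first, second in combinations(sorted(neighbours), 2):
            if second in adjacency[first]:
                counts[vertex] += 1
    return counts


edges = load_edges(CSV_TEXT)
adjacency = build_undirected_adjacency(edges)
triangles = triangle_count(adjacency)
print(triangles)
```

In this toy graph the vertices `a`, `b`, and `c` form a single triangle, while `d` hangs off `c` and belongs to none.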

3.Find Shortest Paths w.r.t. Landmarks:
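As a rough illustration of shortest paths with respect to landmarks, the sketch below computes, for each vertex, the number of hops needed to reach each landmark along directed edges. It is plain Python rather than Spark, and the edge list and the function name `shortest_paths_to_landmarks` are assumptions chosen for the example.

```python
from collections import deque


def shortest_paths_to_landmarks(edges, landmarks):
    """Return {vertex: {landmark: hop_count}} for reachable landmarks."""
    vertices = set()
    predecessors = {}  # dst -> list of vertices with an edge into dst
    for src, dst in edges:
        vertices.update((src, dst))
        predecessors.setdefault(dst, []).append(src)

    result = {vertex: {} for vertex in vertices}
    for landmark in landmarks:
        if landmark not in vertices:
            continue
        # BFS backwards from the landmark: a vertex found after k steps
        # can reach the landmark in k hops along the original edges.
        distance = {landmark: 0}
        queue = deque([landmark])
        while queue:
            current = queue.popleft()
            for previous in predecessors.get(current, []):
                if previous not in distance:
                    distance[previous] = distance[current] + 1
                    queue.append(previous)
        for vertex, hops in distance.items():
            result[vertex][landmark] = hops
    return result


# Hypothetical directed edges, not the lesson's dataset.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
distances = shortest_paths_to_landmarks(edges, ["a", "d"])
print(distances)
```

Vertices that cannot reach a landmark simply have no entry for it, so `b` lists only `d` here because no edge leads back to `a`.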

4.Apply Page Rank algorithm on the dataset:
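The idea behind PageRank can be sketched with a textbook power iteration. This is not Spark's implementation, and the scale of the scores may differ from what Spark reports; the damping factor of 0.85, the fixed iteration count, the edge list, and the function name `pagerank` are assumptions for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; ranks sum to 1."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(vertices)
    ranks = {vertex: 1.0 / count for vertex in vertices}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = (1.0 - damping) / count + damping * dangling / count
        updated = {vertex: base for vertex in vertices}
        for vertex in vertices:
            targets = out_links[vertex]
            if targets:
                share = damping * ranks[vertex] / len(targets)
                for target in targets:
                    updated[target] += share
        ranks = updated
    return ranks


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
edges = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(vertex, round(score, 4))
```

Page `c` ends up with the highest score because it collects links from every other page, and `a` comes second because its only inbound link comes from the highly ranked `c`. This matches the "number and quality of links" intuition.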

5.Save graphs generated to a file.
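One simple way to persist a graph is to write its vertices and edges as two CSV files and read them back later. The sketch below does this with the Python standard library only, as a stand-in for the Spark write step; the file names `vertices.csv` and `edges.csv`, the column names, and the helpers `save_graph` and `load_graph` are assumptions for illustration.

```python
import csv
import os
import tempfile


def save_graph(vertices, edges, directory):
    """Write vertices and edges to two CSV files and return their paths."""
    os.makedirs(directory, exist_ok=True)
    vertex_path = os.path.join(directory, "vertices.csv")
    edge_path = os.path.join(directory, "edges.csv")
    with open(vertex_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id"])
        writer.writerows([vertex] for vertex in vertices)
    with open(edge_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["src", "dst"])
        writer.writerows(edges)
    return vertex_path, edge_path


def load_graph(vertex_path, edge_path):
    """Read the two CSV files back into vertex and edge lists."""
    with open(vertex_path, newline="") as handle:
        vertices = [row["id"] for row in csv.DictReader(handle)]
    with open(edge_path, newline="") as handle:
        edges = [(row["src"], row["dst"]) for row in csv.DictReader(handle)]
    return vertices, edges


# Hypothetical graph; a temporary directory keeps the example self-cleaning.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
with tempfile.TemporaryDirectory() as out_dir:
    paths = save_graph(vertices, edges, out_dir)
    restored_vertices, restored_edges = load_graph(*paths)
print(restored_vertices, restored_edges)
```

Keeping vertices and edges in separate files mirrors the vertex and edge tables that make up the graph, so each file can be reloaded as its own DataFrame.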

## Bonus:

1.Apply Label Propagation Algorithm:
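Label propagation detects communities by letting every vertex repeatedly adopt the label that is most common among its neighbours. The following is a minimal deterministic sketch in plain Python, not Spark's implementation: it updates all vertices at once and breaks ties by taking the smallest label, which is an assumption made here so the result is reproducible. The edge list and the name `label_propagation` are likewise hypothetical.

```python
from collections import Counter


def label_propagation(edges, max_iter=5):
    """Synchronous label propagation on an undirected graph."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Every vertex starts in its own community.
    labels = {vertex: vertex for vertex in adjacency}
    for _ in range(max_iter):
        updated = {}
        for vertex, neighbours in adjacency.items():
            counts = Counter(labels[n] for n in neighbours)
            best = max(counts.values())
            # Assumed tie-break: the smallest of the most frequent labels.
            updated[vertex] = min(
                label for label, seen in counts.items() if seen == best
            )
        if updated == labels:
            break  # labels are stable, stop early
        labels = updated
    return labels


# Two hypothetical triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
labels = label_propagation(edges)
print(labels)
```

The two triangles settle into two different labels even though one edge connects them. Note that label propagation in general is not guaranteed to converge, which is why the iteration count is capped.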

2.Apply BFS algorithm:
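The BFS step can be sketched in plain Python as follows. To mirror the idea of specifying the start and end vertices as expressions, the sketch takes two predicate functions; this is an assumption for illustration and not the Spark API. Unlike a search that returns every shortest path, it returns only the first one found. The edge list and the name `bfs_path` are hypothetical.

```python
from collections import deque


def bfs_path(edges, is_start, is_goal):
    """Return one shortest directed path from a start to a goal vertex."""
    vertices = set()
    adjacency = {}
    for src, dst in edges:
        vertices.update((src, dst))
        adjacency.setdefault(src, []).append(dst)

    starts = sorted(vertex for vertex in vertices if is_start(vertex))
    parent = {vertex: None for vertex in starts}
    queue = deque(starts)
    while queue:
        current = queue.popleft()
        if is_goal(current):
            # Walk the parent links back to a start vertex.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for neighbour in adjacency.get(current, []):
            if neighbour not in parent:
                parent[neighbour] = current
                queue.append(neighbour)
    return None  # no goal vertex is reachable


# Hypothetical directed edges with a short and a long route from "a" to "f".
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "c"), ("c", "f")]
path = bfs_path(edges, lambda v: v == "a", lambda v: v == "f")
no_path = bfs_path(edges, lambda v: v == "f", lambda v: v == "a")
print(path, no_path)
```

Because BFS explores vertices in order of distance from the start, the first goal vertex it dequeues is reached by a path with the fewest edges.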

References:

1.https://spark.apache.org/docs/latest/graphx-programming-guide.html#vertex-and-edge-rdds

2.learn pyspark book.