ICP_Module2_Assignment_6 - MadhuriSarode/BDP GitHub Wiki
Madhuri Sarode : 24
Installation
The GraphFrames jar is installed and added to the project as an external jar.
GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It provides high-level APIs in Java, Python, and Scala. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries. You can create GraphFrames from vertex and edge DataFrames.
Assignment - Part 1
1)Import the dataset as a CSV file, create DataFrames directly on import, then create a graph from those DataFrames.
2)Triangle Count
Triangle Count is very useful in social network analysis. A triangle is a small three-node graph in which every pair of nodes is connected. The algorithm counts the triangles passing through each vertex in three straightforward steps:
Compute the set of neighbors for each vertex.
For each edge, compute the intersection of the two endpoints' neighbor sets and send the count to both vertices.
Compute the sum at each vertex and divide by two, since each triangle is counted twice.
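The three steps above can be sketched in plain Python, without Spark, to make the counting logic concrete. This is only an illustration under stated assumptions: the function name `triangle_count` and the sample edge list are hypothetical, and the graph is treated as undirected.

```python
from collections import defaultdict


def triangle_count(edges):
    """Count the triangles passing through each vertex of an undirected graph."""
    # Step 1: compute the set of neighbors for each vertex.
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    # Step 2: for each distinct edge, intersect the two neighbor sets
    # and send the size of the intersection to both endpoints.
    counts = {v: 0 for v in neighbors}
    seen = set()
    for src, dst in edges:
        key = frozenset((src, dst))
        if src == dst or key in seen:
            continue
        seen.add(key)
        common = len(neighbors[src] & neighbors[dst])
        counts[src] += common
        counts[dst] += common

    # Step 3: every triangle through a vertex was reported by two of its
    # incident edges, so halve the sum.
    return {v: c // 2 for v, c in counts.items()}


# Hypothetical sample graph: one triangle (a, b, c) plus a dangling edge c-d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
result = triangle_count(edges)
print(result)
```

Vertices `a`, `b`, and `c` each sit on one triangle, while `d` sits on none.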
3)Find Shortest Paths w.r.t. Landmarks between San Jose Diridon Caltrain Station and San Jose Civic Center
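As a rough illustration of what a landmark-based shortest-path query computes, the sketch below finds the hop count from every station to each landmark using plain Python. It is not the GraphFrames implementation. The function name `shortest_paths_to_landmarks`, the intermediate stop "Station A", and the trip edges are hypothetical; only the two landmark station names come from the assignment.

```python
from collections import defaultdict, deque


def shortest_paths_to_landmarks(edges, landmarks):
    """Return {vertex: {landmark: hop_count}} following edge direction."""
    # Walk edges backwards from each landmark, so one BFS per landmark
    # gives the distance from every vertex to that landmark.
    reverse = defaultdict(list)
    for src, dst in edges:
        reverse[dst].append(src)

    distances = defaultdict(dict)
    for landmark in landmarks:
        distances[landmark][landmark] = 0
        queue = deque([landmark])
        while queue:
            current = queue.popleft()
            for prev in reverse[current]:
                if landmark not in distances[prev]:
                    distances[prev][landmark] = distances[current][landmark] + 1
                    queue.append(prev)
    return dict(distances)


DIRIDON = "San Jose Diridon Caltrain Station"
CIVIC = "San Jose Civic Center"

# Hypothetical trips (start station, end station); not taken from the dataset.
trips = [(DIRIDON, "Station A"), ("Station A", CIVIC), (CIVIC, DIRIDON)]
paths = shortest_paths_to_landmarks(trips, [DIRIDON, CIVIC])
print(paths[DIRIDON][CIVIC], paths[CIVIC][DIRIDON])
```

Because the edges are directed, the distance from one landmark to the other need not equal the distance in the opposite direction.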
4)Apply the PageRank algorithm on the dataset.
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
Two implementations of PageRank are available:
The first uses the standalone Graph interface and runs PageRank for a fixed number of iterations.
The second uses the Pregel interface and runs PageRank until convergence.
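The difference between running for a fixed number of iterations and running until convergence can be shown with a small plain-Python sketch. It uses the textbook normalized formulation, in which ranks sum to 1, and is not the library's exact implementation. The function name `pagerank`, its parameters `reset_prob`, `max_iter`, and `tol`, and the sample edges are all assumptions made for illustration.

```python
def pagerank(edges, reset_prob=0.15, max_iter=None, tol=None):
    """Textbook PageRank: stop after max_iter rounds or once the change is below tol."""
    if max_iter is None and tol is None:
        raise ValueError("set max_iter, tol, or both")

    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    iteration = 0
    while True:
        # Rank held by vertices without out-links is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = reset_prob / n + (1 - reset_prob) * dangling / n
        new_ranks = {v: base for v in vertices}
        # Each vertex splits its damped rank equally among its out-links.
        for src in vertices:
            if out_links[src]:
                share = (1 - reset_prob) * ranks[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_ranks[dst] += share

        delta = max(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        iteration += 1
        if max_iter is not None and iteration >= max_iter:
            break
        if tol is not None and delta < tol:
            break
    return ranks


# Hypothetical directed graph: a cycle a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
fixed = pagerank(edges, max_iter=10)      # fixed number of iterations
converged = pagerank(edges, tol=1e-9)     # run until convergence
print(max(converged, key=converged.get))
```

In this sample, `c` receives links from both `b` and `d`, so it ends up with the highest rank once the values settle.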
5)Save graphs generated to a file.
The graph details are stored in edges and vertices files.
Bonus question
1)Apply Label Propagation Algorithm
Label Propagation Algorithm (LPA) is a simple and fast method for detecting communities within graphs. By construction, the communities obtained by the label propagation process require each node to have at least as many neighbors within its community as it has with each of the other communities.
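A minimal plain-Python sketch of the idea follows: every vertex starts in its own community and repeatedly adopts the label most common among its neighbors. The function name `label_propagation`, the `max_iter` cap, and the sample graph are hypothetical. Updates here are synchronous, and ties are broken by the smallest label purely to keep the output deterministic, which is an assumption of this sketch rather than part of the algorithm's definition.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_iter=5):
    """Assign a community label to each vertex of an undirected graph."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Every vertex starts in its own community.
    labels = {v: v for v in neighbors}
    for _ in range(max_iter):
        new_labels = {}
        for v in neighbors:
            # Count the labels currently held by the neighbors of v.
            counts = Counter(labels[n] for n in neighbors[v])
            best = max(counts.values())
            # Assumption: break ties by taking the smallest label.
            new_labels[v] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break  # no vertex changed its label
        labels = new_labels
    return labels


# Hypothetical graph: two triangles (1, 2, 3) and (4, 5, 6) joined by edge 3-4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
communities = label_propagation(edges)
print(communities)
```

On this sample the two triangles settle into two separate communities, with the bridge edge 3-4 crossing between them.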
2)Apply BFS algorithm
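Breadth-first search visits vertices in order of increasing hop distance from a start vertex, so the first time it reaches the target it has found a path with the fewest edges. The sketch below shows this in plain Python. The function name `bfs_path` and the sample edges are hypothetical and not taken from the dataset.

```python
from collections import defaultdict, deque


def bfs_path(edges, start, goal):
    """Return a fewest-hops path from start to goal, or None if unreachable."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for nxt in adjacency[current]:
            if nxt not in parents:
                parents[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical directed graph.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
path = bfs_path(edges, "a", "e")
print(path)
```

Since the edges are directed, searching in the opposite direction (from `e` to `a`) finds no path in this sample.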