Graph Frames 2 - praveenpoluri/Big-Data-Programing GitHub Wiki
Aim:
Create a DataFrame and a GraphFrame on the given dataset; calculate triangle counts, shortest paths, and PageRank; save the generated graphs to a file; and apply the label propagation algorithm.
Introduction:
About GraphFrames:
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
What is a GraphFrame?
GraphX is to RDDs as GraphFrames are to DataFrames. GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between users). If you are familiar with GraphX, then GraphFrames will be easy to learn. The key difference is that GraphFrames are based upon Spark DataFrames, rather than RDDs. GraphFrames also provide powerful tools for running queries and standard graph algorithms. With GraphFrames, you can easily search for patterns within graphs, find important vertices, and more. Refer to the User Guide for a full list of queries and algorithms.
GraphFrames make it easy to express queries over graphs. Since GraphFrame vertices and edges are stored as DataFrames, many queries are just DataFrame (or SQL) queries.
Examples of GraphFrame queries:
Example: How many users in our social network have age > 35? We can query the vertices DataFrame: `g.vertices.filter("age > 35")`
Example: How many users have at least two followers? We can combine the built-in `inDegrees` method with a DataFrame query: `g.inDegrees.filter("inDegree >= 2")`
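Because vertices and edges are ordinary tables, the in-degree query can be sketched without Spark at all. The follower edges below are made up for illustration; this is only a plain-Python picture of what `inDegrees` computes:

```python
from collections import Counter

# Hypothetical follower edges: (src, dst) means src follows dst.
edges = [("a", "b"), ("c", "b"), ("a", "c"), ("b", "c"), ("d", "a")]

# The inDegree of a vertex is the number of edges pointing at it.
in_degrees = Counter(dst for _, dst in edges)

# Users with at least 2 followers, like g.inDegrees.filter("inDegree >= 2").
popular = sorted(v for v, d in in_degrees.items() if d >= 2)
print(popular)  # ['b', 'c']
```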
**Example on Triangle Count:** Computes the number of triangles passing through each vertex.

```scala
import org.graphframes.{examples, GraphFrame}

val g: GraphFrame = examples.Graphs.friends // get the bundled example graph
val results = g.triangleCount.run()
results.select("id", "count").show()
```
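The Scala snippet above uses the bundled example graph. The underlying computation can be sketched in plain Python on a hypothetical edge list; GraphFrames performs the same neighbor-intersection counting, just distributed over Spark:

```python
from itertools import combinations

# Hypothetical undirected friendship edges; triangle counting ignores
# edge direction.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

# A vertex sits on a triangle for every pair of its neighbors that are
# themselves connected.
triangle_count = {
    v: sum(1 for x, y in combinations(nbrs, 2) if y in neighbors[x])
    for v, nbrs in neighbors.items()
}
print(triangle_count)  # {'a': 1, 'b': 1, 'c': 1, 'd': 0}
```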
GraphFrames PageRank algorithm:
There are two implementations of PageRank.
The first uses the `org.apache.spark.graphx.Graph` interface with `aggregateMessages` and runs PageRank for a fixed number of iterations, selected by setting `maxIter`. The second uses the `org.apache.spark.graphx.Pregel` interface and runs PageRank until convergence, selected by setting `tol`. Both implementations support non-personalized and personalized PageRank, where setting a `sourceId` personalizes the results for that vertex.
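The fixed-iteration variant can be sketched in plain Python. The graph, reset probability, and iteration count below are illustrative only; GraphFrames runs the same recurrence over distributed DataFrames:

```python
# Minimal power-iteration PageRank on a tiny hypothetical digraph,
# mirroring the fixed-iteration (maxIter) variant described above.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "b")]
vertices = sorted({v for e in edges for v in e})

reset_prob = 0.15  # same role as resetProbability in GraphFrames
max_iter = 20      # same role as maxIter
out_links = {v: [d for s, d in edges if s == v] for v in vertices}

rank = {v: 1.0 / len(vertices) for v in vertices}
for _ in range(max_iter):
    # Every vertex keeps the reset mass, plus shares from its in-links.
    new_rank = {v: reset_prob / len(vertices) for v in vertices}
    for v in vertices:
        share = (1 - reset_prob) * rank[v] / len(out_links[v])
        for d in out_links[v]:
            new_rank[d] += share
    rank = new_rank

# Ranks sum to ~1; "b" collects the most mass in this toy graph.
print({v: round(r, 3) for v, r in rank.items()})
```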
Tools:
- Pycharm
- Python
- Spark
- GraphX
- GraphFrames
- Jupyter Notebooks
Implementation:
- Imported the required libraries for Spark, DataFrames, and GraphFrames, and created a SparkContext.
- Created a SQLContext and DataFrames from the given CSV files, and registered a temp view for each DataFrame using the Spark API.
- Created the vertex and edge DataFrames that are passed to the GraphFrame API for graph creation.
- Created a GraphFrame from the vertex and edge DataFrames using the GraphFrames API.
- Calculated the triangle count using the `triangleCount` API of GraphFrames and displayed it.
- Calculated shortest paths to two specific landmarks from the landmarks column using the `shortestPaths` API of GraphFrames.
- Applied the PageRank algorithm on the graph with `resetProbability` and the maximum number of iterations set, then viewed the edges and vertices with their respective PageRank values.
- Saved the generated graphs to a file using the DataFrame write API.
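The `shortestPaths` step returns, for every vertex, the hop distance to each landmark vertex. A minimal pure-Python BFS sketch of that computation follows; the four-vertex cycle and the landmark choice are made up for illustration (GraphFrames computes this with distributed DataFrame operations):

```python
from collections import deque

# Hypothetical directed edges forming a cycle, and two landmark vertices.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
landmarks = ["a", "d"]

vertices = sorted({v for e in edges for v in e})
# Reversed adjacency: predecessors of each vertex.
rev = {v: [s for s, d in edges if d == v] for v in vertices}

def distances_to(landmark):
    # BFS over reversed edges gives each vertex's hop count TO the landmark.
    dist = {landmark: 0}
    queue = deque([landmark])
    while queue:
        u = queue.popleft()
        for p in rev[u]:
            if p not in dist:
                dist[p] = dist[u] + 1
                queue.append(p)
    return dist

shortest = {v: {lm: distances_to(lm).get(v) for lm in landmarks}
            for v in vertices}
print(shortest["b"])  # {'a': 3, 'd': 2}
```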
Bonus:
- Applied label propagation using the `labelPropagation` API of GraphFrames and displayed the result.
- Viewed the path for the id column.
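Label propagation itself is easy to sketch without Spark. Below is a minimal synchronous pure-Python version on a made-up graph of two triangles joined by one bridge edge. The smallest-label tie-breaking rule is an assumption added here for determinism; GraphFrames' `labelPropagation` may resolve ties differently and can be nondeterministic:

```python
from collections import Counter

# Two hypothetical three-vertex communities joined by a bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

# Each vertex starts with its own label, then synchronously adopts the
# label most common among its neighbors (smallest label wins ties).
labels = {v: v for v in sorted(neighbors)}
for _ in range(5):  # plays the role of maxIter
    new_labels = {}
    for v, nbrs in neighbors.items():
        counts = Counter(labels[n] for n in nbrs)
        new_labels[v] = min(counts, key=lambda lbl: (-counts[lbl], lbl))
    labels = new_labels

print(labels)  # each triangle converges to a single shared label
```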
Limitations:
- The API is not supported in most languages.
- Motifs are not allowed to contain edges without any named elements, such as "()-[]->()".
Conclusion:
Created a DataFrame and a GraphFrame on the given dataset; calculated triangle counts, shortest paths, and PageRank; saved the generated graphs to a file; and applied the label propagation algorithm.