GraphFrame PySpark
To work with graphs in Python on Spark, we need the GraphFrames package
(https://spark-packages.org/package/graphframes/graphframes).
Select the release that matches your Spark version
(e.g. my Spark version is 3.1.1):
wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.1-s_2.12/graphframes-0.8.2-spark3.1-s_2.12.jar
Unpack it (assume the jar was downloaded to /home/hadoop):
jar xf graphframes-0.8.2-spark3.1-s_2.12.jar
This extracts a graphframes folder under /home/hadoop. Then move the jar into Spark's jars directory:
mv graphframes-0.8.2-spark3.1-s_2.12.jar /home/hadoop/spark/jars
Next, download the dependency bundle jars.tar.gz given below:
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/spark/jars.tar.gz
Untar it and copy the contents to /home/hadoop/.ivy2/jars:
tar xvf jars.tar.gz
mkdir -p ~/.ivy2/jars
mv jars/* ~/.ivy2/jars/
(~/.ivy2/jars is created if it does not exist. It is the default Ivy cache directory that Spark uses to store downloaded dependencies.)
Then add the required paths to PYTHONPATH in /home/hadoop/spark/conf/spark-env.sh.
(If /home/hadoop/spark/conf/spark-env.sh does not exist, create it from the template:
cp /home/hadoop/spark/conf/spark-env.sh.template /home/hadoop/spark/conf/spark-env.sh)
Obtain the library (if you have not downloaded the jar already):
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/spark/graphframes-0.8.2-spark3.1-s_2.12.jar
Move it to the Spark jars folder:
mv graphframes-0.8.2-spark3.1-s_2.12.jar ~/spark/jars
Then edit /home/hadoop/spark/conf/spark-env.sh (e.g. with vi) and add the following lines:
export PYTHONPATH=$PYTHONPATH:/home/hadoop/.ivy2/jars:/home/hadoop/spark/jars:.
export PATH=$PATH:/home/hadoop/.local/bin
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH
(Adjust the py4j zip name to match the version under $SPARK_HOME/python/lib; Spark 3.1.1, for example, ships py4j-0.10.9-src.zip.)
Invoke pyspark from the directory that contains your extracted graphframes folder:
pyspark
OR, if you have a port conflict, pass the jar explicitly and choose another UI port:
pyspark --jars /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar --conf spark.ui.port=4099
OR
pyspark --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 --conf spark.ui.port=4099
(*** The --packages form no longer works because its package repository is down. ***)
Following the instructions in https://graphframes.github.io/quick-start.html
1. We need to prepare two DataFrames, one for vertices (nodes) and one for edges. The DataFrames can be constructed from manually typed data points (ideal for testing and small data sets), from a Hive query, or simply from a CSV (text) file using the approaches explained in the first post (CSV -> RDD -> DataFrame).
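For instance, a minimal sketch of building the two DataFrames from CSV files; the file names here are hypothetical, and GraphFrames only requires an "id" column for vertices and "src"/"dst" columns for edges:
>>> from graphframes import GraphFrame
>>> # hypothetical CSV files; adjust paths and columns to your data
>>> v = spark.read.csv("vertices.csv", header=True, inferSchema=True)  # must contain an "id" column
>>> e = spark.read.csv("edges.csv", header=True, inferSchema=True)     # must contain "src" and "dst" columns
>>> g = GraphFrame(v, e)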
Try the following example
(adapted from https://towardsdatascience.com/graphframes-in-jupyter-a-practical-guide-9b3b346cebc5).
To import GraphFrames:
>>> from graphframes import *
# create the vertices (nodes) DataFrame
>>> v = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
('2', 'May', 'Derrick', 26),
('3', 'Mills', 'Jeff', 80),
('4', 'Hood', 'Robert', 65),
('5', 'Banks', 'Mike', 93),
('98', 'Berg', 'Tim', 28),
('99', 'Page', 'Allan', 16)],
['id', 'name', 'firstname', 'age'])
# create the edges DataFrame
>>> e = spark.createDataFrame([('1', '2', 'friend'),
('2', '1', 'friend'),
('3', '1', 'friend'),
('1', '3', 'friend'),
('2', '3', 'follows'),
('3', '4', 'friend'),
('4', '3', 'friend'),
('5', '3', 'friend'),
('3', '5', 'friend'),
('4', '5', 'follows'),
('98', '99', 'friend'),
('99', '98', 'friend')],
['src', 'dst', 'type'])
# construct the GraphFrame
>>> g = GraphFrame(v, e)
The basic graph functions that can be used in PySpark are the following:
vertices
edges
inDegrees
outDegrees
degrees
#show indegree
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| 1| 2|
| 2| 1|
| 3| 4|
| 4| 1|
| 5| 2|
| 98| 1|
| 99| 1|
+---+--------+
# count edges with the 'follows' label
>>> g.edges.filter("type = 'follows'").count()
2
>>> g.vertices.show()
+---+------+---------+---+
| id| name|firstname|age|
+---+------+---------+---+
| 1|Carter| Derrick| 50|
| 2| May| Derrick| 26|
| 3| Mills| Jeff| 80|
| 4| Hood| Robert| 65|
| 5| Banks| Mike| 93|
| 98| Berg| Tim| 28|
| 99| Page| Allan| 16|
+---+------+---------+---+
>>> g.edges.show()
+---+---+-------+
|src|dst| type|
+---+---+-------+
| 1| 2| friend|
| 2| 1| friend|
| 3| 1| friend|
| 1| 3| friend|
| 2| 3|follows|
| 3| 4| friend|
| 4| 3| friend|
| 5| 3| friend|
| 3| 5| friend|
| 4| 5|follows|
| 98| 99| friend|
| 99| 98| friend|
+---+---+-------+
# sort vertices by degree
>>> inDegreeDF=g.inDegrees
>>> outDegreeDF=g.outDegrees
>>> degreeDF=g.degrees
>>>
>>> inDegreeDF.sort(['inDegree'],ascending=[0]).show() # Sort and show
+---+--------+
| id|inDegree|
+---+--------+
| 3| 4|
| 5| 2|
| 1| 2|
| 4| 1|
| 99| 1|
| 98| 1|
| 2| 1|
+---+--------+
>>> outDegreeDF.sort(['outDegree'],ascending=[0]).show()
+---+---------+
| id|outDegree|
+---+---------+
| 3| 3|
| 1| 2|
| 2| 2|
| 4| 2|
| 5| 1|
| 98| 1|
| 99| 1|
+---+---------+
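You can also line the in-degree and out-degree tables up side by side with an ordinary DataFrame join; a minimal sketch on the DataFrames created above (output omitted):
>>> # outer join keeps vertices that appear in only one of the two tables
>>> inDegreeDF.join(outDegreeDF, on="id", how="outer").sort("id").show()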
The graph algorithms introduced in GraphFrames so far, which are very handy, are the following:
Motif finding
Subgraphs
Breadth-first search (BFS)
Connected components
Strongly connected components
Label Propagation
PageRank
Shortest paths
Triangle count
Example: Page rank
#find page rank
>>> results = g.pageRank(resetProbability=0.01, maxIter=20)
>>> results.vertices.select("id", "pagerank").show()
+---+-------------------+
| id| pagerank|
+---+-------------------+
| 3| 1.9926565142372492|
| 98| 1.0000000000000004|
| 99| 1.0000000000000004|
| 5| 0.9980278491870183|
| 1| 0.890802837490737|
| 4| 0.66757817627641|
| 2|0.45093462280858443|
+---+-------------------+
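Besides the vertex scores, the GraphFrame returned by pageRank also carries the edge transition weights; a minimal sketch (output omitted):
>>> # each edge in the result has a "weight" column used by PageRank
>>> results.edges.select("src", "dst", "weight").show()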
# Run PageRank personalized for vertex "1"
>>> results3 = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="1")
>>> results3.vertices.select("id", "pagerank").show()
+---+-------------------+
| id| pagerank|
+---+-------------------+
| 1|0.30079283470815427|
| 3|0.33891806394017526|
| 2| 0.1272219398157718|
| 4|0.09616868544059574|
| 98| 0.0|
| 5|0.13689847609530284|
| 99| 0.0|
+---+-------------------+
# Run pagerank until convergence
>>> results = g.pageRank(resetProbability=0.15, tol=0.01)
>>> results.vertices.select("id", "pagerank").show()
+---+------------------+
| id| pagerank|
+---+------------------+
| 1|0.9055074972891308|
| 3| 1.853919642738813|
| 2|0.5377967999474921|
| 4|0.6873519241384106|
| 98|1.0225331112091938|
| 5|0.9703579134677663|
| 99|1.0225331112091938|
+---+------------------+
Example: Motif finding:
http://graphframes.github.io/user-guide.html#motif-finding
Motif finding refers to searching for structural patterns in a graph.
GraphFrame motif finding uses a simple Domain-Specific Language (DSL) for expressing structural queries. For example, graph.find("(a)-[e]->(b); (b)-[e2]->(a)") will search for pairs of vertices a,b connected by edges in both directions. It will return a DataFrame of all such structures in the graph, with columns for each of the named elements (vertices or edges) in the motif.
## motif finding
>>> from graphframes.examples import Graphs
>>> g = Graphs(sqlContext).friends() # Get example graph
>>> g.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| a| b| friend|
| b| c| follow|
| c| b| follow|
| f| c| follow|
| e| f| follow|
| e| d| friend|
| d| a| friend|
+---+---+------------+
>>> g.vertices.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| a| Alice| 34|
| b| Bob| 36|
| c|Charlie| 30|
| d| David| 29|
| e| Esther| 32|
| f| Fanny| 36|
+---+-------+---+
# Search for pairs of vertices with edges in both directions between them.
>>> motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
>>> motifs.show()
+----------------+--------------+----------------+--------------+
| a| e| b| e2|
+----------------+--------------+----------------+--------------+
|{c, Charlie, 30}|{c, b, follow}| {b, Bob, 36}|{b, c, follow}|
| {b, Bob, 36}|{b, c, follow}|{c, Charlie, 30}|{c, b, follow}|
+----------------+--------------+----------------+--------------+
# More complex queries can be expressed by applying filters.
>>> motifs.filter("b.age > 30").show()
+----------------+--------------+------------+--------------+
| a| e| b| e2|
+----------------+--------------+------------+--------------+
|{c, Charlie, 30}|{c, b, follow}|{b, Bob, 36}|{b, c, follow}|
+----------------+--------------+------------+--------------+
Back on the first graph g = GraphFrame(v, e) built earlier (note the numeric vertex ids in the output below), we can try to find the mutual friends for any pair of users a and c. In order to be a mutual friend b, b must be a friend of both a and c (and not just followed by c, for example).
>>> mutualFriends = g.find("(a)-[]->(b); (b)-[]->(c); (c)-[]->(b); (b)-[]->(a)").dropDuplicates()
>>> mutualFriends.filter('a.id == 2 and c.id == 3').show()
+--------------------+--------------------+--------------------+
| a| b| c|
+--------------------+--------------------+--------------------+
|[2, May, Derrick,...|[1, Carter, Derri...|[3, Mills, Jeff, 80]|
+--------------------+--------------------+--------------------+
Many motif queries are stateless and simple to express, as in the examples above. The next examples demonstrate more complex queries which carry state along a path in the motif. Such queries combine GraphFrame motif finding with filters on the result, where the filters use sequence operations to construct a series of DataFrame Columns.
For example, suppose one wishes to identify a chain of 4 vertices with some property defined by a sequence of functions. That is, among chains of 4 vertices a->b->c->d, identify the subset of chains matching this complex filter:
1. Initialize state on the path.
2. Update state based on vertex a.
3. Update state based on vertex b.
4. Etc. for c and d.
If the final state matches some condition, the chain is accepted by the filter.
>>> from functools import reduce
>>> from pyspark.sql.functions import col, lit, when
>>> from pyspark.sql.types import IntegerType
>>> chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
Query on a sequence, with state (cnt):
(a) Define a method for updating the state given the next element of the motif.
>>> sumFriends = \
lambda cnt, relationship: when(relationship == "friend", cnt + 1).otherwise(cnt)
(b) Use a sequence operation to apply the method to the sequence of elements in the motif.
In this case, the elements are the 3 edges.
>>> condition = \
reduce(lambda cnt, e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
(c) Apply the filter to the DataFrame.
>>> chainWith2Friends2 = chain4.where(condition >= 2)
>>> chainWith2Friends2.show()
+---------------+--------------+--------------+--------------+--------------+--------------+----------------+
| a| ab| b| bc| c| cd| d|
+---------------+--------------+--------------+--------------+--------------+--------------+----------------+
| [d, David, 29]|[d, a, friend]|[a, Alice, 34]|[a, b, friend]| [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
|[e, Esther, 32]|[e, d, friend]|[d, David, 29]|[d, a, friend]|[a, Alice, 34]|[a, b, friend]| [b, Bob, 36]|
+---------------+--------------+--------------+--------------+--------------+--------------+----------------+
Example: subgraph()
GraphFrames provides access to the edge triplet (edge, src vertex, and dst vertex, plus attributes) and allows the user to select a subgraph based on triplet and vertex filters.
There are three helper methods for subgraph selection: filterVertices(condition), filterEdges(condition), and dropIsolatedVertices().
-Select subgraph of users older than 30, and relationships of type "friend".
-Drop isolated vertices (users) which are not contained in any edges (relationships).
>>> g1 = g.filterVertices("age > 30").filterEdges("relationship = 'friend'").dropIsolatedVertices()
>>> g1.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| a| b| friend|
+---+---+------------+
Select subgraph based on edges "e" of type "follow"
pointing from a younger user "a" to an older user "b".
paths = g.find("(a)-[e]->(b)")\
.filter("e.relationship = 'follow'")\
.filter("a.age < b.age")
"paths" contains vertex info. Extract the edges.
e2 = paths.select("e.src", "e.dst", "e.relationship")
In Spark 1.5+, the user may simplify this call:
e2 = paths.select("e.*")
Construct the subgraph
g2 = GraphFrame(g.vertices, e2)
g2.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| e| f| follow|
| c| b| follow|
+---+---+------------+
Example: Breadth first search
The beginning and end vertices are specified as Spark DataFrame expressions.
Search from "Esther" for users of age < 32.
>>> paths = g.bfs("name = 'Esther'", "age < 32")
>>> paths.show()
+---------------+--------------+--------------+
| from| e0| to|
+---------------+--------------+--------------+
|[e, Esther, 32]|[e, d, friend]|[d, David, 29]|
+---------------+--------------+--------------+
Specify edge filters or max path lengths.
>>> path2 = g.bfs("name = 'Esther'", "age < 32",edgeFilter="relationship != 'friend'", maxPathLength=3)
>>> path2.show()
+---------------+--------------+--------------+--------------+----------------+
| from| e0| v1| e1| to|
+---------------+--------------+--------------+--------------+----------------+
|[e, Esther, 32]|[e, f, follow]|[f, Fanny, 36]|[f, c, follow]|[c, Charlie, 30]|
+---------------+--------------+--------------+--------------+----------------+
Example: (Strongly) connected components
NOTE: With GraphFrames 0.3.0 and later releases, the default Connected Components algorithm requires setting a Spark checkpoint directory. Users can revert to the old algorithm using connectedComponents.setAlgorithm("graphx").
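In the Python API this fallback is exposed as a keyword argument; a minimal sketch, assuming the GraphFrames 0.8.x signature where connectedComponents accepts algorithm=:
>>> # the GraphX-based implementation does not require a checkpoint directory
>>> result = g.connectedComponents(algorithm="graphx")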
Otherwise, set a checkpoint directory before running the default algorithm (sc is the SparkContext already available in the pyspark shell):
>>> sc.setCheckpointDir("/tmp/spark")
>>> result = g.connectedComponents()
>>> result.select("id", "component").orderBy("component").show()
+---+------------+
| id| component|
+---+------------+
| b|412316860416|
| c|412316860416|
| d|412316860416|
| a|412316860416|
| e|412316860416|
| f|412316860416|
+---+------------+
Strongly connected components assign each vertex the component of the strongly connected subgraph containing it:
>>> result = g.stronglyConnectedComponents(maxIter=10)
>>> result.select("id", "component").orderBy("component").show()
+---+-------------+
| id| component|
+---+-------------+
| f| 412316860416|
| e| 670014898176|
| d| 807453851648|
| b|1047972020224|
| c|1047972020224|
| a|1460288880640|
+---+-------------+
Example: Label Propagation (LPA)
Run the static Label Propagation Algorithm for detecting communities in networks.
Each node in the network is initially assigned to its own community. At every superstep, nodes send their community affiliation to all neighbors and update their state to the mode community affiliation of incoming messages.
LPA is a standard community detection algorithm for graphs. It is very inexpensive computationally, although (1) convergence is not guaranteed and (2) one can end up with trivial solutions (all nodes are identified into a single community).
>>> result = g.labelPropagation(maxIter=5)
>>> result.select("id", "label").show()
+---+-------------+
| id| label|
+---+-------------+
| b|1047972020224|
| e| 412316860416|
| a|1382979469312|
| f| 670014898176|
| d| 670014898176|
| c|1382979469312|
+---+-------------+
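To see how large each detected community is, a quick follow-up with a plain DataFrame aggregation (output omitted):
>>> # count vertices per community label
>>> result.groupBy("label").count().show()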
Example: Shortest path
Computes shortest paths from each vertex to the given set of landmark vertices, where landmarks are specified by vertex ID.
>>> results = g.shortestPaths(landmarks=["a", "d"])
>>> results.select("id", "distances").show()
+---+----------------+
| id| distances|
+---+----------------+
| b| []|
| e|[d -> 1, a -> 2]|
| a| [a -> 0]|
| f| []|
| d|[d -> 0, a -> 1]|
| c| []|
+---+----------------+
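Example: Triangle count
Triangle count is listed among the algorithms above but not demonstrated; a minimal sketch on the friends example graph (output omitted; the result has one row per vertex with a "count" column):
>>> results = g.triangleCount()
>>> results.select("id", "count").show()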
Example: saving/loading a GraphFrame
The vertices and edges DataFrames support saving and loading to and from the same set of data sources as any Spark DataFrame (e.g. Parquet).
>>> g.edges.write.parquet("hdfs:///user/edges")
>>> g.vertices.write.parquet("hdfs:///user/vertices")
>>> sameV = sqlContext.read.parquet("hdfs:///user/vertices")
>>> sameE = sqlContext.read.parquet("hdfs:///user/edges")
>>> sameG = GraphFrame(sameV, sameE)
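If the target paths already exist, a variant using overwrite mode (a standard Spark DataFrameWriter option) replaces the old data:
>>> g.edges.write.mode("overwrite").parquet("hdfs:///user/edges")
>>> g.vertices.write.mode("overwrite").parquet("hdfs:///user/vertices")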
Example: message passing (aggregateMessages)
GraphFrames provides primitives for developing graph algorithms. The two key components are:
1. aggregateMessages: Send messages between vertices, and aggregate messages for each vertex. GraphFrames provides a native aggregateMessages method implemented using DataFrame operations. This may be used analogously to the GraphX API.
2. joins: Join message aggregates with the original graph. GraphFrames rely on DataFrame joins, which provide the full functionality of GraphX joins (a join sketch follows after the aggregateMessages example below).
>>> from pyspark.sql.functions import sum as sqlsum
>>> from graphframes.lib import AggregateMessages as AM
# For each user, sum the ages of the adjacent users.
>>> msgToSrc = AM.dst["age"]
>>> msgToDst = AM.src["age"]
>>> agg = g.aggregateMessages(
sqlsum(AM.msg).alias("summedAges"),
sendToSrc=msgToSrc,
sendToDst=msgToDst)
>>> agg.show()
+---+----------+
| id|summedAges|
+---+----------+
| f| 62|
| e| 65|
| d| 66|
| c| 108|
| b| 94|
| a| 65|
+---+----------+
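The second primitive, joins, attaches the aggregated messages back onto the original vertex attributes with an ordinary DataFrame join; a minimal sketch using agg and g from above (output omitted):
>>> # a left join keeps vertices that received no messages (summedAges will be null)
>>> enriched = g.vertices.join(agg, on="id", how="left")
>>> enriched.show()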
To invoke Jupyter with pyspark:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" pyspark --py-files graphframes-0.8.2-spark3.1-s_2.12.jar --jars graphframes-0.8.2-spark3.1-s_2.12.jar
Or, if using jupyter-lab:
PYSPARK_DRIVER_PYTHON=jupyter-lab PYSPARK_DRIVER_PYTHON_OPTS="--no-browser" pyspark --py-files graphframes-0.8.2-spark3.1-s_2.12.jar --jars graphframes-0.8.2-spark3.1-s_2.12.jar
Download the example notebook:
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/refs/heads/master/spark/graphframe.ipynb
You can tunnel the Jupyter port (e.g. with ssh port forwarding) and then open the notebook in your web browser.
References
http://pysparktutorial.blogspot.com/2017/10/graphframes-pyspark.html