GraphX Computation - Jeffrey511/Jeffrey-Yu GitHub Wiki
| # | Case | Data Source | Link | File Name | Renamed File |
|---|------|-------------|------|-----------|--------------|
| 1 | PageRank | YouTube | https://snap.stanford.edu/data/com-Youtube.html | com-youtube.ungraph.txt | page-rank-yt-data.txt |
| 2 | Connected Components | LiveJournal | https://snap.stanford.edu/data/com-LiveJournal.html | com-lj.ungraph.txt | connected-components-lj-data.txt |
| 3 | Triangle Count | Facebook | https://snap.stanford.edu/data/egonets-Facebook.html | facebook_combined.txt | triangle-count-fb-data.txt |
First, we run PageRank on the YouTube online social network data. This dataset includes ground-truth community information: essentially user-defined groups that other users can join.
```scala
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import java.util.Calendar

// Load the edge list as a graph
val graph = GraphLoader.edgeListFile(sc, "file:///root/page-rank-yt-data.txt")

// Basic graph statistics
val vertexCount = graph.numVertices
val vertices = graph.vertices
vertices.count()
val edgeCount = graph.numEdges
val edges = graph.edges
edges.count()

// Triplets combine each edge with its source and destination vertex attributes
val triplets = graph.triplets
triplets.count()
triplets.take(5)

// Degree distributions
val inDegrees = graph.inDegrees
inDegrees.collect()
val outDegrees = graph.outDegrees
outDegrees.collect()
val degrees = graph.degrees
degrees.collect()

// PageRank with a fixed number of iterations
val staticPageRank = graph.staticPageRank(10)
staticPageRank.vertices.collect()

// Dynamic PageRank, iterating until ranks change by less than the tolerance;
// the two timestamps bracket the run so its duration can be measured
Calendar.getInstance().getTime()
val pageRank = graph.pageRank(0.001).vertices
Calendar.getInstance().getTime()

// Top 5 vertices by rank value (the default tuple ordering would sort by vertex id)
println(pageRank.top(5)(Ordering.by((v: (VertexId, Double)) => v._2)).mkString("\n"))
```
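To make concrete what `pageRank` computes, here is a minimal pure-Scala sketch of the same update rule on a tiny in-memory edge list (an illustrative sketch, not the GraphX implementation): each vertex's rank is `0.15 + 0.85 *` the sum of `rank(u) / outDegree(u)` over its in-neighbours `u`, iterated a fixed number of times, just like `staticPageRank`.

```scala
object PageRankSketch {
  // edges: directed (src, dst) pairs; iters: number of PageRank iterations
  def pageRank(edges: Seq[(Long, Long)], iters: Int): Map[Long, Double] = {
    val verts = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    // out-degree of every source vertex
    val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // all ranks start at 1.0, as in GraphX's unnormalized formulation
    var ranks = verts.map(_ -> 1.0).toMap
    for (_ <- 1 to iters) {
      // each vertex receives rank/outDegree from every in-neighbour
      val contribs = edges.groupBy(_._2).map { case (v, es) =>
        v -> es.map { case (s, _) => ranks(s) / outDeg(s) }.sum
      }
      ranks = verts.map(v => v -> (0.15 + 0.85 * contribs.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

On a 3-cycle every vertex has one in-neighbour with out-degree 1, so the fixed point is `r = 0.15 + 0.85 * r`, i.e. rank 1.0 everywhere.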
Next, let's look at the code for running Connected Components on the LiveJournal social network data. This dataset covers users who registered on the site and have individual and group blog posts; the site also allows users to declare other users as friends.
```scala
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import java.util.Calendar

val graph = GraphLoader.edgeListFile(sc, "data/connected-components-lj-data.txt")

// The two timestamps bracket the run so its duration can be measured
Calendar.getInstance().getTime()
val cc = graph.connectedComponents()
Calendar.getInstance().getTime()

// Each vertex is labelled with the lowest vertex id in its component
cc.vertices.collect()
println(cc.vertices.take(5).mkString("\n"))

// Strongly connected components require an iteration limit
val scc = graph.stronglyConnectedComponents(numIter = 10)
scc.vertices.collect()
```
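The label propagation that `connectedComponents` performs can be sketched in plain Scala on a small edge list (an illustrative sketch of the idea, not the GraphX Pregel implementation): every vertex starts labelled with its own id and repeatedly adopts the minimum label among itself and its neighbours until nothing changes.

```scala
object CCSketch {
  // edges: undirected (a, b) pairs; returns vertex -> component label,
  // where the label is the smallest vertex id in the component
  def connectedComponents(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val undirected = edges ++ edges.map { case (a, b) => (b, a) }
    val nbrs = undirected.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
    var labels = nbrs.keys.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for (v <- nbrs.keys) {
        // take the minimum of this vertex's label and its neighbours' labels
        val m = (labels(v) +: nbrs(v).map(labels)).min
        if (m < labels(v)) { labels = labels.updated(v, m); changed = true }
      }
    }
    labels
  }
}
```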
Finally, here is the Spark program, again in Scala, that computes Triangle Counting on the Facebook social circles data. This dataset consists of friend lists from Facebook, with information including user profiles, circles, and ego networks.
```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// triangleCount requires edges in canonical orientation (srcId < dstId)
// and a partitioned graph
val graph = GraphLoader.edgeListFile(sc, "file:///triangle-count-fb-data.txt",
  canonicalOrientation = true).partitionBy(PartitionStrategy.RandomVertexCut)

println("Number of vertices : " + graph.vertices.count())
println("Number of edges : " + graph.edges.count())

// Note: on a cluster this prints to the executor logs, not the driver console
graph.vertices.foreach(v => println(v))

// Number of triangles passing through each vertex
val tc = graph.triangleCount()
tc.vertices.collect
println("tc: " + tc.vertices.take(5).mkString("\n"))

println("Triangle counts: " + graph.connectedComponents.triangleCount().vertices.top(5).mkString("\n"))

// Each triangle is counted once at each of its three vertices,
// so the total number of triangles in the graph is sum / 3
val sum = tc.vertices.map(a => a._2).reduce((a, b) => a + b)
```
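The per-vertex counts that `triangleCount` produces can be sketched in plain Scala on a small edge list (an illustrative sketch, not the GraphX implementation): for each vertex `v` and each neighbour `u`, the size of the intersection of their neighbour sets counts triangles through the edge `(v, u)`, and each triangle through `v` is seen once per incident edge, so the per-edge sum is halved. Because each triangle is then counted once at each of its three vertices, summing over all vertices and dividing by 3 gives the total triangle count.

```scala
object TriangleSketch {
  // edges: undirected (a, b) pairs; returns vertex -> number of triangles
  // passing through that vertex
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // deduplicate into canonical undirected edges, dropping self-loops
    val es = edges.collect {
      case (a, b) if a != b => (math.min(a, b), math.max(a, b))
    }.toSet
    val nbrs: Map[Long, Set[Long]] =
      es.toSeq.flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }
    nbrs.map { case (v, nv) =>
      // each edge (v, u) contributes |N(v) ∩ N(u)| triangles; every triangle
      // through v is found via both of its incident edges, hence / 2
      v -> nv.toSeq.map(u => (nv intersect nbrs(u)).size).sum / 2
    }
  }
}
```

On a single triangle {1, 2, 3} with a pendant edge (3, 4), vertices 1, 2, and 3 each count one triangle, vertex 4 counts none, and the summed counts divided by 3 give the one triangle in the graph.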