GraphX in Spark
###What is a graph?
- A powerful way to represent information
- Not a line chart, but a network of interconnected entities
- Graphs can represent both tangible & abstract things.
- Example: in a social network, each person is connected to others by some relationship, such as friendship.
- Imagine each person in the social network as a vertex; in GraphX, the vertices are stored together in an RDD.
- DAG - Directed Acyclic Graph
- Nodes/Vertices - The things in the graph, typically people or places
- Edges - The lines that connect the nodes/vertices
- Weights - A strength assigned to an edge
- Directed - Edges have a direction
- Undirected - Edges carry no direction info, e.g. friendship
- Cyclic - Following edges from a vertex can lead back to that vertex
- Acyclic - Following edges never leads back to the starting point
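To make the terms above concrete, here is a minimal sketch in plain Scala (no Spark needed). The `WeightedEdge` type and the `hasCycle` helper are hypothetical names invented for this illustration; they are not part of GraphX.

```scala
import scala.collection.mutable

// Hypothetical type for illustration: a directed edge with a weight.
final case class WeightedEdge(src: Long, dst: Long, weight: Double)

// Returns true if following directed edges from some vertex leads back to it.
def hasCycle(edges: Seq[WeightedEdge]): Boolean = {
  // Adjacency list: source vertex -> destination vertices.
  val out = edges.groupBy(_.src).map { case (s, es) => s -> es.map(_.dst) }
  val vertices = edges.flatMap(e => Seq(e.src, e.dst)).distinct
  val onPath = mutable.Set.empty[Long] // vertices on the current DFS path
  val done = mutable.Set.empty[Long]   // vertices already fully explored

  def visit(v: Long): Boolean =
    if (onPath(v)) true        // we came back to a vertex on the path: cycle
    else if (done(v)) false    // already explored, no cycle through here
    else {
      onPath += v
      val found = out.getOrElse(v, Seq.empty).exists(visit)
      onPath -= v
      done += v
      found
    }

  vertices.exists(visit)
}

// Two paths lead to vertex 3, yet the graph is still acyclic.
val dag = Seq(WeightedEdge(1L, 2L, 0.9), WeightedEdge(2L, 3L, 0.4), WeightedEdge(1L, 3L, 0.1))
// Adding an edge from 3 back to 1 creates a cycle.
val cyclic = dag :+ WeightedEdge(3L, 1L, 0.7)

println(hasCycle(dag))    // false
println(hasCycle(cyclic)) // true
```

Note that having several paths to the same vertex does not make a graph cyclic; only a path that returns to its starting vertex does.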
###Basics of GraphX
- Built around graph theory
- Provides a Spark API for graphs such as web graphs & social networks
- Provides a Spark API for graph-parallel computation such as PageRank & recommendation
- GraphX extends the Spark RDD abstraction with the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex & edge.
- GraphX supports fundamental operators like subgraph, joinVertices, and mapReduceTriplets (replaced by aggregateMessages in newer Spark releases).
- The GraphX library also includes a collection of algorithms for graph analytics.
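The operators mentioned above could be exercised roughly as follows. This is a sketch that assumes a local Spark installation with GraphX on the classpath (for example, pasted into `spark-shell`); the people, relationship labels, and app name are made up for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-operators-sketch") // hypothetical app name
  .getOrCreate()
val sc = spark.sparkContext

// Illustrative data: three people and the relationships between them.
val people: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Ann"), (2L, "Bob"), (3L, "Cid")))
val relations: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"),
  Edge(2L, 3L, "colleague"),
  Edge(3L, 1L, "friend")))

val graph: Graph[String, String] = Graph(people, relations)

// subgraph: keep only the edges labelled "friend" (all vertices are kept).
val friends = graph.subgraph(epred = triplet => triplet.attr == "friend")

// aggregateMessages: send 1 along every edge and sum at the destination,
// which yields the in-degree of each vertex.
val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)

// joinVertices: merge the in-degree back into each vertex's name.
val labelled = graph.joinVertices(inDegrees) { (id, name, deg) => s"$name ($deg)" }

labelled.vertices.collect().foreach(println)
```

`aggregateMessages` is used here instead of `mapReduceTriplets` because the latter is no longer available in current Spark versions.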
###Data Parallel Computation
- Hadoop & Spark break the entire data set down into blocks, and the computation runs in parallel on all the blocks.
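As a rough illustration of the data-parallel idea, the sketch below splits a range of numbers into four partitions and sums each partition independently before combining the partial results. It assumes a local Spark installation; the data and the partition count are arbitrary choices for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("data-parallel-sketch") // hypothetical app name
  .getOrCreate()
val sc = spark.sparkContext

// Break the data into 4 blocks (partitions).
val numbers = sc.parallelize(1 to 1000, numSlices = 4)

// Each partition is summed independently, in parallel.
val partialSums = numbers.mapPartitions(it => Iterator(it.sum)).collect()

// Combine the per-block results on the driver.
val total = partialSums.sum
println(total) // 500500
```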
###Graph Parallel Computation
- Applications like social networking have driven the development of numerous new graph-parallel systems.
Raw Data -> Create the initial graph (ETL) -> Slice the graph (create a subgraph) -> Compute on the nodes/vertices (e.g. compute PageRank) -> Analyze (e.g. use Hive to find top users) -> Repeat the stages from ETL
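The pipeline above might look something like this in GraphX. This is a sketch under several assumptions: the raw data is a made-up list of `src,dst` strings, the "slice" step simply drops self-loops, and a plain `sortBy` stands in for the Hive analysis step.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-pipeline-sketch") // hypothetical app name
  .getOrCreate()
val sc = spark.sparkContext

// Raw data: hypothetical "src,dst" lines.
val raw = Seq("1,2", "2,3", "3,1", "4,1", "4,4")

// ETL: parse the lines into edges and build the initial graph.
val edges = sc.parallelize(raw).map { line =>
  val Array(src, dst) = line.split(",")
  Edge(src.toLong, dst.toLong, 1)
}
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Slice: create a subgraph without self-loops.
val sliced = graph.subgraph(epred = t => t.srcId != t.dstId)

// Compute: run PageRank until the ranks change by less than the tolerance.
val ranks = sliced.pageRank(tol = 0.0001).vertices

// Analyze: take the three highest-ranked vertices (stand-in for a Hive query).
val top = ranks.sortBy(_._2, ascending = false).take(3)
top.foreach { case (id, rank) => println(s"vertex $id rank $rank") }
```

In a real workflow the result of the analysis step would typically feed back into a new ETL pass.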
**The vision of the GraphX project is to unify data-parallel & graph-parallel computation in a single API.**
- Directed multigraph - A directed graph that can have multiple parallel edges sharing the same source and destination vertices.
- Each vertex is keyed by a unique 64-bit long identifier (VertexId).
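A small sketch of both points, assuming a local Spark installation with GraphX available: two parallel edges connect the same pair of vertices, and the vertex keys are plain 64-bit `Long` values. The names and relationship labels are invented for the example.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-multigraph-sketch") // hypothetical app name
  .getOrCreate()
val sc = spark.sparkContext

// VertexId is an alias for Long, so the keys are ordinary 64-bit values.
val annId: VertexId = 1L
val bobId: VertexId = 2L

val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((annId, "Ann"), (bobId, "Bob")))

// Two parallel edges from Ann to Bob, plus one edge in the other direction.
val links: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(annId, bobId, "friend"),
  Edge(annId, bobId, "colleague"),
  Edge(bobId, annId, "follows")))

val graph = Graph(users, links)

// Both Ann -> Bob edges are kept, each with its own property.
val annToBob = graph.edges
  .filter(e => e.srcId == annId && e.dstId == bobId)
  .map(_.attr)
  .collect()

println(annToBob.sorted.mkString(", ")) // colleague, friend
```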