GraphX in Spark - awantik/spark GitHub Wiki

###What is a graph?

  • A graph is a powerful way to represent information.
  • Here a graph is not a line chart but a network of interconnected entities.
  • Graphs can represent both tangible and abstract things.
  • Example: in a social network, each person is connected to others by some relationship, such as friendship.
  • Imagine each person in the social network as an RDD.
  • DAG stands for Directed Acyclic Graph.

  • Nodes / Vertices - The things being modelled, such as people or places
  • Edges - The lines that connect the nodes/vertices
  • Weights - A value expressing the strength of an edge
  • Directed - The edges have a direction
  • Undirected - The edges carry no direction information, e.g. friendship
  • Cyclic - Following the edges can lead back to the starting vertex
  • Acyclic - Following the edges can never lead back to the starting vertex
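The terms above can be made concrete with a small sketch. It uses plain Python rather than GraphX, and the names (`has_cycle`, the `alice`/`bob`/`carol` data, the weights) are made up for illustration. A directed, weighted graph is stored as an adjacency list, and a depth-first search checks whether the graph is cyclic, i.e. whether some vertex can reach itself again.

```python
from typing import Dict, List, Tuple

# Directed, weighted graph as an adjacency list:
# vertex -> list of (neighbour, weight) pairs.
Graph = Dict[str, List[Tuple[str, float]]]


def has_cycle(graph: Graph) -> bool:
    """Return True if some vertex can reach itself by following directed edges."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {v: WHITE for v in graph}

    def visit(v: str) -> bool:
        colour[v] = GREY
        for neighbour, _weight in graph.get(v, []):
            state = colour.get(neighbour, WHITE)
            if state == GREY:
                return True  # back edge: we returned to a vertex on the current path
            if state == WHITE and visit(neighbour):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)


# Hypothetical "follows" relationships; the weights stand for interaction strength.
acyclic = {"alice": [("bob", 0.9)], "bob": [("carol", 0.4)], "carol": []}
cyclic = {"alice": [("bob", 0.9)], "bob": [("carol", 0.4)], "carol": [("alice", 0.1)]}

print(has_cycle(acyclic))
print(has_cycle(cyclic))
```

Adding the single edge from `carol` back to `alice` is what turns the first graph (a DAG) into a cyclic one.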

###Basics of GraphX

  • Built around graph theory
  • Provides a Spark API for graphs, e.g. web graphs and social networks
  • Provides a Spark API for graph-parallel computation, e.g. PageRank and recommendation
  • GraphX extends the Spark RDD abstraction with the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
  • GraphX supports fundamental operators like subgraph, joinVertices and mapReduceTriplets (later Spark releases replace mapReduceTriplets with aggregateMessages).
  • The GraphX library also includes a collection of algorithms for graph analytics.
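To give a feel for what a `subgraph` operator does, here is a minimal sketch in plain Python. It is not the GraphX API: the function signature, the `vpred`/`epred` parameter names and the sample data are assumptions chosen for illustration. The idea is that a vertex predicate selects vertices, and an edge survives only if it passes the edge predicate and both of its endpoints survive.

```python
from typing import Callable, Dict, List, Tuple

Vertices = Dict[int, dict]           # vertex id -> property dict
Edges = List[Tuple[int, int, str]]   # (source id, destination id, edge property)


def subgraph(
    vertices: Vertices,
    edges: Edges,
    vpred: Callable[[int, dict], bool],
    epred: Callable[[int, int, str], bool] = lambda src, dst, attr: True,
) -> Tuple[Vertices, Edges]:
    """Keep vertices passing vpred, and edges passing epred whose endpoints both survive."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [
        (src, dst, attr)
        for src, dst, attr in edges
        if src in kept_vertices and dst in kept_vertices and epred(src, dst, attr)
    ]
    return kept_vertices, kept_edges


# Hypothetical social-network data.
vertices = {
    1: {"name": "alice", "age": 34},
    2: {"name": "bob", "age": 17},
    3: {"name": "carol", "age": 45},
}
edges = [(1, 2, "friend"), (1, 3, "colleague"), (3, 1, "friend")]

# Slice the graph down to adults; edges touching a removed vertex disappear too.
adult_vertices, adult_edges = subgraph(vertices, edges, lambda vid, attr: attr["age"] >= 18)
print(sorted(adult_vertices))
print(adult_edges)
```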

###Data Parallel

Systems such as Hadoop and Spark break the entire data set into blocks and run the computation on all of the blocks in parallel.

###Graph Parallel Computation

Applications such as social networking have driven the development of numerous new graph-parallel systems.

Raw Data -> Creation of the initial graph (ETL) -> Slice the graph (creation of a subgraph) -> Compute on the nodes/vertices (e.g. compute PageRank) -> Analyze (e.g. using Hive, find the top users) -> Repeat the stages from ETL
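The stages of this pipeline can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the raw records, the `follows`/`blocks` edge kinds and the `pagerank` helper are hypothetical, and the ranking is a textbook power-iteration PageRank, not GraphX's own implementation (which uses a different normalisation). In GraphX the same stages would run on distributed RDDs.

```python
from typing import Dict, List, Tuple


def pagerank(
    edges: List[Tuple[str, str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Raw data: hypothetical "source,destination,kind" records.
raw = [
    "alice,bob,follows",
    "bob,carol,follows",
    "carol,alice,follows",
    "dave,alice,follows",
    "dave,carol,blocks",
]

# ETL: parse the raw records into the initial graph (an edge list).
all_edges = [tuple(line.split(",")) for line in raw]

# Slice: keep only the "follows" subgraph.
follows = [(src, dst) for src, dst, kind in all_edges if kind == "follows"]

# Compute: rank the vertices.
ranks = pagerank(follows)

# Analyze: find the top users.
top_users = sorted(ranks, key=ranks.get, reverse=True)[:2]
print(top_users)
```

The point of unifying both styles in one system is that the parse and analyze steps are ordinary data-parallel operations, while the ranking step is graph-parallel, and no data has to be moved between separate tools.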

**The vision of the GraphX project is to unify data-parallel and graph-parallel computation in a single API.**

###GraphX API

###The Property Graph

  • Directed multigraph - A directed graph that may contain multiple parallel edges sharing the same source and destination vertex.
  • Each vertex is keyed by a unique 64-bit long identifier
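The two properties above can be illustrated with a minimal sketch in plain Python. The `PropertyGraph` and `Edge` classes and their methods are hypothetical, written only for this example, and are not GraphX types. Edges are stored in a list so that parallel edges between the same pair of vertices are kept, and vertex ids are checked against the range of a signed 64-bit integer.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_VERTEX_ID = -(2**63)     # smallest signed 64-bit integer
MAX_VERTEX_ID = 2**63 - 1    # largest signed 64-bit integer


@dataclass(frozen=True)
class Edge:
    src: int    # source vertex id
    dst: int    # destination vertex id
    attr: str   # edge property


class PropertyGraph:
    """Directed multigraph with a property on every vertex and edge (illustrative only)."""

    def __init__(self) -> None:
        self.vertices: Dict[int, dict] = {}   # unique id -> vertex property
        self.edges: List[Edge] = []           # a list, so parallel edges are kept

    def add_vertex(self, vid: int, attr: dict) -> None:
        if not MIN_VERTEX_ID <= vid <= MAX_VERTEX_ID:
            raise ValueError("vertex id must fit in a signed 64-bit integer")
        self.vertices[vid] = attr

    def add_edge(self, src: int, dst: int, attr: str) -> None:
        self.edges.append(Edge(src, dst, attr))

    def edges_between(self, src: int, dst: int) -> List[Edge]:
        return [e for e in self.edges if e.src == src and e.dst == dst]


g = PropertyGraph()
g.add_vertex(1, {"name": "alice"})
g.add_vertex(2, {"name": "bob"})
# Two parallel edges from vertex 1 to vertex 2, each with its own property.
g.add_edge(1, 2, "friend")
g.add_edge(1, 2, "colleague")

print([e.attr for e in g.edges_between(1, 2)])
```

Parallel edges are what let one pair of vertices carry several distinct relationships at once.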