ICP_Module2_Assignment_5 - MadhuriSarode/BDP GitHub Wiki

Madhuri Sarode : 24

GraphFrames and GraphX: Distributed Collection of Data


Installation: The GraphFrames jar is installed and added to the project as an external jar.


GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It provides high-level APIs in Java, Python, and Scala. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries. You can create GraphFrames from vertex and edge DataFrames.

  • Vertex DataFrame: A vertex DataFrame should contain a special column named id, which specifies a unique ID for each vertex in the graph.
  • Edge DataFrame: An edge DataFrame should contain two special columns: src (source vertex ID of the edge) and dst (destination vertex ID of the edge).

Both DataFrames can have arbitrary other columns, which can represent vertex and edge attributes.
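The shape of these two tables can be sketched in plain Python. This is only an illustration of the id / src / dst convention, not GraphFrames code: the station names, the duration column, and the check_graph_tables helper are all made up for the example. With GraphFrames itself, the same rows would live in Spark DataFrames passed to GraphFrame(vertices, edges).

```python
# Plain-Python stand-ins for the vertex and edge tables (hypothetical data).
vertices = [
    {"id": "A", "name": "Station A"},
    {"id": "B", "name": "Station B"},
    {"id": "C", "name": "Station C"},
]
edges = [
    {"src": "A", "dst": "B", "duration": 300},
    {"src": "B", "dst": "C", "duration": 450},
]


def check_graph_tables(vertices, edges):
    """Return a list of problems; an empty list means the tables are consistent."""
    problems = []
    ids = [v.get("id") for v in vertices]
    if None in ids:
        problems.append("a vertex row is missing the 'id' column")
    if len(ids) != len(set(ids)):
        problems.append("vertex ids are not unique")
    known = set(ids)
    for e in edges:
        # Every edge needs both special columns, pointing at known vertices.
        for col in ("src", "dst"):
            if col not in e:
                problems.append(f"an edge row is missing the '{col}' column")
            elif e[col] not in known:
                problems.append(f"edge {col}={e[col]!r} is not a known vertex id")
    return problems


print(check_graph_tables(vertices, edges))
```

Extra keys such as name and duration play the role of vertex and edge attributes.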

GraphFrames provide simple graph queries, such as node degree.
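As a rough illustration of what a degree query computes, here is a plain-Python sketch over a small made-up edge list. GraphFrames exposes the equivalent results as the inDegrees, outDegrees, and degrees DataFrames of a graph; the degrees helper and the sample edges below are assumptions for the example only.

```python
from collections import Counter

# Hypothetical directed edges as (src, dst) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]


def degrees(edges):
    """Return (in-degree, out-degree, total degree) counters keyed by vertex id."""
    in_deg = Counter(dst for _, dst in edges)   # edges arriving at a vertex
    out_deg = Counter(src for src, _ in edges)  # edges leaving a vertex
    total = in_deg + out_deg                    # degree = in-degree + out-degree
    return in_deg, out_deg, total


in_deg, out_deg, total = degrees(edges)
print(dict(in_deg), dict(out_deg), dict(total))
```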

Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to make powerful queries directly on the vertex and edge DataFrames. Those DataFrames are available as vertices and edges fields in the GraphFrame.

Build more complex relationships involving edges and vertices using motifs. For example, suppose you want to identify a chain of 4 vertices with some property defined by a sequence of functions. That is, among chains of 4 vertices a->b->c->d, identify the subset of chains matching this complex filter:

  • Initialize state on path.
  • Update state based on vertex a.
  • Update state based on vertex b.
  • Etc. for c and d. If final state matches some condition, then the filter accepts the chain.
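The steps above can be sketched in plain Python. This is a simplified stand-in for a motif query, not the GraphFrames API: it enumerates every chain a->b->c->d by hand and folds a state over the four vertices. The ages, the "older than 30" rule, and the threshold of 3 are invented for the example.

```python
from functools import reduce

# Hypothetical vertex attribute (id -> age) and directed edges.
ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("c", "e")]


def chains_of_four(edges):
    """Yield every path of four vertices connected by three consecutive edges."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    for a in out:
        for b in out.get(a, []):
            for c in out.get(b, []):
                for d in out.get(c, []):
                    yield (a, b, c, d)


def update_state(count, vertex):
    """State update: count the vertices on the path that are older than 30."""
    return count + 1 if ages[vertex] > 30 else count


def accepts(chain):
    """Initialize the state to 0, update it per vertex, then test the final state."""
    return reduce(update_state, chain, 0) >= 3


matching = [chain for chain in chains_of_four(edges) if accepts(chain)]
print(matching)
```

Only chains whose final state meets the condition survive the filter; changing update_state or the threshold changes which chains are kept.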

Assignment - Part 1

1) Import the dataset as a CSV file and create DataFrames directly on import, then create a graph from those DataFrames. The data set is imported from the files 201508_trip_data.csv and 201508_station_data.csv.
2) Concatenate chunks into a list and convert to a DataFrame: here the latitude and longitude columns are concatenated and displayed as a single value.
3) Remove duplicates: only the unique (distinct) values of the column are kept and displayed. The display() function is used to show them.
4) Name columns: the columns are renamed, replacing their old names.
5) Output DataFrame: the DataFrames of the datasets can be visualized with their column headers and data, using the new column names.
6) Show some vertices: the graph's vertices and edges are recorded in a CSV file.
7) Show some vertices
9) Vertex in-degree
10) Vertex out-degree
11) Apply motif finding.
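The data preparation steps (import, concatenate latitude and longitude, remove duplicates, rename columns) can be sketched with the standard library alone. This is not the Spark code used in the assignment. The inline sample rows stand in for the two CSV files, and the column names (name, lat, long, Start Station, End Station) as well as the choice to rename them to id, src, and dst are assumptions made for illustration.

```python
import csv
import io

# Tiny inline samples standing in for the station and trip files (made-up rows).
STATION_CSV = """station_id,name,lat,long
2,Station A,37.3297,-121.9018
2,Station A,37.3297,-121.9018
3,Station B,37.3307,-121.8889
"""
TRIP_CSV = """Trip ID,Start Station,End Station
1,Station A,Station B
2,Station B,Station A
3,Station A,Station B
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_vertices(station_rows):
    """Drop duplicate stations, rename 'name' to 'id', and merge lat/long."""
    seen = set()
    vertices = []
    for row in station_rows:
        if row["name"] in seen:
            continue  # keep only the first occurrence of each station
        seen.add(row["name"])
        vertices.append({"id": row["name"],
                         "lat_long": row["lat"] + "," + row["long"]})
    return vertices


def build_edges(trip_rows):
    """Rename the start/end station columns to the src/dst edge columns."""
    return [{"src": r["Start Station"], "dst": r["End Station"]}
            for r in trip_rows]


vertices = build_vertices(load_rows(STATION_CSV))
edges = build_edges(load_rows(TRIP_CSV))
print(vertices)
print(edges)
```

To read the real files instead, load_rows could be given the text of each file; the header names would need to match whatever the files actually contain.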

Assignment - Bonus Question

1. Vertex degree
2. What are the most common destinations in the dataset from location to location?
3. What is the station with the highest ratio of in-degree to out-degree? That is, which station acts as almost a pure trip sink: a station where trips end but rarely start?
4. Save the generated graphs to a file.
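Questions 2 and 3 can be sketched in plain Python over a made-up list of trips. This is an illustration of the counting logic only, not the Spark queries used in the assignment; the trip data and helper names are hypothetical. Stations that lack either an in-degree or an out-degree are left out of the ratio, which mirrors what an inner join of the two degree tables would do.

```python
from collections import Counter

# Hypothetical trips as (start station, end station) pairs.
trips = [("A", "B")] * 3 + [("B", "C")] * 2 + [("C", "B"), ("A", "C")]


def top_routes(trips, n):
    """Most frequent (src, dst) pairs, most common first."""
    return Counter(trips).most_common(n)


def degree_ratios(trips):
    """In-degree divided by out-degree for stations that have both."""
    in_deg = Counter(dst for _, dst in trips)
    out_deg = Counter(src for src, _ in trips)
    shared = in_deg.keys() & out_deg.keys()
    return {station: in_deg[station] / out_deg[station] for station in shared}


ratios = degree_ratios(trips)
# The station with the largest ratio behaves most like a trip sink.
sink = max(ratios, key=ratios.get)
print(top_routes(trips, 1), ratios, sink)
```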