Big_Data_Programming_ICP_5_Module2 - kusamdinesh/Big-Data-and-Hadoop GitHub Wiki

Procedure :

  1. Import the dataset

Import the dataset as a CSV file, creating DataFrames directly on import, and then build a graph from those DataFrames.
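As a rough illustration of turning tabular trip data into a graph, the sketch below uses plain Python rather than Spark DataFrames. The CSV text, the column names `start_station` and `end_station`, and the station names are all hypothetical.

```python
import csv
import io

# Hypothetical CSV content standing in for the imported trip file.
csv_text = """trip_id,start_station,end_station
1,Station A,Station B
2,Station B,Station A
3,Station A,Station C
"""

# Read each CSV line into a dict keyed by the header row.
trips = list(csv.DictReader(io.StringIO(csv_text)))

# Vertices: every station that appears as a start or an end of a trip.
vertices = sorted(
    {t["start_station"] for t in trips} | {t["end_station"] for t in trips}
)

# Edges: one (source, destination) pair per trip.
edges = [(t["start_station"], t["end_station"]) for t in trips]
```

The same split into a vertex table and an edge table is what a graph library needs as input, whatever the storage format.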

Input :

  2. Concatenate chunks into list & convert to DataFrame

I concatenated the two columns 'lat' and 'lan' from the Stations DataFrame.
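The idea of joining two columns into one can be sketched in plain Python as follows. The station rows, the `location` column name, and the comma separator are assumptions made for illustration; on Spark DataFrames the comparable column-level operation is usually `concat` from `pyspark.sql.functions`, though the page does not show which call was used.

```python
# Hypothetical station rows; the coordinates are made up.
stations = [
    {"station_id": 1, "lat": 37.33, "lan": -121.9},
    {"station_id": 2, "lat": 37.78, "lan": -122.41},
]


def add_location(row, sep=","):
    """Return a copy of the row with 'lat' and 'lan' joined into one string."""
    combined = f"{row['lat']}{sep}{row['lan']}"
    return {**row, "location": combined}


# Apply the concatenation to every row, leaving the originals untouched.
stations_with_location = [add_location(r) for r in stations]
```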

Input :

Output :

  3. Remove duplicates

The 'distinct' command is used to remove any duplicate rows from the DataFrames.
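What a distinct operation does to a table can be shown with a small plain-Python sketch. The rows below are hypothetical, and this version happens to keep the first occurrence of each row in its original position, which a distributed DataFrame does not guarantee.

```python
# Hypothetical (station_id, name) rows containing one exact duplicate.
rows = [
    (1, "Station A"),
    (2, "Station B"),
    (1, "Station A"),
    (3, "Station C"),
]

# dict.fromkeys keeps one entry per unique row, in first-seen order.
distinct_rows = list(dict.fromkeys(rows))

# Comparing the counts before and after shows how many duplicates were dropped.
duplicates_removed = len(rows) - len(distinct_rows)
```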

Input :

Output :

  4. Name Columns, Output DataFrame, Create vertices

  5. Show some vertices

Input :

Output :

  6. Show some edges

Input :

Output :

  7. Vertex in-Degree

Input :

Output :

  8. Vertex out-Degree

Input :

Output :

  9. Apply motif finding

Motif finding searches the graph for sub-graphs that match a given structural pattern. The pattern considered here is 'a to b' and 'b to a', that is, pairs of vertices connected by edges in both directions.
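To make the 'a to b' and 'b to a' pattern concrete, here is a minimal plain-Python sketch of the matching logic. It is not the GraphFrames implementation; in GraphFrames this kind of pattern is written as a motif string such as `"(a)-[e]->(b); (b)-[e2]->(a)"` passed to `find`. The edge list is hypothetical.

```python
# Hypothetical directed edges as (source, destination) pairs.
edges = [
    ("A", "B"),
    ("B", "A"),
    ("A", "C"),
    ("C", "D"),
    ("D", "C"),
    ("B", "C"),
]

# A set gives constant-time lookups for the reverse edge.
edge_set = set(edges)

# Keep every edge a -> b whose reverse edge b -> a also exists.
round_trips = sorted((a, b) for (a, b) in edge_set if (b, a) in edge_set)
```

In this sketch each matching pair appears twice, once per direction, because both `("A", "B")` and `("B", "A")` satisfy the pattern.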

Input :

Output :

Bonus:

Vertex degree

Input :

Output :

  1. What are the most common destinations in the dataset from location to location?

This is done using the groupby function, with the limit set to 10 in order to display the 10 most common destinations.
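The group-count-limit idea can be sketched in plain Python with `collections.Counter`, which stands in for the DataFrame group-by here. The trip pairs are hypothetical.

```python
from collections import Counter

# Hypothetical trips as (start location, end location) pairs.
trips = [
    ("A", "B"),
    ("A", "B"),
    ("C", "B"),
    ("A", "B"),
    ("B", "C"),
    ("C", "B"),
]

# Group identical (start, end) pairs and count how often each occurs.
route_counts = Counter(trips)

# Keep at most the 10 most frequent routes, highest count first.
top_routes = route_counts.most_common(10)
```

With fewer than 10 distinct routes, `most_common(10)` simply returns all of them in descending order of count.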

Input :

Output :

  2. What is the station with the highest ratio of in-degrees to out-degrees?

In other words, which station acts as an almost pure trip sink: a station where trips often end but rarely start? This is implemented using a join operation.
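A minimal plain-Python sketch of the in-degree to out-degree ratio follows, with the join expressed as an intersection of station keys. The trips are hypothetical, and the inner-join behaviour is an assumption: a station with no outgoing trips at all would be dropped here rather than reported as a perfect sink.

```python
from collections import Counter

# Hypothetical trips as (start station, end station) pairs.
trips = [
    ("A", "C"),
    ("B", "C"),
    ("A", "C"),
    ("C", "A"),
    ("A", "B"),
    ("B", "A"),
]

# Out-degree counts trips that start at a station, in-degree trips that end there.
out_degree = Counter(src for src, _ in trips)
in_degree = Counter(dst for _, dst in trips)

# Inner join on station: only stations present in both tables get a ratio.
ratios = {
    station: in_degree[station] / out_degree[station]
    for station in in_degree.keys() & out_degree.keys()
}

# Highest ratio first: the top entry is the station closest to a pure sink.
ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
```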

Input :

Output :

  3. Save graphs generated to a file

The vertices and the edges of the generated graph are stored in separate vertices and edges folders.
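The folder layout described above can be sketched with the Python standard library. This is not the Spark writer used on the page; the helper `save_table`, the file name `part-0.csv`, the temporary output directory, and the sample rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_table(rows, header, folder, filename="part-0.csv"):
    """Write rows with a header line to folder/filename and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Hypothetical graph data.
vertices = [("A", "Station A"), ("B", "Station B")]
edges = [("A", "B"), ("B", "A")]

# A temporary directory keeps the sketch from touching real project folders.
out_root = Path(tempfile.mkdtemp())
vertices_path = save_table(vertices, ["id", "name"], out_root / "vertices")
edges_path = save_table(edges, ["src", "dst"], out_root / "edges")
```

Keeping the two tables in separate folders means the graph can later be rebuilt by loading each folder back into its own table.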

Input :

Output :

References :

https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html#motif-finding