ICP 12 - manaswinivedula/Big-Data-Programming GitHub Wiki

Spark GraphFrames and GraphX

Initial installation of libraries

  • First, the GraphFrames package was downloaded using spark-shell, as shown below

  • Then the libraries were added to the build.sbt file
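A minimal sketch of what these two setup steps might look like. The version numbers and the resolver URL are assumptions and should be matched to the installed Spark and Scala versions. In the shell, the package is typically pulled in with `spark-shell --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12`, and a build.sbt along these lines would declare the same dependency:

```scala
// build.sbt (versions below are assumptions, not taken from the original page)
scalaVersion := "2.12.10"

// GraphFrames is published on the Spark Packages repository, not Maven Central.
resolvers += "Spark Packages" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.0.1",
  "org.apache.spark" %% "spark-sql"    % "3.0.1",
  "org.apache.spark" %% "spark-graphx" % "3.0.1",
  "graphframes"      %  "graphframes"  % "0.8.1-spark3.0-s_2.12"
)
```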

Task 1

1. Import the dataset as a CSV file, create data frames directly on import, and create a graph from those data frames.

  • The following is the source code

  • The following is the output of the data frames.
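As a rough sketch of the import step, the CSV files can be read straight into data frames with the header row used for column names. The file paths `data/station_data.csv` and `data/trip_data.csv` and the object name `LoadData` are hypothetical placeholders, since the original page does not name the dataset files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadData {
  // Reads a CSV file with a header row and lets Spark infer the column types.
  def loadCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-GraphFrames")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical file names; substitute the real dataset paths.
    val stations = loadCsv(spark, "data/station_data.csv")
    val trips = loadCsv(spark, "data/trip_data.csv")

    stations.show(5)
    trips.show(5)
    spark.stop()
  }
}
```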

2. Concatenate the latitude and longitude chunks into a list and convert them into a single data frame.

  • The following is the source code

  • The following is the output of the concatenated data frames
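One possible reading of this step is to combine the two coordinate columns into a single column. The sketch below does that with `concat_ws`; the column names `lat` and `long`, the new column `lat_long`, and the object name `LatLong` are all assumptions, and the original code may have combined the values differently.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

object LatLong {
  // Adds a "lat_long" string column of the form "<lat>,<long>".
  // The input column names "lat" and "long" are assumed, not taken from the source.
  def withLatLong(stations: DataFrame): DataFrame =
    stations.withColumn(
      "lat_long",
      concat_ws(",", col("lat").cast("string"), col("long").cast("string"))
    )
}
```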

3. Remove duplicates from the data frames.

  • The following is the source code

4. Rename the columns of the trip data frame.

  • The following is the source code

5. Display the output data frame.

  • The following is the source code

  • The following is the output after removing duplicates and renaming the columns.
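Steps 3 to 5 can be sketched together as one small transformation. The source column names `Start Station` and `End Station` are assumptions about the trip data, and renaming them to `src` and `dst` is a guess motivated by the column names GraphFrames requires on an edge data frame; the object name `CleanTrips` is hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object CleanTrips {
  // Drops exact duplicate rows, then renames the endpoint columns to the
  // "src"/"dst" names that GraphFrames expects on an edge DataFrame.
  // "Start Station" and "End Station" are assumed column names.
  def clean(trips: DataFrame): DataFrame =
    trips
      .dropDuplicates()
      .withColumnRenamed("Start Station", "src")
      .withColumnRenamed("End Station", "dst")
}
```

Calling `CleanTrips.clean(trips).show(10, false)` would then print the first ten rows without truncating long station names.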

6. Create the vertices and edges, then use GraphFrames to build a graph g from them.

  • The following is the source code

  • The following is the output of the vertices
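A minimal sketch of the graph construction, assuming the station data frame identifies each station by a `name` column and the trip data frame already carries `src` and `dst` columns. Both column assumptions and the object name `BuildGraph` are illustrative rather than taken from the original code.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

object BuildGraph {
  // GraphFrames requires an "id" column on vertices and "src"/"dst" on edges.
  def build(stations: DataFrame, trips: DataFrame): GraphFrame = {
    // Assumed: "name" uniquely identifies a station.
    val vertices = stations.withColumnRenamed("name", "id").dropDuplicates("id")
    // Any extra trip columns are kept as edge attributes.
    val edges = trips
    GraphFrame(vertices, edges)
  }
}
```

With `val g = BuildGraph.build(stations, trips)`, a sample of the graph can be inspected with `g.vertices.show(5)` and `g.edges.show(5)`.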

7. Display some vertices from graph g.

  • The following is the source code

  • The following is the output of the vertices

8. Display some edges from graph g.

  • The following is the source code

  • The following is the output of the edges

9. Display the vertex in-degrees from graph g.

  • The following is the source code

  • The following is the output of the vertex in-degrees

10. Display the vertex out-degrees from graph g.

  • The following is the source code

  • The following is the output of the vertex out-degrees
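The in-degree and out-degree steps can be sketched with the `inDegrees` and `outDegrees` data frames that GraphFrames exposes on a graph. Sorting and limiting the result is an addition for readability, and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object Degrees {
  // inDegrees has columns (id, inDegree): the number of edges arriving at each vertex.
  def topInDegrees(g: GraphFrame, n: Int): DataFrame =
    g.inDegrees.orderBy(desc("inDegree")).limit(n)

  // outDegrees has columns (id, outDegree): the number of edges leaving each vertex.
  def topOutDegrees(g: GraphFrame, n: Int): DataFrame =
    g.outDegrees.orderBy(desc("outDegree")).limit(n)
}
```

Note that vertices with no incoming edges do not appear in `inDegrees` at all, and likewise for `outDegrees`.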

11. Apply motif finding to graph g, that is, search the graph for subgraphs that match a given structural pattern.

  • The following is the source code

  • The following is the output of the motif finding.
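To illustrate motif finding, the sketch below searches for pairs of stations with trips in both directions. The specific pattern is an assumption chosen for illustration, as the original page does not state which motif was used; the object name `Motifs` is hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

object Motifs {
  // Finds pairs of vertices connected in both directions: a -> b and b -> a.
  // Each matching pair appears twice, once as (a, b) and once as (b, a).
  def roundTrips(g: GraphFrame): DataFrame =
    g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
      .filter("a.id != b.id") // ignore self-loops
}
```

The result has one struct column per named element of the pattern (`a`, `e1`, `b`, `e2`), so fields are addressed as `a.id`, `e1.src`, and so on.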

Bonus

1. Find the vertex degree

  • The following is the source code

  • The following is the output of the vertex degree
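A short sketch of the total degree computation, using the `degrees` data frame from GraphFrames, which counts incoming and outgoing edges together. The ordering and the names `VertexDegree` and `topDegrees` are illustrative additions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object VertexDegree {
  // degrees has columns (id, degree), where degree = in-degree + out-degree.
  def topDegrees(g: GraphFrame, n: Int): DataFrame =
    g.degrees.orderBy(desc("degree")).limit(n)
}
```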

2. Find the most common destinations in the dataset, from location to location

  • The following is the source code

  • The following is the output of the most common destinations
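One way to approach this, sketched under the assumption that "location to location" means a (source, destination) pair, is to group the edges by both endpoints and count them. The names `CommonRoutes` and `topRoutes` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object CommonRoutes {
  // Counts trips per (src, dst) pair and returns the n most frequent pairs.
  def topRoutes(g: GraphFrame, n: Int): DataFrame =
    g.edges
      .groupBy("src", "dst")
      .count()
      .orderBy(desc("count"))
      .limit(n)
}
```

This only gives meaningful counts if repeated trips between the same stations are still present in the edge data frame, so any deduplication should be on whole trip records rather than on the station pair alone.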

3. Find the station with the highest ratio of in-degree to out-degree, that is, many incoming trips but few outgoing ones.

  • The following is the source code

  • The following is the output of the station with the highest in-degree to out-degree ratio.
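The ratio can be sketched by joining the in-degree and out-degree data frames on the vertex id and dividing one column by the other. The `ratio` column name and the object name `DegreeRatio` are assumptions for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}
import org.graphframes.GraphFrame

object DegreeRatio {
  // Joins in-degrees and out-degrees per vertex and ranks by inDegree / outDegree.
  // The inner join drops vertices that lack either incoming or outgoing edges,
  // so outDegree is never zero here.
  def inOutRatio(g: GraphFrame): DataFrame =
    g.inDegrees
      .join(g.outDegrees, "id")
      .withColumn("ratio", col("inDegree") / col("outDegree"))
      .orderBy(desc("ratio"))
}
```

The first row of `DegreeRatio.inOutRatio(g)` is then the station that receives the most trips relative to the number it sends out.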

4. Save the graph to output files.

  • The following is the source code

  • The following are the generated output files for vertices and edges.
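A graph can be persisted by writing its vertex and edge data frames separately. The sketch below uses Parquet, which is an assumption, since the original output format is not stated; the directory layout and the name `SaveGraph` are also hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object SaveGraph {
  // Writes vertices and edges to <dir>/vertices and <dir>/edges as Parquet,
  // replacing any previous output in those locations.
  def save(g: GraphFrame, dir: String): Unit = {
    g.vertices.write.mode("overwrite").parquet(s"$dir/vertices")
    g.edges.write.mode("overwrite").parquet(s"$dir/edges")
  }

  // Rebuilds the graph from the two saved data frames.
  def load(spark: SparkSession, dir: String): GraphFrame =
    GraphFrame(
      spark.read.parquet(s"$dir/vertices"),
      spark.read.parquet(s"$dir/edges")
    )
}
```

Parquet can reject column names that contain spaces, so columns such as a station name header with a space may need renaming first, or the data frames can be written with `.csv(...)` instead.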
