ICP 12 - manaswinivedula/Big-Data-Programming GitHub Wiki
Spark GraphFrames and GraphX
Initial installation of libraries
- First, the GraphFrames package was downloaded through spark-shell, as shown below
- Then these libraries were added to the build.sbt file
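A minimal sketch of what such a build.sbt could look like. The version numbers and the Scala version are assumptions, not taken from the original project, and should be matched to the local Spark installation.

```scala
// build.sbt (sketch; all version numbers below are assumptions)
// A comparable spark-shell launch would be:
//   spark-shell --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
scalaVersion := "2.12.15"

// GraphFrames is published to the Spark Packages repository, not Maven Central.
resolvers += "Spark Packages" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.0.1",
  "org.apache.spark" %% "spark-sql"    % "3.0.1",
  "org.apache.spark" %% "spark-graphx" % "3.0.1",
  "graphframes"       % "graphframes"  % "0.8.1-spark3.0-s_2.12"
)
```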
Task1
1. Import the dataset from CSV files, create DataFrames directly on import, and create a graph from those DataFrames.
- The following is the source code
- The following is the output of the data frames.
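A minimal sketch of reading a CSV straight into a DataFrame. The file contents and column names here are hypothetical stand-ins written to a temporary file so the snippet runs on its own; the real dataset's files and columns may differ.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("csv-import-sketch").getOrCreate()

// Hypothetical sample data standing in for the station CSV.
val stationCsv = Files.createTempFile("stations", ".csv")
Files.write(
  stationCsv,
  "station_id,name,lat,long\n1,A Street,37.33,-121.90\n2,B Street,37.34,-121.89\n".getBytes("UTF-8")
)

// header: first line holds column names; inferSchema: detect numeric columns.
val stations = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(stationCsv.toUri.toString)

stations.printSchema()
stations.show()
```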
2. Concatenate the latitude and longitude chunks into a list and convert them into a single DataFrame.
- The following is the source code
- The following is the output of the concatenated data frames
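One possible way to do this is to combine the two columns into a single array column. This is a sketch under that assumption; the column names `lat`, `long`, and `lat_long` and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col}

val spark = SparkSession.builder().master("local[*]").appName("concat-sketch").getOrCreate()
import spark.implicits._

// Hypothetical station rows.
val stations = Seq(
  (1, "A Street", 37.33, -121.90),
  (2, "B Street", 37.34, -121.89)
).toDF("station_id", "name", "lat", "long")

// Combine latitude and longitude into one list-valued (array) column.
val withLocation = stations.withColumn("lat_long", array(col("lat"), col("long")))

withLocation.show(truncate = false)
```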
3. Remove duplicates from the DataFrames.
- The following is the source code
4. Rename the columns of the trip DataFrame.
- The following is the source code
5. Display the output DataFrame.
- The following is the source code
- The following is the output after removing duplicates and renaming the columns.
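Steps 3 to 5 can be sketched together as follows. The original column names (`Trip ID`, `Start Station`, `End Station`) and the sample rows are assumptions for illustration; renaming to `src` and `dst` is chosen because GraphFrames expects those names on an edge DataFrame.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("clean-sketch").getOrCreate()
import spark.implicits._

// Hypothetical trip rows, including one exact duplicate.
val trips = Seq(
  (1, "A Street", "B Street"),
  (1, "A Street", "B Street"),
  (2, "B Street", "A Street")
).toDF("Trip ID", "Start Station", "End Station")

val cleaned = trips
  .distinct()                                   // step 3: drop duplicate rows
  .withColumnRenamed("Trip ID", "trip_id")      // step 4: rename columns
  .withColumnRenamed("Start Station", "src")
  .withColumnRenamed("End Station", "dst")

cleaned.show()                                  // step 5: display the result
```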
6. Create the vertices and edges, then use GraphFrames to build a graph g from them.
- The following is the source code
- The following is the output of the vertices
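A minimal sketch of building the graph, assuming stations are the vertices and trips are the edges. GraphFrames requires an `id` column on the vertex DataFrame and `src` and `dst` columns on the edge DataFrame; the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("graph-sketch").getOrCreate()
import spark.implicits._

// Vertices need an "id" column; here the station name serves as the id.
val vertices = Seq(
  ("A Street", 37.33, -121.90),
  ("B Street", 37.34, -121.89),
  ("C Street", 37.35, -121.88)
).toDF("id", "lat", "long")

// Edges need "src" and "dst" columns that refer to vertex ids.
val edges = Seq(
  ("A Street", "B Street", 1),
  ("B Street", "A Street", 2),
  ("C Street", "B Street", 3)
).toDF("src", "dst", "trip_id")

val g = GraphFrame(vertices, edges)

// g.vertices and g.edges are ordinary DataFrames.
g.vertices.show(5)
g.edges.show(5)
```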
7. Display some vertices from graph g.
- The following is the source code
- The following is the output of the vertices
8. Display some edges from graph g.
- The following is the source code
- The following is the output of the edges
9. Display the vertex in-degrees from graph g.
- The following is the source code
- The following is the output of the vertex in-degrees
10. Display the vertex out-degrees from graph g.
- The following is the source code
- The following is the output of the vertex out-degrees
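In-degrees and out-degrees are exposed by GraphFrames as DataFrames with columns `id` and `inDegree` or `outDegree`. A sketch on a small hypothetical graph:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("degree-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stations and trips.
val vertices = Seq("A Street", "B Street", "C Street").toDF("id")
val edges = Seq(
  ("A Street", "B Street"),
  ("C Street", "B Street"),
  ("B Street", "A Street")
).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Number of trips ending at each station.
g.inDegrees.orderBy(desc("inDegree")).show()

// Number of trips starting at each station.
g.outDegrees.orderBy(desc("outDegree")).show()
```

Vertices with no incoming (or outgoing) edges do not appear in `inDegrees` (or `outDegrees`) at all.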
11. Apply motif finding on graph g, that is, search the graph for subgraphs that match a given structural pattern.
- The following is the source code
- The following is the output of the motif finding.
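As an illustration of motif finding, the sketch below searches for pairs of stations connected in both directions. The pattern itself is an assumption chosen for the example, not necessarily the one used in the original code, and the sample graph is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("motif-sketch").getOrCreate()
import spark.implicits._

val vertices = Seq("A Street", "B Street", "C Street").toDF("id")
val edges = Seq(
  ("A Street", "B Street"),
  ("B Street", "A Street"),
  ("C Street", "B Street")
).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// (a)-[e1]->(b); (b)-[e2]->(a): an edge from a to b and another from b back to a.
val roundTrips = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")

// Each named element of the pattern becomes a struct column in the result.
roundTrips.select(col("a.id").as("station_a"), col("b.id").as("station_b")).show()
```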
Bonus
1. Find the vertex degrees
- The following is the source code
- The following is the output of the vertex degree
2. Find the most common destinations in the dataset, from location to location
- The following is the source code
- The following is the output of the most common destinations
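One way to sketch this is to count how often each (source, destination) pair occurs among the edges and sort by that count. The sample trips are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("common-dest-sketch").getOrCreate()
import spark.implicits._

val vertices = Seq("A Street", "B Street", "C Street").toDF("id")
val edges = Seq(
  ("A Street", "B Street"),
  ("A Street", "B Street"),
  ("C Street", "B Street"),
  ("B Street", "A Street")
).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Count trips per (src, dst) pair and list the most frequent pairs first.
val topRoutes = g.edges
  .groupBy("src", "dst")
  .count()
  .orderBy(desc("count"))

topRoutes.show(10)
```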
3. Find the station with the highest ratio of in-degree to out-degree, that is, many incoming trips but few outgoing ones.
- The following is the source code
- The following is the output of the station with the highest in-degree to out-degree ratio.
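A sketch of the ratio computation, assuming it is done by joining the in-degree and out-degree DataFrames on `id`. The column name `in_out_ratio` and the sample graph are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("ratio-sketch").getOrCreate()
import spark.implicits._

val vertices = Seq("A Street", "B Street", "C Street").toDF("id")
val edges = Seq(
  ("A Street", "B Street"),
  ("C Street", "B Street"),
  ("B Street", "A Street"),
  ("A Street", "C Street")
).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Inner join: stations missing from either side (zero in- or out-degree) are dropped.
val ratios = g.inDegrees
  .join(g.outDegrees, "id")
  .withColumn("in_out_ratio", col("inDegree") / col("outDegree"))
  .orderBy(desc("in_out_ratio"))

// The first row is the station with the highest ratio.
ratios.show(1)
```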
4. Save the graph to output files.
- The following is the source code
- The following are the generated output files for vertices and edges.
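Since a GraphFrame is backed by two DataFrames, one way to save it is to write the vertices and edges separately. The sketch below writes CSV into a temporary directory; the output format and location are assumptions, not taken from the original code.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").appName("save-sketch").getOrCreate()
import spark.implicits._

val vertices = Seq("A Street", "B Street").toDF("id")
val edges = Seq(("A Street", "B Street"), ("B Street", "A Street")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Hypothetical output location; replace with a real path as needed.
val outDir = Files.createTempDirectory("graph-out")

// Each write produces a directory of part files.
g.vertices.write.mode("overwrite").option("header", "true")
  .csv(outDir.resolve("vertices").toUri.toString)
g.edges.write.mode("overwrite").option("header", "true")
  .csv(outDir.resolve("edges").toUri.toString)

println(s"Graph written under $outDir")
```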