ICP 12 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson Plan 12: GraphFrames and GraphX

Lesson Plan Description: Distributed Collection of Data

Part – 1:

Create the project and update the below dependencies for graphframes in the build.sbt file.

image
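For readers who cannot see the screenshot, a build.sbt along these lines would pull in GraphFrames. This is only a sketch: the project name is hypothetical, and the Scala, Spark and GraphFrames versions are assumptions that need to match the Spark installation actually in use.

```scala
// build.sbt (sketch; all version numbers below are assumptions)
name := "icp12-graphframes"          // hypothetical project name
scalaVersion := "2.12.15"

// GraphFrames is published on the Spark Packages repository, not Maven Central
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.0.3",
  "org.apache.spark" %% "spark-sql"    % "3.0.3",
  "org.apache.spark" %% "spark-graphx" % "3.0.3",
  "org.apache.spark" %% "spark-mllib"  % "3.0.3",
  // the artifact version encodes the Spark line (3.0) and Scala line (2.12)
  "graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12"
)
```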

1. Import the dataset as a CSV file and create DataFrames directly on import, then create a graph from those DataFrames.

Here we import the two datasets, stations.csv and trips.csv, and create a DataFrame from each one directly on import.

image

image

image
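The import step can be sketched roughly as follows. The file locations, the presence of a header row, and the names `LoadCsvSketch` and `readCsv` are assumptions made for illustration, not the code in the screenshots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadCsvSketch {
  // Read a CSV file that has a header row, letting Spark infer column types.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-load")   // hypothetical application name
      .master("local[*]")      // assumption: running locally
      .getOrCreate()

    // Assumption: both files sit in the working directory.
    val stations = readCsv(spark, "stations.csv")
    val trips    = readCsv(spark, "trips.csv")

    stations.printSchema()
    trips.show(5, truncate = false)
    spark.stop()
  }
}
```

Schema inference costs an extra pass over each file; for larger inputs an explicit schema may be preferable.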

2. Concatenate chunks into list & convert to Data Frame

image

image

3. Remove duplicates

To remove duplicate data, we apply the distinct function to the dockcount column of the station dataset. distinct returns only the unique values of that column, discarding the repeats. We then use the dropDuplicates function to remove duplicate rows from both DataFrames.

image

image
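A minimal sketch of the two de-duplication calls described above, run on a tiny in-memory DataFrame. The sample rows and the helper names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupSketch {
  // Unique values of a single column.
  def distinctValues(df: DataFrame, column: String): DataFrame =
    df.select(column).distinct()

  // Drop rows that are identical across all columns.
  def dropExactDuplicates(df: DataFrame): DataFrame =
    df.dropDuplicates()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-dedup").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample: the second row repeats the first.
    val stations = Seq(("A", 15), ("A", 15), ("B", 15), ("C", 27)).toDF("name", "dockcount")

    distinctValues(stations, "dockcount").show() // two distinct dock counts
    dropExactDuplicates(stations).show()         // three rows remain
    spark.stop()
  }
}
```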

4. Name Columns

image

image

image

5. Output Data Frame

image

image

6. Create vertices, 7. Show some vertices, 8. Show some edges

We rename columns in both datasets so that they match what GraphFrames expects. In the code below, the 'name' column of the station data is renamed to 'id', and the 'start station' and 'end station' columns of the trips table are renamed to 'src' and 'dst' (GraphFrames requires the edge columns to be called exactly 'src' and 'dst'). The station DataFrame then supplies the vertices and the trips DataFrame supplies the edges of the graph.

image

image

image
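The renaming and graph construction can be sketched as below. The exact column headers ("Start Station", "End Station", "Duration") and the sample rows are assumptions; adjust them to the headers in the real CSV files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object BuildGraphSketch {
  // GraphFrames needs an "id" column on vertices and "src"/"dst" on edges.
  def buildGraph(stations: DataFrame, trips: DataFrame): GraphFrame = {
    val vertices = stations.withColumnRenamed("name", "id")
    val edges = trips
      .withColumnRenamed("Start Station", "src") // assumed header
      .withColumnRenamed("End Station", "dst")   // assumed header
    GraphFrame(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-graph").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for stations.csv and trips.csv.
    val stations = Seq(("A", 15), ("B", 27)).toDF("name", "dockcount")
    val trips = Seq(("A", "B", 1000), ("B", "A", 500))
      .toDF("Start Station", "End Station", "Duration")

    val g = buildGraph(stations, trips)
    g.vertices.show() // show some vertices
    g.edges.show()    // show some edges
    spark.stop()
  }
}
```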

9. Vertex in-Degree

Here we display the in-degree of each vertex in descending order, limited to the top 5.

image

image

10. Vertex out-Degree

Here we display the out-degree of each vertex in descending order, limited to the top 5.

image

image
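The in-degree and out-degree rankings can be sketched as follows, using a small made-up graph. `inDegrees` and `outDegrees` are GraphFrames properties returning DataFrames with the columns (id, inDegree) and (id, outDegree); the helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object DegreeSketch {
  // Top-n vertices by number of incoming edges.
  def topInDegrees(g: GraphFrame, n: Int): DataFrame =
    g.inDegrees.orderBy(desc("inDegree")).limit(n)

  // Top-n vertices by number of outgoing edges.
  def topOutDegrees(g: GraphFrame, n: Int): DataFrame =
    g.outDegrees.orderBy(desc("outDegree")).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-degrees").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical three-station graph.
    val v = Seq("A", "B", "C").toDF("id")
    val e = Seq(("A", "B"), ("C", "B"), ("A", "C")).toDF("src", "dst")
    val g = GraphFrame(v, e)

    topInDegrees(g, 5).show()
    topOutDegrees(g, 5).show()
    spark.stop()
  }
}
```

Vertices with no incoming (or outgoing) edges do not appear in `inDegrees` (or `outDegrees`) at all, rather than appearing with a count of zero.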

11. Apply the motif findings.

Here we use motif finding, which searches the graph for a structural pattern. The pattern we wrote describes a subgraph in which a vertex a has an edge to a vertex b and b has an edge back to a. We then display the matches.

image

image
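A two-way pattern of this kind can be sketched with the GraphFrames `find` method. The exact motif string used in the screenshot is not visible here, so the pattern below is an assumption based on the description; the sample graph is made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object MotifSketch {
  // Pairs of vertices connected in both directions: a -> b and b -> a.
  def roundTrips(g: GraphFrame): DataFrame =
    g.find("(a)-[e1]->(b); (b)-[e2]->(a)")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-motif").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical graph: A and B are linked both ways, A -> C is one-way.
    val v = Seq("A", "B", "C").toDF("id")
    val e = Seq(("A", "B"), ("B", "A"), ("A", "C")).toDF("src", "dst")

    roundTrips(GraphFrame(v, e)).show(truncate = false)
    spark.stop()
  }
}
```

Each two-way pair is reported twice, once with each station in the role of a. A trip that starts and ends at the same station would also match with a equal to b; adding `.filter("a.id != b.id")` is one way to exclude those.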

12. Apply Stateful Queries.

This code finds a motif and then filters the matches by carrying state along the path, i.e. accumulating a value edge by edge as the path is traversed.

image

image
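One way to carry state along a path, following the pattern used in the GraphFrames user guide, is to fold over the edge names of a motif and update a counter column at each step. The sketch below counts how many legs of a two-trip chain exceed a duration threshold; the "Duration" column, the threshold and the sample data are assumptions, and this is not necessarily the query in the screenshot.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}
import org.graphframes.GraphFrame

object StatefulSketch {
  // Chains a -> b -> c where at least `minLong` legs last longer than `minDuration`.
  def longChains(g: GraphFrame, minDuration: Int, minLong: Int): DataFrame = {
    val chain = g.find("(a)-[ab]->(b); (b)-[bc]->(c)")
    // State: a running count of long legs, updated edge by edge.
    val longLegs = Seq("ab", "bc").foldLeft(lit(0)) { (cnt, edge) =>
      when(col(edge)("Duration") > minDuration, cnt + 1).otherwise(cnt)
    }
    chain.where(longLegs >= minLong)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-stateful").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical trips with durations.
    val v = Seq("A", "B", "C").toDF("id")
    val e = Seq(("A", "B", 1000), ("B", "C", 500), ("B", "A", 2000))
      .toDF("src", "dst", "Duration")

    longChains(GraphFrame(v, e), minDuration = 932, minLong = 2).show(truncate = false)
    spark.stop()
  }
}
```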

13. Subgraphs with a condition.

This code builds a subgraph that keeps only the trips whose duration is greater than 932.

image

image
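A subgraph of this kind can be sketched by filtering the edge DataFrame and building a new GraphFrame from the result. The "Duration" column name and the sample data are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object SubgraphSketch {
  // Keep all vertices, but only edges whose duration exceeds the threshold.
  def longTrips(g: GraphFrame, minDuration: Int): GraphFrame =
    GraphFrame(g.vertices, g.edges.filter(s"Duration > $minDuration"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-subgraph").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical trips with durations.
    val v = Seq("A", "B", "C").toDF("id")
    val e = Seq(("A", "B", 1000), ("B", "C", 500), ("B", "A", 2000))
      .toDF("src", "dst", "Duration")

    longTrips(GraphFrame(v, e), 932).edges.show()
    spark.stop()
  }
}
```

Building the subgraph through the constructor keeps vertices that no longer have any edges; whether those should be dropped depends on the analysis.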

Bonus

1. Vertex degree

image

image

2. What are the most common destinations in the dataset from location to location?

This is the code to display the top 10 most common destinations.

image

image
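Counting the most common station-to-station routes amounts to grouping the edges by their endpoints. A minimal sketch on made-up data (the helper name is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object TopRoutesSketch {
  // The n most frequent (src, dst) pairs among all trips.
  def topRoutes(g: GraphFrame, n: Int): DataFrame =
    g.edges
      .groupBy("src", "dst")
      .count()
      .orderBy(desc("count"))
      .limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-routes").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical trips: A -> B happens twice, B -> A once.
    val v = Seq("A", "B").toDF("id")
    val e = Seq(("A", "B"), ("A", "B"), ("B", "A")).toDF("src", "dst")

    topRoutes(GraphFrame(v, e), 10).show()
    spark.stop()
  }
}
```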

3. Which station has the highest ratio of in-degree to out-degree? In other words, which station acts as almost a pure trip sink: a station where trips end but rarely start?

Here we create views of the in-degrees and out-degrees, select the id, in-degree and out-degree columns, and join the two views on id. We then create a view of the joined result and select the ids from it, ordering by in-degree ascending and out-degree descending, and display them.

image

image
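An alternative to ordering by the two degree columns separately is to compute the ratio explicitly. The sketch below uses the DataFrame API instead of SQL views, so it is a different formulation from the one described above; the sample graph and helper name are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}
import org.graphframes.GraphFrame

object SinkRatioSketch {
  // In-degree divided by out-degree per station, largest ratio first.
  def degreeRatios(g: GraphFrame): DataFrame =
    g.inDegrees
      .join(g.outDegrees, "id") // inner join: stations with no outgoing trips drop out
      .withColumn("ratio", col("inDegree") / col("outDegree"))
      .orderBy(desc("ratio"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-sink").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical graph: B receives two trips and starts one.
    val v = Seq("A", "B", "C").toDF("id")
    val e = Seq(("A", "B"), ("C", "B"), ("B", "A")).toDF("src", "dst")

    degreeRatios(GraphFrame(v, e)).show()
    spark.stop()
  }
}
```

Because of the inner join, a station with zero out-degree (a true pure sink) is not listed; a left join with a null check would be needed to surface those.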

4. Save graphs generated to a file.

Here the generated graphs are saved as CSV files at a specified location.

image

image
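Saving a graph as CSV can be sketched by writing its vertex and edge DataFrames separately. The output directory and the "vertices"/"edges" subfolder names are assumptions; note that Spark writes each DataFrame as a directory of part files rather than a single .csv file.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object SaveGraphSketch {
  // Write vertices and edges as headered CSV under outDir/vertices and outDir/edges.
  def saveAsCsv(g: GraphFrame, outDir: String): Unit = {
    g.vertices.write.mode("overwrite").option("header", "true").csv(s"$outDir/vertices")
    g.edges.write.mode("overwrite").option("header", "true").csv(s"$outDir/edges")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP12-save").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical two-station graph.
    val v = Seq("A", "B").toDF("id")
    val e = Seq(("A", "B")).toDF("src", "dst")

    saveAsCsv(GraphFrame(v, e), "output/graph") // hypothetical output location
    spark.stop()
  }
}
```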