ICP 12 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki
Lesson Plan 12: GraphFrames and GraphX
Lesson Plan Description: Distributed Collection of Data
Part – 1:
Create the project and add the GraphFrames dependencies to the build.sbt file.
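A build.sbt along the following lines is one way to pull in Spark and GraphFrames. The project name and all version numbers here are assumptions for illustration; they need to match the Spark and Scala versions actually installed.

```scala
// build.sbt (sketch; project name and versions are assumptions)
name := "icp12-graphframes"
version := "0.1"
scalaVersion := "2.12.10"

// This GraphFrames build is published on the Spark Packages repository.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.0.1",
  "org.apache.spark" %% "spark-sql"    % "3.0.1",
  "org.apache.spark" %% "spark-graphx" % "3.0.1",
  // The suffix encodes the Spark and Scala versions the artifact was built for.
  "graphframes"       % "graphframes"  % "0.8.1-spark3.0-s_2.12"
)
```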
1. Import the dataset as a CSV file, create DataFrames directly on import, then create a graph from those DataFrames.
Here we import the two datasets, stations.csv and trips.csv, and create a DataFrame from each directly on read.
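As a rough sketch of the import step, the two files can be read straight into DataFrames with Spark's CSV reader. The object name `LoadBikeData`, the helper `readCsv`, the local master setting, and the assumption that both files sit in the working directory and have a header row are illustrative choices, not taken from the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadBikeData {
  // Reads a CSV file that has a header row and lets Spark infer column types.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Load")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Assumption: both files are in the working directory.
    val stations = readCsv(spark, "stations.csv")
    val trips = readCsv(spark, "trips.csv")

    stations.printSchema()
    trips.show(5, truncate = false)

    spark.stop()
  }
}
```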
2. Concatenate chunks into list & convert to Data Frame
3. Remove duplicates
To remove duplicate data, we apply the distinct function to the dockcount column of the station dataset. distinct returns the unique values of that column and discards the repeats. We then use the dropDuplicates function to remove duplicate rows from both DataFrames.
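The difference between the two calls can be seen on a tiny in-memory DataFrame. This is a minimal sketch: the object and helper names are hypothetical, and the sample rows are made up rather than taken from stations.csv.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupExample {
  // Unique values of a single column.
  def uniqueValues(df: DataFrame, column: String): DataFrame =
    df.select(column).distinct()

  // Drops rows that are identical in every column.
  def dedupRows(df: DataFrame): DataFrame = df.dropDuplicates()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Dedup")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up stand-in for the station data.
    val stations = spark.createDataFrame(Seq(
      ("Station A", 15), ("Station B", 15), ("Station A", 15)
    )).toDF("name", "dockcount")

    uniqueValues(stations, "dockcount").show() // a single row: 15
    dedupRows(stations).show()                 // two rows remain

    spark.stop()
  }
}
```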
4. Name Columns
5. Output Data Frame
6. Create vertices, 7. Show some vertices, 8. Show some edges
We rename the columns in both datasets. In the code, the column 'name' of the station data is renamed to 'id', and the columns 'start station' and 'end station' of the trips data are renamed to 'src' and 'dst', the edge column names that GraphFrames requires. The station DataFrame then supplies the vertices and the trips DataFrame supplies the edges of the graph.
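The renaming and graph construction might look like the sketch below. The exact column spellings ("Start Station", "End Station", "Duration"), the object name `BuildGraph`, and the sample rows are assumptions; GraphFrames itself only requires an `id` column on the vertices and `src` and `dst` columns on the edges.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object BuildGraph {
  // Renames the columns to the names GraphFrames expects and builds the graph.
  def build(stations: DataFrame, trips: DataFrame): GraphFrame = {
    val vertices = stations.withColumnRenamed("name", "id")
    val edges = trips
      .withColumnRenamed("Start Station", "src")
      .withColumnRenamed("End Station", "dst")
    GraphFrame(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Graph")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up rows standing in for stations.csv and trips.csv.
    val stations = spark.createDataFrame(Seq(
      ("Station A", 15), ("Station B", 27)
    )).toDF("name", "dockcount")
    val trips = spark.createDataFrame(Seq(
      (600, "Station A", "Station B"), (1200, "Station B", "Station A")
    )).toDF("Duration", "Start Station", "End Station")

    val g = build(stations, trips)
    g.vertices.show(5, truncate = false)
    g.edges.show(5, truncate = false)

    spark.stop()
  }
}
```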
9. Vertex in-Degree
This code displays the five vertices with the highest in-degree, in descending order.
10. Vertex out-Degree
This code displays the five vertices with the highest out-degree, in descending order.
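Both degree queries can be sketched with the `inDegrees` and `outDegrees` DataFrames that GraphFrames exposes. The object name, helper names, and sample edges below are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object DegreeExample {
  // The n vertices that receive the most edges (trips ending there).
  def topInDegrees(g: GraphFrame, n: Int): DataFrame =
    g.inDegrees.orderBy(desc("inDegree")).limit(n)

  // The n vertices that emit the most edges (trips starting there).
  def topOutDegrees(g: GraphFrame, n: Int): DataFrame =
    g.outDegrees.orderBy(desc("outDegree")).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Degrees")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips; the vertices are derived from the edge endpoints.
    val edges = spark.createDataFrame(Seq(
      ("A", "B"), ("A", "B"), ("B", "A"), ("C", "B")
    )).toDF("src", "dst")
    val g = GraphFrame.fromEdges(edges)

    topInDegrees(g, 5).show()
    topOutDegrees(g, 5).show()

    spark.stop()
  }
}
```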
11. Apply the motif findings.
This code finds motifs. The pattern describes the subgraph we are looking for: a vertex a with an edge to a vertex b, and an edge from b back to a. The result lists every match of that pattern.
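A round-trip pattern of this kind can be written in the GraphFrames motif language as shown in the sketch below. The exact pattern used in the original code is not shown on this page, so the pattern string, object name, and sample edges are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object MotifExample {
  // Matches every pair of edges a -> b and b -> a.
  // Each match is one row with struct columns a, e, b and e2.
  def roundTrips(g: GraphFrame): DataFrame =
    g.find("(a)-[e]->(b); (b)-[e2]->(a)")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Motif")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips: A and B are connected in both directions, C is not.
    val edges = spark.createDataFrame(Seq(
      ("A", "B"), ("B", "A"), ("A", "C")
    )).toDF("src", "dst")
    val g = GraphFrame.fromEdges(edges)

    roundTrips(g).show(truncate = false)

    spark.stop()
  }
}
```

Because trips form a multigraph, a real dataset yields one row per matching pair of edges, so the same two stations can appear many times in the result.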
12. Apply Stateful Queries.
This code combines motif finding with a condition that carries state along the path.
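One common way to carry state along a motif path is to fold over the edge columns and accumulate a value, then filter on the accumulated result. The sketch below counts how many legs of a two-leg path are long trips. The "Duration" column name, the thresholds, and all object and helper names are assumptions, and the original query may have tracked a different state.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}
import org.graphframes.GraphFrame

object StatefulExample {
  // Adds 1 to the running count when this leg is longer than minDuration.
  def countLong(cnt: Column, duration: Column, minDuration: Int): Column =
    when(duration > minDuration, cnt + 1).otherwise(cnt)

  // Two-leg paths a -> b -> c with at least minLegs long legs.
  def pathsWithLongLegs(g: GraphFrame, minDuration: Int, minLegs: Int): DataFrame = {
    val chain = g.find("(a)-[ab]->(b); (b)-[bc]->(c)")
    // The fold threads the running count through each edge on the path.
    val longLegs = Seq("ab", "bc").foldLeft(lit(0)) { (cnt, e) =>
      countLong(cnt, col(e)("Duration"), minDuration)
    }
    chain.where(longLegs >= minLegs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Stateful")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips with a duration attribute.
    val edges = spark.createDataFrame(Seq(
      ("A", "B", 1000), ("B", "C", 1200), ("B", "A", 100)
    )).toDF("src", "dst", "Duration")
    val g = GraphFrame.fromEdges(edges)

    pathsWithLongLegs(g, minDuration = 932, minLegs = 2).show(truncate = false)

    spark.stop()
  }
}
```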
13. Subgraphs with a condition.
This code retrieves all trips with a trip duration greater than 932.
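A subgraph of this kind can be sketched by filtering the edge DataFrame and rebuilding the graph from the original vertices and the surviving edges. The "Duration" column name, the object name, and the sample rows are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

object SubgraphExample {
  // Keeps every vertex but only the edges whose duration exceeds the threshold.
  def longTrips(g: GraphFrame, minDuration: Int): GraphFrame =
    GraphFrame(g.vertices, g.edges.filter(col("Duration") > minDuration))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Subgraph")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips with a duration attribute.
    val edges = spark.createDataFrame(Seq(
      ("A", "B", 1000), ("B", "A", 500), ("A", "C", 933)
    )).toDF("src", "dst", "Duration")
    val g = GraphFrame.fromEdges(edges)

    longTrips(g, 932).edges.show()

    spark.stop()
  }
}
```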
Bonus
1. Vertex degree
2. What are the most common destinations in the dataset from location to location?
This code displays the top 10 most common destinations.
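One way to read "from location to location" is to count trips per (src, dst) pair and sort by that count. The sketch below does that; the object name and sample edges are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

object TopRoutes {
  // Counts trips per (src, dst) pair and keeps the n most frequent pairs.
  def topRoutes(g: GraphFrame, n: Int): DataFrame =
    g.edges.groupBy("src", "dst").count().orderBy(desc("count")).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-TopRoutes")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips.
    val edges = spark.createDataFrame(Seq(
      ("A", "B"), ("A", "B"), ("B", "A"), ("C", "B")
    )).toDF("src", "dst")
    val g = GraphFrame.fromEdges(edges)

    topRoutes(g, 10).show(truncate = false)

    spark.stop()
  }
}
```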
3. What is the station with the highest ratio of in-degree to out-degree? That is, which station acts as almost a pure trip sink: a station where trips end but rarely start.
This code creates views of the in-degrees and the out-degrees, selects the id, in-degree and out-degree, and joins the two on id. It then creates a view from the joined result, selects the id ordered by in-degree ascending and out-degree descending, and displays the result.
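An alternative to ordering on the two degree columns separately is to compute the ratio of in-degree to out-degree explicitly and sort on it. The sketch below takes that ratio approach, which is a different technique from the view-based query described above; the object name, the `degreeRatio` column name, and the sample edges are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}
import org.graphframes.GraphFrame

object SinkExample {
  // Joins in- and out-degrees on id and ranks stations by inDegree / outDegree.
  // Stations with no outgoing trips are absent from outDegrees, so the
  // inner join drops them.
  def degreeRatio(g: GraphFrame): DataFrame =
    g.inDegrees.join(g.outDegrees, Seq("id"))
      .withColumn("degreeRatio", col("inDegree") / col("outDegree"))
      .orderBy(desc("degreeRatio"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Sink")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips: B receives three trips and starts only one.
    val edges = spark.createDataFrame(Seq(
      ("A", "B"), ("A", "B"), ("B", "A"), ("C", "B")
    )).toDF("src", "dst")
    val g = GraphFrame.fromEdges(edges)

    degreeRatio(g).show(10, truncate = false)

    spark.stop()
  }
}
```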
4. Save graphs generated to a file.
The generated graphs are saved as CSV files in a specified location.
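Since a GraphFrame is a pair of DataFrames, saving it can be sketched as writing the vertices and the edges separately. The output directory layout, the object name, and the choice of overwrite mode are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object SaveGraph {
  // Writes vertices and edges as CSV (with header) under the given directory.
  def save(g: GraphFrame, dir: String): Unit = {
    g.vertices.write.mode("overwrite").option("header", "true").csv(s"$dir/vertices")
    g.edges.write.mode("overwrite").option("header", "true").csv(s"$dir/edges")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ICP12-Save")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    // Made-up trips.
    val edges = spark.createDataFrame(Seq(
      ("A", "B"), ("B", "A")
    )).toDF("src", "dst")
    val g = GraphFrame.fromEdges(edges)

    // Hypothetical output location.
    save(g, "output/graph")

    spark.stop()
  }
}
```

Spark writes each DataFrame as a directory of part files rather than a single CSV file.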