MODULE 2 ICP 5 - navyagonug/CS5590-BIG-DATA-PROGRAMMING-USING-HADOOP-AND-SPARK GitHub Wiki
PROBLEM STATEMENT
1. Import the dataset as a CSV file and create DataFrames directly on import, then create a graph out of the DataFrames created.
2. Concatenate chunks into list & convert to DataFrame
3. Remove duplicates
4. Name Columns
5. Output DataFrame
6. Create vertices
7. Show some vertices
8. Show some edges
9. Vertex in-Degree
10. Vertex out-Degree
11. Apply the motif findings.
FEATURES
IntelliJ IDEA with the Scala plugin is used for this in-class programming exercise. The session focuses primarily on GraphFrames, which are made available by adding the appropriate dependencies to the build.sbt file.
CONFIGURATIONS
The build.sbt file is modified to add the GraphFrames dependency.
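As a rough illustration of what such a build.sbt might contain, the fragment below is a sketch only. The project name, the Scala and Spark versions, and the exact GraphFrames build are assumptions rather than the values used in this ICP; the GraphFrames artifact must match the Spark and Scala versions in use.

```scala
// build.sbt (sketch; the name and every version number below are assumptions)
name := "icp5-graphframes"          // hypothetical project name
version := "0.1"
scalaVersion := "2.11.12"

// This GraphFrames build is resolved from the Spark Packages repository
resolvers += "Spark Packages" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.4.8",
  "org.apache.spark" %% "spark-sql"    % "2.4.8",
  "org.apache.spark" %% "spark-graphx" % "2.4.8",
  // the suffix encodes the Spark (2.4) and Scala (2.11) versions it was built for
  "graphframes" % "graphframes" % "0.8.1-spark2.4-s_2.11"
)
```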
APPROACH
QUESTION 1 Import the dataset as a CSV file and create DataFrames directly on import, then create a graph out of the DataFrames created.
For this, the CSV files already present on the local machine (trips.csv and stations.csv in this case) are loaded and DataFrames are created from them. The following screenshot depicts the loading and creation of the DataFrames.
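A minimal sketch of loading a CSV file straight into a DataFrame is given below. So that it runs without the real dataset, it writes a two-row stand-in for stations.csv to a temporary file; the column names station_id and name and all the values are made up for illustration (only lat and long are mentioned in this page). It is written in script style, so it can be pasted into spark-shell or wrapped in a main method.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("icp5-load").master("local[*]").getOrCreate()

// Stand-in for stations.csv (made-up rows), written to a temp file
val stationsPath = Files.createTempFile("stations", ".csv")
Files.write(
  stationsPath,
  "station_id,name,lat,long\n1,Alpha,37.33,-121.90\n2,Beta,37.34,-121.89\n".getBytes("UTF-8"))

// Read the CSV directly into a DataFrame
val stations = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark infer numeric column types
  .csv(stationsPath.toString)

stations.printSchema()
stations.show()
```

With the real data, the temporary file would be replaced by the local path to stations.csv, and the same call repeated for trips.csv.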
QUESTION 2 Concatenate chunks into list & convert to DataFrame
Two columns of the stations.csv file, lat and long, are concatenated and the results are displayed as follows.
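One way to do this concatenation is with Spark's concat_ws function, sketched below. The sample rows, the comma separator, and the output column name lat_long are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

val spark = SparkSession.builder().appName("icp5-concat").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up stand-in for the stations DataFrame
val stations = Seq(
  (1, "Alpha", 37.33, -121.90),
  (2, "Beta", 37.34, -121.89)
).toDF("station_id", "name", "lat", "long")

// Join lat and long into one string column, separated by a comma
val withLatLong = stations.withColumn(
  "lat_long",
  concat_ws(",", col("lat").cast("string"), col("long").cast("string")))

withLatLong.show(false)
```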
QUESTION 3 Remove duplicates
Duplicate rows are removed by calling distinct() while creating the vertices, so that each vertex appears only once in the graph.
QUESTION 4 Name Columns
The columns are renamed to src and dst, the names GraphFrames expects for the two endpoints of an edge. The following screenshot depicts the snippet.
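GraphFrames reads the endpoints of each edge from columns named src and dst, so the trip columns have to be renamed before the graph is built. The sketch below uses withColumnRenamed; the original column names "Start Station" and "End Station" and the sample rows are assumptions, not taken from trips.csv.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("icp5-rename").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up stand-in for the trips DataFrame
val trips = Seq(
  ("Alpha", "Beta", 300),
  ("Beta", "Alpha", 420)
).toDF("Start Station", "End Station", "Duration")

// Rename the endpoint columns to the names GraphFrames looks for
val edges = trips
  .withColumnRenamed("Start Station", "src")
  .withColumnRenamed("End Station", "dst")

edges.show()
```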
QUESTION 5 Output DataFrame
The dataframes are displayed by using show() function as follows.
QUESTION 6 Create vertices
The vertices are created from the stations.csv file after first removing the duplicates.
QUESTION 7 Show some vertices
The vertices are displayed by using the show() function as follows.
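The sketch below shows one possible way to build the vertices and the graph. GraphFrames requires the vertex DataFrame to have a column named id; using the station name as that id, and all the sample rows, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-vertices").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up station rows, including one exact duplicate
val stations = Seq(
  ("Alpha", 37.33),
  ("Beta", 37.34),
  ("Alpha", 37.33)
).toDF("name", "lat")

// The vertex DataFrame needs an "id" column; distinct() drops the duplicate row
val vertices = stations.withColumnRenamed("name", "id").distinct()

// Made-up edges between the stations
val edges = Seq(("Alpha", "Beta"), ("Beta", "Alpha")).toDF("src", "dst")

val g = GraphFrame(vertices, edges)
g.vertices.show()
g.edges.show()
```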
QUESTION 8 Show some edges
The edges are shown as follows.
QUESTION 9 Vertex in-Degree
For a directed graph G and a vertex v, the in-degree of v refers to the number of arcs incident to v, that is, the number of arcs directed towards the vertex v. The in-degrees are displayed as follows.
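In GraphFrames the in-degrees are available through the inDegrees property of the graph, which returns a DataFrame with the columns id and inDegree. A small sketch on a made-up three-vertex graph (the vertex names and edges are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-indegree").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// One row per vertex that has at least one incoming edge
val inDeg = g.inDegrees
inDeg.orderBy(desc("inDegree")).show()
```

Here C has two incoming edges (from A and B), while A and B have one each. Vertices with no incoming edges do not appear in the result at all.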
QUESTION 10 Vertex out-Degree
For a directed graph G and a vertex v, the out-degree of v refers to the number of arcs incident from v, that is, the number of arcs directed away from the vertex v.
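The out-degrees can be obtained in the same way through the outDegrees property, which returns the columns id and outDegree. The graph below is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-outdegree").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// One row per vertex that has at least one outgoing edge
val outDeg = g.outDegrees
outDeg.orderBy(desc("outDegree")).show()
```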
QUESTION 11 Apply the motif findings.
Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently.
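As an illustration of motif finding in GraphFrames, the sketch below searches for pairs of vertices connected in both directions, using the find method and its pattern syntax. The pattern chosen and the sample graph are assumptions; the motif used in this ICP may have been a different one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-motif").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "B"), ("B", "A"), ("B", "C")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Pattern: an edge from a to b together with an edge from b back to a
val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")

motifs.select(col("a.id").as("from"), col("b.id").as("to")).show()
```

Each match is reported once per ordering of the pair, so the A/B round trip yields two rows: (A, B) and (B, A).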
BONUS
1. Vertex degree
To display the combination of in-degree and out-degree of each and every vertex, the following code snippet is used. The output follows.
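In GraphFrames the combined degree is exposed through the degrees property, which counts incoming and outgoing edges together. A sketch on a made-up graph:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-degree").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// degree = in-degree + out-degree for each vertex
val deg = g.degrees
deg.orderBy(desc("degree")).show()
```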
2. What are the most common destinations in the dataset from location to location?
The following query displays the three most common destinations.
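A query of this kind can be sketched by grouping the edges by their endpoints and counting. The sample trips below are made up, and the query used in this ICP may differ in its details.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-top-dest").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
// Repeated rows stand for repeated trips between the same two stations
val edges = Seq(
  ("A", "B"), ("A", "B"), ("A", "B"),
  ("B", "C"), ("B", "C"),
  ("C", "A")
).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Count trips per (src, dst) pair and keep the three most frequent
val topTrips = g.edges
  .groupBy("src", "dst")
  .count()
  .orderBy(desc("count"))
  .limit(3)

topTrips.show()
```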
3. What is the station with the highest ratio of in-degrees to out-degrees? That is, which station acts as almost a pure trip sink: a station where trips end but rarely start.
To get this result, the in-degree and out-degree tables are first joined, and order by is then applied to both degree columns to find the vertex with a maximum in-degree and a minimum out-degree.
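An alternative to ordering twice is to compute the in/out ratio explicitly and sort by it. The sketch below does that; it is not the approach described above, and the sample graph and the column name inOutRatio are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-sink").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "C"), ("B", "C"), ("A", "B"), ("C", "A")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Join the two degree tables on the vertex id and rank by inDegree / outDegree
val ratio = g.inDegrees
  .join(g.outDegrees, "id")
  .withColumn("inOutRatio", col("inDegree") / col("outDegree"))
  .orderBy(desc("inOutRatio"))

ratio.show()
```

In this toy graph C comes first, with two incoming edges and one outgoing edge. Note that the inner join drops any station with no outgoing trips at all, so a station that is a perfect sink would need an outer join to show up.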
4. Save graphs generated to a file.
The following screenshot depicts the way the graphs are stored as files on the local machine.
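Since a GraphFrame is just a pair of DataFrames, one way to persist it is to write the vertices and the edges separately and rebuild the graph from them later. The sketch below uses Parquet and a temporary directory; the file format and the paths are assumptions, as the format used in this ICP is not stated here.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("icp5-save").master("local[*]").getOrCreate()
import spark.implicits._

val vertices = Seq("A", "B", "C").toDF("id")
val edges = Seq(("A", "B"), ("B", "C")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Hypothetical output location: a fresh temporary directory
val outDir = Files.createTempDirectory("icp5-graph").toString

// Save the two halves of the graph as separate Parquet datasets
g.vertices.write.mode("overwrite").parquet(s"$outDir/vertices")
g.edges.write.mode("overwrite").parquet(s"$outDir/edges")

// Rebuild the graph from the saved files
val reloaded = GraphFrame(
  spark.read.parquet(s"$outDir/vertices"),
  spark.read.parquet(s"$outDir/edges"))

reloaded.edges.show()
```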