MODULE 2 ICP 5 - a190884810/Big-Data-Programming GitHub Wiki
PROBLEM STATEMENT
1. Import the dataset as a CSV file and create data frames directly on import, then create a graph out of the data frames created.
2. Concatenate chunks into list & convert to DataFrame
3. Remove duplicates
4. Name Columns
5. Output DataFrame
6. Create vertices
7. Show some vertices
8. Show some edges
9. Vertex in-Degree
10. Vertex out-Degree
11. Apply the motif findings.
FEATURES
- IntelliJ IDEA with the Scala plugin is used for this in-class programming exercise. The session focuses primarily on GraphFrames, which are made available by adding the appropriate dependencies to the build.sbt file.
CONFIGURATIONS
- The build.sbt file is modified to add the GraphFrames dependency.
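The original build.sbt is not reproduced on this page, so the following is only a sketch of what such a file might look like. The project name and every version number are assumptions, and the GraphFrames artifact has to match whichever Spark and Scala versions are actually installed.

```scala
// build.sbt (sketch: project name and all versions below are assumptions)
name := "graphframes-icp"

scalaVersion := "2.12.15"

// GraphFrames is distributed through the Spark Packages repository,
// so that repository is added as an extra resolver.
resolvers += "Spark Packages" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1",
  "org.apache.spark" %% "spark-sql"  % "3.0.1",
  // The artifact name encodes the Spark and Scala versions it was built for.
  "graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12"
)
```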
APPROACH
QUESTION 1 Import the dataset as a CSV file and create data frames directly on import, then create a graph out of the data frames created.
For this, the CSV files already present on the local machine (trips.csv and stations.csv in this case) are loaded and the dataframes are created. The following screenshot depicts the loading and creation of the dataframes.
QUESTION 2 Concatenate chunks into list & convert to DataFrame
The lat and long columns of the stations.csv file are concatenated, and the results are displayed as follows.
QUESTION 3 Remove duplicates
Duplicate values are removed by calling distinct() while creating the vertices, so the resulting graph contains no duplicate vertices.
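The effect of removing duplicates before building vertices can be illustrated without Spark, using plain Scala collections. This is only a sketch: the `Station` case class and its fields are hypothetical stand-ins for a row of stations.csv, and `Seq.distinct` plays the role that distinct() plays on a DataFrame.

```scala
object VertexDedupSketch {
  // Hypothetical stand-in for one row of stations.csv
  final case class Station(id: String, lat: Double, long: Double)

  // Seq.distinct keeps the first occurrence of each equal element,
  // so every station appears exactly once in the vertex list.
  def uniqueStations(rows: Seq[Station]): Seq[Station] = rows.distinct
}
```

Case classes compare field by field, so two rows count as duplicates only when every field matches.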
QUESTION 4 Name Columns
The columns are renamed to src and dst, the edge column names that GraphFrames expects. The following screenshot depicts the snippet.
QUESTION 5 Output DataFrame
The dataframes are displayed using the show() function as follows.
QUESTION 6 Create vertices
The vertices are created from the stations.csv file after first removing the duplicates.
QUESTION 7 Show some vertices
The vertices are displayed using the show() function as follows.
QUESTION 8 Show some edges
The edges are shown as follows.
QUESTION 9 Vertex in-Degree
For a directed graph G and a vertex v, the in-degree of v is the number of arcs incident to v, that is, the number of arcs directed towards the vertex. The in-degrees are displayed as follows.
QUESTION 10 Vertex out-Degree
For a directed graph G and a vertex v, the out-degree of v is the number of arcs incident from v, that is, the number of arcs directed away from the vertex.
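The two definitions above can be checked on a tiny edge list with plain Scala collections. This is a minimal sketch rather than the GraphFrames code used in the session: the `Edge` case class is an assumption that mirrors the src and dst edge columns GraphFrames requires. In GraphFrames itself, the corresponding results are exposed as the `inDegrees` and `outDegrees` DataFrames of a graph.

```scala
object DegreeSketch {
  // A directed edge from src to dst (hypothetical stand-in for one trip)
  final case class Edge(src: String, dst: String)

  // In-degree: number of edges that end at each vertex
  def inDegrees(edges: Seq[Edge]): Map[String, Int] =
    edges.groupBy(_.dst).map { case (v, es) => v -> es.size }

  // Out-degree: number of edges that start at each vertex
  def outDegrees(edges: Seq[Edge]): Map[String, Int] =
    edges.groupBy(_.src).map { case (v, es) => v -> es.size }
}
```

A vertex with no incoming (or outgoing) edges is simply absent from the corresponding map, which matches how GraphFrames omits zero-degree vertices from `inDegrees` and `outDegrees`.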
QUESTION 11 Apply the motif findings.
Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently.
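As an illustration of what a motif query looks for, the sketch below finds one simple pattern, a round trip between two vertices, using plain Scala collections. The `Edge` case class and the station names in the test data are assumptions. In GraphFrames the same pattern would be written as the motif string `(a)-[e1]->(b); (b)-[e2]->(a)` and passed to the graph's `find` method; which motif the original session actually used is not shown on this page.

```scala
object MotifSketch {
  // A directed edge from src to dst (hypothetical stand-in for one trip)
  final case class Edge(src: String, dst: String)

  // Returns every ordered pair (a, b) such that both a -> b and b -> a exist.
  // Parallel edges are collapsed first, so each pair is reported once.
  def roundTrips(edges: Seq[Edge]): Set[(String, String)] = {
    val present = edges.map(e => (e.src, e.dst)).toSet
    present.filter { case (a, b) => a != b && present.contains((b, a)) }
  }
}
```

Each round trip appears in both orientations, (a, b) and (b, a), because the pattern matches once from each side.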
BONUS
1. Vertex degree
- To display the total degree (in-degree plus out-degree) of every vertex, the following code snippet is used. The output follows.
2. What are the most common destinations in the dataset from location to location?
- To display the most common destinations, the following query is used; it returns the top three.
3. What is the station with the highest ratio of in-degrees but fewest out-degrees? In other words, which station acts as almost a pure trip sink, a station where trips end but rarely start?
- To get this result, the in-degree and out-degree tables are first joined, and order by is then applied twice to find the vertex with the maximum in-degree and the minimum out-degree.
4. Save graphs generated to a file.
- The following screenshot depicts how the graphs are stored as files on the local machine.
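The arithmetic behind bonus items 1 to 3 can be sketched with plain Scala collections. This is not the Spark code used in the session: the `Edge` case class and the function names are assumptions, and the ratio in item 3 is computed directly as in-degree divided by out-degree rather than through two order-by steps.

```scala
object BonusSketch {
  // A directed edge from src to dst (hypothetical stand-in for one trip)
  final case class Edge(src: String, dst: String)

  // Item 1: total degree. Every edge contributes one to each of its endpoints.
  def degrees(edges: Seq[Edge]): Map[String, Int] =
    edges.flatMap(e => Seq(e.src, e.dst))
      .groupBy(identity)
      .map { case (v, hits) => v -> hits.size }

  // Item 2: the n most frequent (src, dst) pairs, most frequent first.
  def topRoutes(edges: Seq[Edge], n: Int): Seq[((String, String), Int)] =
    edges.groupBy(e => (e.src, e.dst))
      .map { case (route, es) => route -> es.size }
      .toSeq
      .sortBy { case (route, count) => (-count, route) }
      .take(n)

  // Item 3: in-degree / out-degree per vertex, largest ratio first.
  def sinkRanking(edges: Seq[Edge]): Seq[(String, Double)] = {
    val in  = edges.groupBy(_.dst).map { case (v, es) => v -> es.size }
    val out = edges.groupBy(_.src).map { case (v, es) => v -> es.size }
    // Inner join: only vertices with both incoming and outgoing edges get a ratio.
    in.keySet.intersect(out.keySet).toSeq
      .map(v => v -> in(v).toDouble / out(v))
      .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
  }
}
```

One consequence of the inner join is that a station with no outgoing trips at all never appears in the ranking, even though it would be the purest possible sink.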