MODULE 2 ICP 6 - a190884810/Big-Data-Programming GitHub Wiki

PROBLEM STATEMENT

1.Import the dataset as a csv file and create data framesdirectly on import than create graph out of the data frame created. 2.Triangle Count 3.Find Shortest Paths w.r.t. Landmarks 4.Apply Page Rank algorithm on the dataset. 5.Save graphs generated to a file. Bonus 1.Apply Label Propagation Algorithm 2.Apply BFS algorithm

FEATURES

  • Intellij IDE, scala plugin is used for this in-class programming. GraphFrames are primarily focused in this programming session. This is done by adding appropriate dependencies in build.sbt file. Innovative and interesting algorithms such as pagerank,BFS and shortest path is performed on the given dataset.

DATASETS

CONFIGURATIONS

  • build.sbt file is modified as follows in order to add GraphFrames.

APPROACH

  • QUESTION 1 Import the dataset as a csv file and create data framesdirectly on import than create graph out of the data frame created.

  • For this, A CSV file that is already present in local machine (trips.csv and stations.csv) in this case is loaded and the dataframe is created. The following screeshot depicts the loading and creation of dataframe.

  • QUESTION 2 Triangle Count

  • Triangle count algorithm is widely used for social networking analysis. It is a community detection graph algorithm that is used to determine the number of triangles passing through each node in the graph.

  • Initially a vertex is considered and its neighbour is selected.

  • The selected neighbour neighbours are identified.

  • The intersection of these neighbours list is found out and traingles count is made.

  • QUESTION 3 Find Shortest Paths w.r.t. Landmarks
  • The shortest path is found out by using an in-built function with respect to landmarks. The following is the code snippet.

  • QUESTION 4 Apply Page Rank algorithm on the dataset.
  • Page Rank algorithm is an algorithm used by Google Search to rank websites in their search engine results. The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.

  • QUESTION 5 Save graphs generated to a file.
  • The files which are vertices and edges are saved as follows in two different files in the form of CSV files.

  • BONUS
  • Apply BFS algorithm
  • Breadth-first search is an algorithm for traversing or searching tree or graph data structures. It starts at the tree root, and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level.