ICP 13 - awais546/Big-Data-Programming-Hadoop-Pyspark GitHub Wiki

Big Data Programming Hadoop/Pyspark

GraphFrames

Introduction

This ICP continues the previous ICP-12 on GraphFrames. In this lab we extended that work and implemented several more complex algorithms using GraphFrames.

Tasks

The tasks to perform in this lab are as follows.

1. Import the dataset as CSV files, create DataFrames directly on import, then create a graph from those DataFrames.

To load the CSV files into DataFrames, use the following commands. Forward slashes are used in the paths because a backslash followed by digits is read as an escape sequence in an ordinary Python string.

df_station = sqlContext.read.csv("Datasets/201508_station_Data.csv", inferSchema=True, header=True)

df_trip = sqlContext.read.csv("Datasets/201508_trip_data.csv", inferSchema=True, header=True)
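Windows-style paths are easy to get wrong in Python string literals, because a backslash followed by digits is treated as an octal escape. The following minimal sketch (plain Python, no Spark needed) shows the effect on the station file name used in this lab, and two spellings that avoid it. The variable names are illustrative only.

```python
# In an ordinary string literal, "\201" is an octal escape for a single
# character, so the backslash and the digits "201" silently disappear.
broken_path = "Datasets\201508_station_Data.csv"

# A raw string keeps the backslash and the digits exactly as typed.
raw_path = r"Datasets\201508_station_Data.csv"

# Forward slashes avoid the escaping problem altogether.
portable_path = "Datasets/201508_station_Data.csv"

print("\\" in broken_path)   # the backslash did not survive
print("201508" in raw_path)  # the raw string still contains the date prefix
```

Either the raw string or the forward-slash form can be passed to the CSV reader in place of the escaped literal.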

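To build a GraphFrame from the DataFrames created above, use the following command. GraphFrames expects the vertex DataFrame to have an 'id' column and the edge DataFrame to have 'src' and 'dst' columns.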

g2 = GraphFrame(df_station, df_trip)
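GraphFrames requires an 'id' column in the vertex DataFrame and 'src' and 'dst' columns in the edge DataFrame. A small check like the one below can catch a missing column before the graph is built. It is only a sketch: the helper name and the example column lists are assumptions for illustration, not the actual headers of the lab's CSV files.

```python
# Columns GraphFrames needs in the vertex and edge DataFrames.
REQUIRED_VERTEX_COLUMNS = {"id"}
REQUIRED_EDGE_COLUMNS = {"src", "dst"}


def missing_graph_columns(vertex_columns, edge_columns):
    """Return (missing vertex columns, missing edge columns), each sorted."""
    missing_vertex = sorted(REQUIRED_VERTEX_COLUMNS - set(vertex_columns))
    missing_edge = sorted(REQUIRED_EDGE_COLUMNS - set(edge_columns))
    return missing_vertex, missing_edge


# Hypothetical column names, used only to demonstrate the check.
station_columns = ["station_id", "name", "lat", "long"]
trip_columns = ["Trip ID", "Start Terminal", "End Terminal"]

missing_vertex, missing_edge = missing_graph_columns(station_columns, trip_columns)
print(missing_vertex, missing_edge)
```

With real DataFrames the same check could be called as missing_graph_columns(df_station.columns, df_trip.columns), and any missing column can be produced with withColumnRenamed before constructing the GraphFrame.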

2. Triangle Count

To run the Triangle Count algorithm, use the following command. The result is a DataFrame of vertices with an added count column.

result = g3.triangleCount()
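To make clear what the per-vertex triangle count means, here is a minimal plain-Python sketch on a toy graph. It treats edges as undirected and counts, for each vertex, the pairs of its neighbours that are themselves connected. The function name and the toy edges are illustrative assumptions, and this is not how GraphFrames computes the result internally.

```python
from itertools import combinations


def triangle_counts(edges):
    """Count, for every vertex, the triangles it takes part in."""
    neighbors = {}
    for src, dst in edges:
        if src == dst:
            continue  # self-loops cannot form triangles
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    counts = {}
    for vertex, nbrs in neighbors.items():
        # A triangle exists for each pair of neighbours that share an edge.
        counts[vertex] = sum(
            1 for a, b in combinations(sorted(nbrs), 2) if b in neighbors[a]
        )
    return counts


# Toy graph: a-b-c form a triangle, d hangs off c.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
counts = triangle_counts(toy_edges)
print(counts)
```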

3. Find Shortest Paths w.r.t. Landmarks

To use this algorithm, make sure that the 'id', 'src' and 'dst' columns are all of type string. For example, the 'id' column can be cast as follows.

from pyspark.sql.functions import col

temp1 = df_station.withColumn("id", col("id").cast("string"))

The schema is as follows.

The output is as follows.
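Separately from the lab's own output, the idea behind landmark shortest paths can be sketched in plain Python. The sketch assumes that distances are hop counts along edge direction from each vertex to each landmark, and it computes them with a breadth-first search over reversed edges. The function name, the string vertex ids and the toy edges are illustrative assumptions.

```python
from collections import deque


def landmark_distances(edges, landmarks):
    """Map each vertex to {landmark: hops needed to reach that landmark}."""
    reverse = {}
    vertices = set()
    for src, dst in edges:
        vertices.update((src, dst))
        reverse.setdefault(dst, []).append(src)

    distances = {vertex: {} for vertex in vertices}
    for landmark in landmarks:
        if landmark not in vertices:
            continue
        # Walk edges backwards from the landmark, one hop at a time.
        hops = {landmark: 0}
        queue = deque([landmark])
        while queue:
            current = queue.popleft()
            for previous in reverse.get(current, []):
                if previous not in hops:
                    hops[previous] = hops[current] + 1
                    queue.append(previous)
        for vertex, hop_count in hops.items():
            distances[vertex][landmark] = hop_count
    return distances


# Toy directed graph with string ids: 1 -> 2 -> 3 -> 1, and 4 -> 1.
toy_edges = [("1", "2"), ("2", "3"), ("3", "1"), ("4", "1")]
dist = landmark_distances(toy_edges, ["3"])
print(dist)
```

A vertex that cannot reach a landmark simply has no entry for it in its inner dictionary.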

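4. Apply the PageRank algorithm to the dataset.

Use the following command for PageRank. Here resetProbability is the probability of jumping to a random vertex, and tol is the convergence tolerance at which the iteration stops.

result3 = g.pageRank(resetProbability=0.15, tol=0.01)

PageRank returns a new GraphFrame whose vertices carry a pagerank column and whose edges carry a weight column.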

The vertices are as follows.

The edges are as follows.
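Apart from the lab's results, the roles of resetProbability and tol can be illustrated with a textbook power-iteration PageRank in plain Python. This is a sketch under stated assumptions: ranks are normalised to sum to 1, rank from vertices without outgoing edges is spread evenly, and max_iter is an extra safety limit. GraphFrames' own implementation may scale ranks differently.

```python
def pagerank(edges, reset_probability=0.15, tol=0.01, max_iter=100):
    """Iterate PageRank until no rank changes by more than tol."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {vertex: 1.0 / n for vertex in vertices}
    for _ in range(max_iter):
        # Rank held by vertices with no outgoing edges is shared by everyone.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_ranks = {vertex: base for vertex in vertices}
        for src in vertices:
            for dst in out_links[src]:
                share = ranks[src] / len(out_links[src])
                new_ranks[dst] += (1 - reset_probability) * share
        change = max(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Toy graph: a -> b -> c -> a is a cycle, and d only points into it.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(toy_edges, reset_probability=0.15, tol=0.01)
print(ranks)
```

In this toy graph the vertex d has no incoming edges, so its rank comes only from the reset term.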

5. Save graphs generated to a file.

To save a GraphFrame, we have to save its vertices and edges separately.

g3.vertices.write.parquet('saved_vertices')

g3.edges.write.parquet('saved_edges')
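A saved graph can be rebuilt by reading the two Parquet directories back and passing them to GraphFrame. The helper below is a minimal sketch: the function name is hypothetical, the default paths mirror the names used when saving, and a running SparkSession with the graphframes package available is assumed.

```python
def load_graph(spark, vertices_path="saved_vertices", edges_path="saved_edges"):
    """Rebuild a GraphFrame from separately saved vertices and edges."""
    # Imported here so the helper can be defined without Spark installed;
    # calling it still requires pyspark and graphframes.
    from graphframes import GraphFrame

    vertices = spark.read.parquet(vertices_path)
    edges = spark.read.parquet(edges_path)
    return GraphFrame(vertices, edges)
```

In a session where spark is the active SparkSession, load_graph(spark) would return the restored graph.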

Bonus Questions

1. Apply the Label Propagation Algorithm

Use the following command for label propagation.

result4 = g.labelPropagation(maxIter=5)
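The idea behind label propagation can be sketched in plain Python: every vertex starts with its own label and, in each round, adopts the label that is most common among its neighbours. The sketch assumes undirected edges, synchronous updates, and ties broken by the smallest label so that the result is deterministic. GraphFrames may resolve ties differently, so the labels it assigns can differ.

```python
from collections import Counter


def label_propagation(edges, max_iter=5):
    """Run a fixed number of synchronous label propagation rounds."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    labels = {vertex: vertex for vertex in neighbors}
    for _ in range(max_iter):
        new_labels = {}
        for vertex, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            # Tie-break on the smallest label (an assumption of this sketch).
            new_labels[vertex] = min(
                label for label, count in counts.items() if count == best
            )
        labels = new_labels
    return labels


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
toy_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("f", "d"),
]
labels = label_propagation(toy_edges, max_iter=5)
print(labels)
```

After five rounds the two triangles end up with two different labels, which is the community structure one would expect from this graph.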

2. Apply the BFS algorithm

To use BFS, make sure that the column names referenced in the filter expressions do not contain spaces.
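One way to satisfy this is to replace spaces in the column names before building the graph. The helper below is a minimal sketch in plain Python. The function name and the example column names are assumptions for illustration, not necessarily the headers in the lab's CSV files.

```python
def sanitize_column_names(columns):
    """Trim each name and replace inner spaces with underscores."""
    return [name.strip().replace(" ", "_") for name in columns]


# Hypothetical column names containing spaces.
trip_columns = ["Trip ID", "Start Station", "End Station"]
clean_columns = sanitize_column_names(trip_columns)
print(clean_columns)
```

On a real DataFrame the cleaned names can be applied in one step with df.toDF(*sanitize_column_names(df.columns)), after which the BFS filter expressions can refer to the columns without quoting.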