icp5m2 - gracesyl/big-data-hadoop GitHub Wiki

Big Data Programming GraphX and GraphFrames:

• GraphX is to RDDs as GraphFrames is to DataFrames.
• GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between users).
• GraphFrames are based upon Spark DataFrames.
• GraphX is based upon RDDs.

The PySpark commands are as follows:

1. Import each dataset from its CSV file directly into a DataFrame; the graph is then created from these DataFrames:

    trip_data_df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/201508_trip_data.csv")
    station_data_df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/201508_station_data.csv")
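To make the two reader options above concrete, here is a rough plain-Python analogue (not Spark) of what `header` and `inferSchema` do: the first line supplies the column names, and values are converted from strings to numbers where possible. The sample rows and the `infer` and `read_rows` helpers are hypothetical, and this sketch infers a type per value, whereas Spark infers one type per column.

```python
import csv
import io

# Hypothetical sample; the real station file is not reproduced here.
SAMPLE = "name,lat,long,dockcount\nStation A,37.33,-121.9,27\nStation B,37.78,-122.4,15\n"

def infer(value):
    """Try int, then float, else keep the string (a crude stand-in for inferSchema)."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def read_rows(text):
    # DictReader takes the first line as column names, like option("header", True).
    reader = csv.DictReader(io.StringIO(text))
    return [{key: infer(val) for key, val in row.items()} for row in reader]

rows = read_rows(SAMPLE)
print(rows[0])
```

Without the inference step every value would stay a string, which is also what Spark does when `inferSchema` is left off.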

2. Concatenate the lat and long columns into a single loc column and show the first 10 rows:

    from pyspark.sql.functions import col, concat, lit
    station_data_df.select(concat(col("lat"), lit(" "), col("long")).alias("loc")).show(10, False)
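A small plain-Python sketch (not Spark) of what the concatenation step produces per row. It assumes, as Spark SQL's `concat` is documented to behave, that the result is null when any input is null. The station values and the local `concat` helper are hypothetical.

```python
def concat(*parts):
    """Join values as strings; return None if any part is None."""
    if any(part is None for part in parts):
        return None
    return "".join(str(part) for part in parts)

# Hypothetical rows; the second one has a missing latitude.
stations = [
    {"lat": 37.33, "long": -121.9},
    {"lat": None, "long": -122.4},
]

locs = [concat(s["lat"], " ", s["long"]) for s in stations]
print(locs)  # ['37.33 -121.9', None]
```

If missing coordinates are possible in the data, it may be worth filtering them out before building the loc column.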

3. Remove duplicates (show the distinct dockcount values):

    station_data_df.select("dockcount").distinct().show()

4. Name columns (print the column names):

    print(station_data_df.columns)

5. Output DataFrame:

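6. Create vertices (GraphFrames expects the vertex DataFrame to have an id column):

    v = station_data_df.select(col("name").alias("id"), "lat", "long")

7. Create edges (GraphFrames expects src and dst columns), build the graph, and show some vertices and edges:

    from graphframes import GraphFrame
    e = trip_data_df.select(col("Start Station").alias("src"), col("End Station").alias("dst"), col("Subscriber Type").alias("relationship"))
    # Build the graph from the vertex and edge DataFrames
    g = GraphFrame(v, e)
    g.vertices.show(10, False)
    g.edges.show(10, False)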
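The column renaming in the vertex and edge steps can be sketched in plain Python (not Spark). The trip rows, the `to_edges` helper, and the `dangling` check are hypothetical. The check is only a sanity test one might run on the data, not something GraphFrames performs here: it lists station names that appear in trips but have no matching vertex id.

```python
# Hypothetical trip rows using the column names from the trip data.
trips = [
    {"Start Station": "A", "End Station": "B", "Subscriber Type": "Subscriber"},
    {"Start Station": "B", "End Station": "C", "Subscriber Type": "Customer"},
]
vertices = [{"id": "A"}, {"id": "B"}]

RENAME = {"Start Station": "src", "End Station": "dst", "Subscriber Type": "relationship"}

def to_edges(rows):
    """Keep only the mapped columns and rename them to src, dst, relationship."""
    return [{RENAME[k]: v for k, v in row.items() if k in RENAME} for row in rows]

def dangling(vertex_rows, edge_rows):
    """Return edge endpoints that do not appear as a vertex id."""
    ids = {v["id"] for v in vertex_rows}
    endpoints = {name for e in edge_rows for name in (e["src"], e["dst"])}
    return sorted(endpoints - ids)

edges = to_edges(trips)
missing = dangling(vertices, edges)
print(edges[0])
print(missing)  # ['C']
```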

9. Vertex in-degree:

    g.inDegrees.show(10, False)

10. Vertex out-degree:

    g.outDegrees.show(10, False)

Degree:

    g.degrees.show(10, False)

References:

https://databricks.com/blog/2016/03/03/introducing-graphframes.html