Big_Data_Programming_ICP_3_Module2

Spark SQL

Spark SQL is a relational query engine built on top of Spark Core. It uses a query optimizer called Catalyst. Relational queries can be expressed in SQL or HiveQL and executed against JSON, CSV, and various databases. Spark SQL adds the expressiveness of declarative programming with DataFrames on top of the functional programming model of RDDs.

Part 1:

Task1: Importing a dataset and creating data frames on import.

  • Here we create a DataFrame named 'ds' and load the dataset 'Survey.csv' into it.

  • We show the DataFrame to verify that the dataset loaded properly (see the sketch below).
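A minimal Scala sketch of these two steps, assuming a local Spark 2.x SparkSession and that 'Survey.csv' carries a header row; the later sketches reuse the `spark` and `ds` names defined here:

```scala
import org.apache.spark.sql.SparkSession

// Entry point for Spark SQL (local mode assumed for the ICP)
val spark = SparkSession.builder()
  .appName("BigDataProgramming_ICP3")
  .master("local[*]")
  .getOrCreate()

// Load Survey.csv into a DataFrame named 'ds'
val ds = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark guess the column types
  .csv("Survey.csv")

// Show the first rows to check that the dataset loaded properly
ds.show()
```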

Task2: Saving the data to a file.

  • Here we save the data to a file named 'Survey1.csv', which is stored in HDFS (see the sketch below).
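A sketch of the save step under the same assumptions; with a cluster configuration, a relative path like this resolves inside HDFS:

```scala
// Save the DataFrame as CSV; Spark writes out a directory named 'Survey1.csv'
ds.write
  .option("header", "true")
  .csv("Survey1.csv")
```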

Task3: Checking for duplicate records in the data.

  • We create a new DataFrame named 'dsdub' and initialize it with the distinct records from the DataFrame 'ds'.
  • We call registerTempTable on each DataFrame, which registers it as a temporary table in the SQLContext so that SQL queries can be executed against it.
  • We create a new DataFrame named 'minu' by running a SQL query that removes duplicates and shows the distinct records (see the sketch below).
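A sketch of the de-duplication step. registerTempTable is the Spark 1.x-style call the write-up names (Spark 2.x deprecates it in favor of createOrReplaceTempView), and the exact SQL behind 'minu' is an assumption:

```scala
// Keep only the distinct records from 'ds'
val dsdub = ds.distinct()

// Register both DataFrames as temporary tables for SQL queries
ds.registerTempTable("ds")
dsdub.registerTempTable("dsdub")

// Remove duplicates via SQL and show the distinct records
val minu = spark.sql("SELECT DISTINCT * FROM ds")
minu.show()
```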

Output:

Task4: Apply the Union operation on the dataset and order the output by Country.
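A sketch of the union, assuming the temporary tables registered in Task 3 and a 'Country' column in the survey data:

```scala
// UNION de-duplicates by default; ORDER BY sorts the combined result by Country
val unioned = spark.sql(
  "SELECT * FROM ds UNION SELECT * FROM dsdub ORDER BY Country")
unioned.show()
```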

Output:

Task5: Use GroupBy Query based on treatment.
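A sketch of the grouping; the task names only the grouping column, so COUNT(*) is an assumed aggregate:

```scala
// Count how many survey responses fall under each treatment value
val grouped = spark.sql(
  "SELECT treatment, COUNT(*) AS total FROM ds GROUP BY treatment")
grouped.show()
```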

Output:

Part 2:

Task1: Applying joins and two aggregate functions.

  • Here we perform a left join and an inner join between 'ds' and 'dsdub', and also apply the two aggregate functions MAX and AVG (see the sketch below).
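A sketch of both joins; the join key ('Country') and the aggregated column ('Age') are assumptions, since the write-up names only the join types and the aggregate functions:

```scala
// Left join 'ds' with 'dsdub', applying the MAX and AVG aggregates
val leftJoined = spark.sql(
  """SELECT d.Country, MAX(d.Age) AS max_age, AVG(d.Age) AS avg_age
    |FROM ds d LEFT JOIN dsdub dd ON d.Country = dd.Country
    |GROUP BY d.Country""".stripMargin)
leftJoined.show()

// The inner-join variant differs only in the JOIN keyword
val innerJoined = spark.sql(
  """SELECT d.Country, MAX(d.Age) AS max_age, AVG(d.Age) AS avg_age
    |FROM ds d JOIN dsdub dd ON d.Country = dd.Country
    |GROUP BY d.Country""".stripMargin)
innerJoined.show()
```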

Output:

Task2: Query to fetch the 13th row.
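One way to sketch this in SQL is a ROW_NUMBER window; the ORDER BY column (Country) is an assumption, since the task does not specify an ordering:

```scala
// Number the rows, then keep only row 13
val row13 = spark.sql(
  """SELECT * FROM (
    |  SELECT *, ROW_NUMBER() OVER (ORDER BY Country) AS rn FROM ds
    |) numbered WHERE rn = 13""".stripMargin)
row13.show()
```

Equivalently, `ds.take(13).last` pulls the first 13 rows to the driver and keeps the last one.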

Output:

Bonus:

Procedure:

We write a user-defined function 'parseLine' to split each comma-delimited row, and then create a DataFrame from the parsed rows (see the sketch below).
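A sketch of this bonus, reading the raw file as text and keeping every parsed field as a string; the parseLine signature here is illustrative:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// User-defined function that splits one comma-delimited row into fields
def parseLine(line: String): Row = Row.fromSeq(line.split(",", -1))

// Read the raw file, drop the header, and parse each remaining line
val raw = spark.sparkContext.textFile("Survey.csv")
val header = raw.first()
val rows = raw.filter(_ != header).map(parseLine)

// Build the schema from the header names (all columns kept as strings)
val schema = StructType(
  header.split(",").map(name => StructField(name, StringType, nullable = true)))

// Create the DataFrame from the parsed rows
val parsedDf = spark.createDataFrame(rows, schema)
parsedDf.show()
```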

Output: