ICP 10: DataFrames & SQL in Scala/PySpark

Lesson Overview:

• DataFrames

• Construction of DataFrames

• Spark SQL

• Transformations

• Laziness

• Actions

• Basic commands on DataFrames

• Basic SQL commands on DataFrames


Spark SQL, DataFrames and Datasets Guide

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API/language you use to express the computation. This unification means that developers can easily switch back and forth between APIs based on which provides the most natural way to express a given transformation.
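To make that unification concrete, here is a minimal sketch (runnable in spark-shell, where `spark` and its implicits are predefined) that expresses one query both through SQL and through the DataFrame API; the tiny in-memory dataset is just a stand-in:

```scala
import spark.implicits._

// A tiny in-memory dataset standing in for real input
val people = Seq(("Alice", 34), ("Bob", 45), ("Cathy", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// The same computation expressed through SQL ...
val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")

// ... and through the DataFrame API; both compile to the same plan
val viaApi = people.filter($"age" > 30).select("name")

viaSql.show()
viaApi.show()
```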

In-Class Programming:

Part – 1:

  1. Import the dataset and create a DataFrame directly on import.
  2. Save the data to a file.
  3. Check for duplicate records in the dataset.
  4. Apply a union operation on the dataset and order the output alphabetically by country name.
  5. Use a groupBy query based on treatment.

Please click on the link to reach the source code. Each step is illustrated below.

1. Import the dataset and create a DataFrame directly on import.
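A minimal sketch of this step, assuming a spark-shell session (so `spark` is predefined) and a hypothetical comma-delimited file `survey.csv` with a header row; the file name is a placeholder, not the actual course dataset:

```scala
// Read the CSV straight into a DataFrame, letting Spark infer column types
val df = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // sample the data to guess column types
  .csv("survey.csv")

df.printSchema()
df.show(5)
```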

2. Save the data to a file.
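Continuing with the `df` from step 1, one way to persist it (the output path is a placeholder):

```scala
// Write the DataFrame back out as CSV; "overwrite" replaces output from earlier runs
df.write
  .mode("overwrite")
  .option("header", "true")
  .csv("output/survey_out")
```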

3. Check for duplicate records in the dataset.
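A sketch of a duplicate check: compare the total and distinct row counts, then list the rows that repeat:

```scala
// If the two counts differ, the dataset contains duplicate rows
val total        = df.count()
val withoutDupes = df.distinct().count()
println(s"duplicate rows: ${total - withoutDupes}")

// Show the duplicated rows themselves, grouping over every column
df.groupBy(df.columns.head, df.columns.tail: _*)
  .count()
  .filter($"count" > 1)
  .show()
```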

4. Apply a union operation on the dataset and order the output alphabetically by country name.
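A sketch of the union, assuming a second DataFrame with the same schema (here the frame is unioned with itself purely for illustration) and a column literally named `Country`:

```scala
// union keeps duplicate rows (the counts simply add up);
// chain .distinct() afterwards for set-style semantics
val df2     = df   // stand-in for a second dataset with the same schema
val unioned = df.union(df2).orderBy("Country")
unioned.show()
```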

5. Use a groupBy query based on treatment.
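Assuming the dataset has a `treatment` column, a groupBy that counts the rows for each treatment value:

```scala
// Count how many records fall under each treatment value
df.groupBy("treatment")
  .count()
  .show()
```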

Part – 2:

  1. Apply basic queries involving joins and aggregate functions (at least two).
  2. Write a query to fetch the 13th row of the dataset.

1. Apply basic queries involving joins and aggregate functions (at least two).
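A hedged sketch of both query types, registering the DataFrame as a temporary view so they can be written in plain SQL; the view name `survey` and the `Country` column are assumptions:

```scala
df.createOrReplaceTempView("survey")

// Aggregate: respondents per country, largest first
spark.sql("""
  SELECT Country, COUNT(*) AS respondents
  FROM survey
  GROUP BY Country
  ORDER BY respondents DESC
""").show()

// Join: attach each row's country-level total via a self-join
spark.sql("""
  SELECT s.*, t.respondents
  FROM survey s
  JOIN (SELECT Country, COUNT(*) AS respondents
        FROM survey
        GROUP BY Country) t
    ON s.Country = t.Country
""").show(5)
```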

2. Write a query to fetch the 13th row of the dataset.
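DataFrames carry no intrinsic row order, so "the 13th row" only makes sense once an order is imposed; one common sketch numbers the rows with a window function and keeps row 13:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Number every row under an explicit ordering, then keep only row 13.
// A window with no partition pulls all rows into a single partition,
// which is acceptable for a small dataset like this one.
val w = Window.orderBy(monotonically_increasing_id())
df.withColumn("row_id", row_number().over(w))
  .filter($"row_id" === 13)
  .show()
```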

Part – 3: (bonus)

1. Write a parseLine method to split the comma-delimited rows and create a DataFrame.

Two approaches are implemented here.

Approach-1 (Scala Case Class):

Please click on the link to reach the source code.
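A minimal sketch of the case-class approach, assuming the file is read as a plain text RDD with a header line and a hypothetical three-column layout (country, age, treatment); the parseLine in the linked source may differ:

```scala
import spark.implicits._

// Target shape for each parsed line
case class Record(country: String, age: Int, treatment: String)

// Split one comma-delimited line into a typed Record
def parseLine(line: String): Record = {
  val f = line.split(",")
  Record(f(0), f(1).trim.toInt, f(2))
}

val lines  = spark.sparkContext.textFile("survey.csv")
val header = lines.first()
val parsed = lines.filter(_ != header).map(parseLine).toDF()
parsed.show(5)
```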

Approach-2 (Schema with StructType):

Please click on the link to reach the source code.
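A sketch of the StructType approach, where parseLine produces generic Rows and createDataFrame attaches an explicit schema; the column names and types are the same assumptions as above:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema in place of a case class
val schema = StructType(Seq(
  StructField("country",   StringType,  nullable = true),
  StructField("age",       IntegerType, nullable = true),
  StructField("treatment", StringType,  nullable = true)
))

// Split one comma-delimited line into a Row matching the schema
def parseLine(line: String): Row = {
  val f = line.split(",")
  Row(f(0), f(1).trim.toInt, f(2))
}

val rawLines = spark.sparkContext.textFile("survey.csv")
val header   = rawLines.first()
val rowRdd   = rawLines.filter(_ != header).map(parseLine)
val parsedDf = spark.createDataFrame(rowRdd, schema)
parsedDf.printSchema()
```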

References:

https://spark.apache.org/docs/latest/sql-programming-guide.html

https://stackoverflow.com/questions/51689460/select-specific-columns-from-spark-dataframe

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-aggregate-functions.html

https://stackoverflow.com/questions/29704333/spark-load-csv-file-as-dataframe