Module 2: ICP #3
DataFrames & SQL in Scala/PySpark
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API/language is used to express the computation.
One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
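All of the steps below assume a SparkSession, the entry point to the DataFrame and SQL APIs. A minimal sketch in Scala (the app name and local master here are illustrative choices, not from the original wiki):

```scala
import org.apache.spark.sql.SparkSession

// Entry point for the DataFrame and SQL APIs; runs locally here.
val spark = SparkSession.builder()
  .appName("ICP3-DataFramesAndSQL") // illustrative app name
  .master("local[*]")               // use all local cores
  .getOrCreate()
```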
Part-1:
1. Import the dataset and create a DataFrame directly on import. The dataset is the survey.csv file.
The dataset is read using spark.read with the header option enabled, so the first line supplies the column names, and the data at the specified path is loaded into a DataFrame.
The output is as follows
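A minimal sketch of the import step, assuming survey.csv sits in the working directory (the actual path may differ):

```scala
// Read the CSV with the first line as the header; inferSchema is an
// optional extra that asks Spark to guess column types.
val surveyDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("survey.csv") // path assumed; adjust to the dataset's location

surveyDF.show(5)       // preview the first five rows
surveyDF.printSchema() // inspect the inferred schema
```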
2. Save the data to a file.
The output is as follows. The output file is also written in CSV format and saved in the specified location.
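A sketch of the save step; the output directory name is hypothetical. Note that Spark writes a directory of part files rather than a single CSV file:

```scala
// Write the DataFrame back out as CSV, keeping the header row.
surveyDF.write
  .option("header", "true")
  .mode("overwrite")        // replace output from earlier runs
  .csv("output/survey_out") // hypothetical target directory
```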
3. Check for duplicate records in the dataset.
The output is as follows
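One way to check for duplicates is to compare the total row count with the distinct row count; a sketch:

```scala
import org.apache.spark.sql.functions.col

// Duplicates exist if the distinct count is below the total count.
val total       = surveyDF.count()
val numDistinct = surveyDF.distinct().count()
println(s"Duplicate rows: ${total - numDistinct}")

// List the rows that occur more than once, grouping on every column.
surveyDF.groupBy(surveyDF.columns.map(col): _*)
  .count()
  .filter(col("count") > 1)
  .show()
```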
4. Apply a union operation on the dataset and order the output by CountryName alphabetically.
The output is as follows
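A sketch of the union step. Here the dataset is split in two and recombined; the country column name is an assumption and should be adjusted to match the dataset's actual header:

```scala
import org.apache.spark.sql.functions.col

// Split the data, union the halves back together, and sort by country.
val firstPart  = surveyDF.limit(50)
val secondPart = surveyDF.except(firstPart)

firstPart.union(secondPart)
  .orderBy(col("Country")) // column name assumed; rename to match the header
  .show()
```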
5. Use a groupBy query based on the treatment column.
The output is as follows
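A sketch of the groupBy query, counting respondents per treatment answer:

```scala
import org.apache.spark.sql.functions.col

// Count how many survey respondents gave each treatment answer.
surveyDF.groupBy("treatment")
  .count()
  .orderBy(col("count").desc)
  .show()
```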
Part-2:
1. Apply basic queries involving joins and aggregate functions (at least two).
The output is as follows
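A sketch of two such queries via the SQL interface. The DataFrame is registered as a temporary view first; the column names (Country, Age, treatment) are assumptions about the survey schema:

```scala
// Expose the DataFrame to SQL as a temporary view.
surveyDF.createOrReplaceTempView("survey")

// Join example: a self-join counting respondent pairs per country.
spark.sql("""
  SELECT a.Country, COUNT(*) AS pairs
  FROM survey a JOIN survey b ON a.Country = b.Country
  GROUP BY a.Country
  ORDER BY pairs DESC
""").show()

// Aggregate example: average and maximum age per treatment answer.
spark.sql("""
  SELECT treatment, AVG(Age) AS avg_age, MAX(Age) AS max_age
  FROM survey
  GROUP BY treatment
""").show()
```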
2. Write a query to fetch the 13th row of the dataset.
The output is as follows
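DataFrames carry no inherent row order, so "the 13th row" here means the 13th row in file read order. A sketch of two common approaches:

```scala
// Approach 1: pull the first 13 rows to the driver and keep the last.
val thirteenth = surveyDF.take(13).last
println(thirteenth)

// Approach 2: attach a stable zero-based index through the RDD API.
surveyDF.rdd.zipWithIndex()
  .filter { case (_, idx) => idx == 12 } // the 13th row sits at index 12
  .map { case (row, _) => row }
  .collect()
  .foreach(println)
```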
Part-3: (bonus)
1. Write a parseLine method to split each comma-delimited row and create a DataFrame.
The output is as follows
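A sketch of the bonus step: read the file as plain text, split each line on commas with a parseLine helper, and build a DataFrame from the resulting rows. This naive split assumes no quoted commas in the data, and every column is treated as a string:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Naive splitter; a real CSV parser is needed if fields contain commas.
def parseLine(line: String): Array[String] = line.split(",").map(_.trim)

val raw    = spark.sparkContext.textFile("survey.csv") // path assumed
val header = raw.first()

// Build a schema from the header line, typing every column as a string.
val schema = StructType(parseLine(header).map(StructField(_, StringType, nullable = true)))

// Drop the header line, parse the rest, and assemble the DataFrame.
val rows     = raw.filter(_ != header).map(line => Row(parseLine(line): _*))
val parsedDF = spark.createDataFrame(rows, schema)
parsedDF.show(5)
```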