ICP 10 - a190884810/Big-Data-Programming GitHub Wiki

MODULE 2 ICP 3

PROBLEM STATEMENT

Part–1: 1.Import the dataset and create data frames directly on import. 2.Save data to file. 3.Check for Duplicate records in the dataset. 4.Apply Union operation on the dataset and order the output by CountryName alphabetically. 5.Use Groupby Query based on treatment

Part-2: 1.Apply the basic queries related to Joins and aggregate functions (at least 2) 2.Write a query to fetch 13th Row in the dataset.

FEATURES

Intellij IDE with Scala plugin is used in order to execute SQL Context queries. Various queries using joins,group by operations and many more are performed.

APPROACH

PART-1

A dataset(survey.csv) that is provided is used and data frame is created. The following snippet depicts how the data frame is created and is loaded.

The loaded file is saved(The CSV file) into the local system by using save() function.

The file is successfully loaded into the folder 'saved' in the local machine

Duplicates can be identified by using count(*) > 1 in the having clause. Following is the query screenshot along with output. Fortunately, There are no duplicate records in the data set.

A union operation is performed by initially dividing the data into two sets, One with care_options 'yes' and the other with care_options 'no'. Later a union operation is performed and the dataset is arranged based on country's name. The following shows the snippet of query used.

The following is the screenshot of the output generated.

The following is a query used to perform group by operation by using treatment column.
The following is the output that is generated.

PART-2

To perform join operation, I have imported the same dataset twice. Now, Choosing one column from each table, join operation is performed based on the country column.

The following is the output generated.
Two aggregate functions(Count and Sum) are used on the Age which is a numeric column. The following screenshots portrays queries as well as their outputs.

OUTPUTS

To get one particular row(13), take() function can be used by specifying the number of the row from which the data needs to be fetched.
The following is the output generated.

CONFIGURATIONS

Few modifications has been performed on build.sbt file. Library dependencies have been added. The reason for this is to enable smooth loading and saving of data.