ICP_10 - PallaviArikatla/Big-Data-Programming GitHub Wiki

INTRODUCTION: Working with DataFrames and SQL in Spark.

IMPLEMENTATION:

Question 1:

1) Import the dataset and create data frames directly on import.

  • Import the given dataset and read the contents of the file.

2) Save the data to a file.

  • Read the contents of the file and write them to another file.

  • The output file is created in the specified folder.

3) Check for Duplicate records in the dataset.

  • Check for duplicate records, drop any that are found, and verify the result.

  • The record count after dropping duplicates is as follows.

4) Apply a union operation on the dataset and order the output by country name alphabetically.

  • Create two tables, one for male and one for female respondents, based on the Gender column, then merge them using a union operation.

  • Output:

5) Use a GroupBy query based on the treatment column.

  • Output obtained will be as follows:

Question 2:

1) Apply basic queries related to joins and aggregate functions (at least 2).

Join query:

  • Select columns from both the male and female tables, joined on Country.

"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m join Table_Female f on m.Country = f.Country"

Output:

Aggregate Function:

Calculate the sum of the ages and the count of non-null Gender values for all people in the DataFrame.

"select sum(Age),count(Gender) from Survey"

  • Output:

Left join query:

"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m left join Table_Female f on m.State = f.State"

2) Write a query to fetch the 13th row in the dataset.

  • Identify the 13th row and display its content.

  • Output: