ICP_10 - PallaviArikatla/Big-Data-Programming GitHub Wiki

INTRODUCTION: Working with DataFrames and SQL in Spark.

1) Import the dataset and create data frames directly on import.

2) Save the data to a file.

3) Check for duplicate records in the dataset.

4) Apply a union operation on the dataset and order the output alphabetically by country name.

Create two tables, one for male and one for female respondents, based on the Gender column, then merge them with a union operation.

5) Use a GROUP BY query on the treatment column.
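Steps 3 to 5 can be sketched on a tiny table. This is only an illustration under stated assumptions: it uses Python's standard-library `sqlite3` as a stand-in for Spark SQL so that it runs without a Spark installation, and the sample rows are hypothetical, not taken from the survey dataset. In Spark, comparable query strings would be passed to `spark.sql(...)` after registering the DataFrame with `createOrReplaceTempView`.

```python
import sqlite3

# Hypothetical sample rows (Age, Gender, Country, treatment); not the real survey data.
rows = [
    (37, "Female", "United States", "Yes"),
    (44, "Male", "United States", "No"),
    (32, "Male", "Canada", "No"),
    (31, "Male", "United Kingdom", "Yes"),
    (32, "Male", "Canada", "No"),  # exact duplicate of the Canada row above
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (Age INTEGER, Gender TEXT, Country TEXT, treatment TEXT)")
conn.executemany("INSERT INTO Survey VALUES (?, ?, ?, ?)", rows)

# Step 3: group on every column and keep the groups that occur more than once.
duplicates = conn.execute(
    "SELECT Age, Gender, Country, treatment, COUNT(*) FROM Survey "
    "GROUP BY Age, Gender, Country, treatment HAVING COUNT(*) > 1"
).fetchall()

# Step 4: split by Gender, then recombine and sort by Country.
# UNION ALL keeps duplicate rows, which matches how DataFrame.union behaves in Spark.
conn.execute("CREATE VIEW Table_Male AS SELECT * FROM Survey WHERE Gender = 'Male'")
conn.execute("CREATE VIEW Table_Female AS SELECT * FROM Survey WHERE Gender = 'Female'")
merged = conn.execute(
    "SELECT * FROM Table_Male UNION ALL SELECT * FROM Table_Female ORDER BY Country"
).fetchall()

# Step 5: count respondents per treatment value.
by_treatment = conn.execute(
    "SELECT treatment, COUNT(*) FROM Survey GROUP BY treatment ORDER BY treatment"
).fetchall()

print(duplicates)
print([row[2] for row in merged])
print(by_treatment)
```

A plain SQL `UNION` would drop the repeated Canada row, so the choice between `UNION` and `UNION ALL` matters when the duplicate check in step 3 finds anything.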

1) Apply basic queries related to joins and aggregate functions (at least 2).

Join query:

"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m join Table_Female f on m.Country = f.Country"
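To see what this inner join returns, the query can be run against a few hypothetical rows. The sketch below assumes Python's standard-library `sqlite3` as a stand-in for Spark SQL, and the sample data is invented for illustration. Only the table and column names come from the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("Table_Male", "Table_Female"):
    conn.execute(
        f"CREATE TABLE {name} (Age INTEGER, Gender TEXT, Country TEXT, treatment TEXT)"
    )

# Hypothetical sample rows; not the real survey data.
conn.executemany(
    "INSERT INTO Table_Male VALUES (?, ?, ?, ?)",
    [(44, "Male", "United States", "No"), (32, "Male", "Canada", "No")],
)
conn.executemany(
    "INSERT INTO Table_Female VALUES (?, ?, ?, ?)",
    [(37, "Female", "United States", "Yes"), (29, "Female", "United States", "No")],
)

query = (
    "select m.age, m.Country, m.Gender, m.treatment, f.Gender, f.treatment "
    "from Table_Male m join Table_Female f on m.Country = f.Country"
)
joined = conn.execute(query).fetchall()

# The United States male pairs with both United States females;
# the Canada male has no matching female row and is dropped by the inner join.
for row in joined:
    print(row)
```

Because Country is not unique in either table, the join produces one output row for every male/female pairing within the same country, so the result can be much larger than either input.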

Aggregate function:

Calculate the sum of the ages of all people in the DataFrame, along with the count of non-null Gender values.

"select sum(Age),count(Gender) from Survey"
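A small sketch of how the two aggregates behave, again using the standard-library `sqlite3` as an assumed stand-in for Spark SQL and hypothetical sample rows. One row has a missing Gender to show that `count(Gender)` skips NULL values while `sum(Age)` still includes that row's age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (Age INTEGER, Gender TEXT)")

# Hypothetical sample rows; the last one has no Gender recorded.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?)",
    [(37, "Female"), (44, "Male"), (32, None)],
)

total_age, gender_count = conn.execute(
    "select sum(Age),count(Gender) from Survey"
).fetchone()

print(total_age, gender_count)  # 113 2
```

If the intent is to count all rows regardless of missing values, `count(*)` would be the usual choice instead of `count(Gender)`.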

Left join query:

"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m left join Table_Female f on m.State = f.State"
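The difference from an inner join is that every row of Table_Male is kept, even when no female row shares its State. The sketch below illustrates this with hypothetical rows, assuming the standard-library `sqlite3` as a stand-in for Spark SQL. The State values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("Table_Male", "Table_Female"):
    conn.execute(
        f"CREATE TABLE {name} "
        "(Age INTEGER, Gender TEXT, Country TEXT, State TEXT, treatment TEXT)"
    )

# Hypothetical sample rows; not the real survey data.
conn.executemany(
    "INSERT INTO Table_Male VALUES (?, ?, ?, ?, ?)",
    [
        (44, "Male", "United States", "IN", "No"),
        (32, "Male", "United States", "TX", "No"),
    ],
)
conn.executemany(
    "INSERT INTO Table_Female VALUES (?, ?, ?, ?, ?)",
    [(37, "Female", "United States", "IN", "Yes")],
)

query = (
    "select m.age, m.Country, m.Gender, m.treatment, f.Gender, f.treatment "
    "from Table_Male m left join Table_Female f on m.State = f.State"
)
left_joined = conn.execute(query).fetchall()

# The TX male has no female match, so his f.Gender and f.treatment come back as None (NULL).
for row in left_joined:
    print(row)
```

Rows whose State is NULL on the male side are also kept with NULL female columns, since a NULL never compares equal to anything in the join condition.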

2) Write a query to fetch the 13th row in the dataset.
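One possible approach, sketched under assumptions: number the rows with the `ROW_NUMBER()` window function and filter on position 13. A table has no inherent row order, so the query needs an explicit ordering column. The `row_id` column below is hypothetical, as are the sample rows, and the standard-library `sqlite3` (which needs SQLite 3.25 or newer for window functions) stands in for Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (row_id INTEGER, Age INTEGER)")

# Hypothetical sample data: 20 rows with row_id 1..20 and Age 21..40.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?)",
    [(i, 20 + i) for i in range(1, 21)],
)

# Number the rows by the ordering column, then keep only position 13.
thirteenth = conn.execute(
    "SELECT row_id, Age FROM ("
    "  SELECT row_id, Age, ROW_NUMBER() OVER (ORDER BY row_id) AS rn FROM Survey"
    ") t WHERE rn = 13"
).fetchone()

print(thirteenth)  # (13, 33)
```

Which row counts as the 13th depends entirely on the column chosen in `ORDER BY`, so the ordering should reflect whatever "row order" means for the dataset at hand.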