ICP_10 - PallaviArikatla/Big-Data-Programming GitHub Wiki
INTRODUCTION: Working with DataFrames and SQL in Spark.
IMPLEMENTATION:
Question 1:
1) Import the dataset and create data frames directly on import.
- Import the given dataset and read the contents of the file into a data frame.
2) Save the data to a file.
- Read the content of the file and store it in another file.
- The output file is created in the folder.
3) Check for Duplicate records in the dataset.
- Check for duplicate records, drop any that are found, and verify the result.
- The count of the output after dropping nulls is as follows:
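The duplicate check can be sketched as a comparison between the total row count and the distinct row count. This is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Spark SQL so it runs without a Spark installation; the sample rows and the reduced column set are hypothetical. In Spark, the usual DataFrame counterpart is `dropDuplicates()` followed by `count()`.

```python
import sqlite3

# Hypothetical sample rows; the real survey dataset has more columns.
rows = [
    (37, "Female", "United States", "Yes"),
    (44, "Male", "United States", "No"),
    (37, "Female", "United States", "Yes"),  # exact duplicate of the first row
    (32, "Male", "Canada", "No"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Survey (Age INTEGER, Gender TEXT, Country TEXT, treatment TEXT)"
)
conn.executemany("INSERT INTO Survey VALUES (?, ?, ?, ?)", rows)

# Total rows versus distinct rows: the difference is the number of duplicates.
total = conn.execute("SELECT COUNT(*) FROM Survey").fetchone()[0]
distinct_rows = conn.execute("SELECT DISTINCT * FROM Survey").fetchall()
duplicates = total - len(distinct_rows)

print(f"total={total}, distinct={len(distinct_rows)}, duplicates={duplicates}")
```

If the two counts match, the dataset has no fully duplicated records.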
4) Apply Union operation on the dataset and order the output by CountryName alphabetically.
- Create two tables, one for male and one for female respondents, based on the Gender column, and then merge them with a union operation.
- Output:
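A minimal sketch of the split-and-union step, again with sqlite3 standing in for Spark SQL. The sample rows are hypothetical, and the sketch assumes the Gender column holds exactly `Male` and `Female`. `UNION ALL` is used because Spark's DataFrame `union` also keeps duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Survey (Age INTEGER, Gender TEXT, Country TEXT, treatment TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?, ?, ?)",
    [
        (44, "Male", "United States", "No"),
        (37, "Female", "United States", "Yes"),
        (32, "Male", "Canada", "No"),
        (29, "Female", "Australia", "Yes"),
    ],
)

# Split the survey into one table per gender.
conn.execute("CREATE TABLE Table_Male AS SELECT * FROM Survey WHERE Gender = 'Male'")
conn.execute("CREATE TABLE Table_Female AS SELECT * FROM Survey WHERE Gender = 'Female'")

# Merge the two tables and sort the combined result by country name.
union_rows = conn.execute(
    "SELECT Age, Gender, Country, treatment FROM Table_Male "
    "UNION ALL "
    "SELECT Age, Gender, Country, treatment FROM Table_Female "
    "ORDER BY Country"
).fetchall()

countries = [row[2] for row in union_rows]
print(countries)
```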
5) Use Groupby Query based on treatment.
- Output obtained will be as follows:
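The source does not say which aggregate accompanies the group-by, so the sketch below assumes a simple row count per treatment value. It uses sqlite3 in place of Spark SQL, with hypothetical sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (Age INTEGER, Gender TEXT, treatment TEXT)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?, ?)",
    [
        (44, "Male", "No"),
        (37, "Female", "Yes"),
        (32, "Male", "No"),
        (29, "Female", "Yes"),
        (31, "Male", "No"),
    ],
)

# Assumed aggregate: number of respondents for each treatment value.
counts = dict(
    conn.execute("SELECT treatment, COUNT(*) FROM Survey GROUP BY treatment").fetchall()
)
print(counts)
```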
Question 2:
1) Apply the basic queries related to joins and aggregate functions (at least 2).
Join query:
- Select columns from both the male and female tables, joined on Country.
"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m join Table_Female f on m.Country = f.Country"
Output:
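To show what the inner join returns, the sketch below runs the same query string against two small hypothetical tables in sqlite3 (a stand-in for Spark SQL). Every male row is paired with every female row from the same country, and rows without a match are dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("Table_Male", "Table_Female"):
    conn.execute(
        f"CREATE TABLE {table} (Age INTEGER, Gender TEXT, Country TEXT, treatment TEXT)"
    )

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Table_Male VALUES (?, ?, ?, ?)",
    [
        (44, "Male", "United States", "No"),
        (32, "Male", "Canada", "No"),
        (31, "Male", "United Kingdom", "Yes"),  # no female row for this country
    ],
)
conn.executemany(
    "INSERT INTO Table_Female VALUES (?, ?, ?, ?)",
    [
        (37, "Female", "United States", "Yes"),
        (29, "Female", "United States", "No"),
        (27, "Female", "Canada", "Yes"),
    ],
)

# Query string taken from the write-up.
join_sql = (
    "select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment "
    "from Table_Male m join Table_Female f on m.Country = f.Country"
)
joined = conn.execute(join_sql).fetchall()
for row in joined:
    print(row)
```

With this sample data the United Kingdom row has no partner and does not appear in the result.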
Aggregate Function:
Calculate the sum of the ages of all the people in the data frame, together with a count of the non-null Gender values.
"select sum(Age),count(Gender) from Survey"
- Output:
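A small sketch of the aggregate query on hypothetical data, using sqlite3 in place of Spark SQL. One Gender value is left as NULL to show that `count(Gender)` counts only non-null values, while `sum(Age)` adds up every age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (Age INTEGER, Gender TEXT)")
# Hypothetical sample rows; the third row has no Gender value.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?)",
    [(44, "Male"), (37, "Female"), (32, None), (29, "Female")],
)

# Query string taken from the write-up.
age_sum, gender_count = conn.execute(
    "select sum(Age),count(Gender) from Survey"
).fetchone()
print(age_sum, gender_count)
```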
Join query:
"select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment from Table_Male m left join Table_Female f on m.State = f.State"
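The left join differs from the inner join in that every male row is kept, with NULLs in the female columns when no State matches. The sketch below runs the same query string on hypothetical tables in sqlite3 as a stand-in for Spark SQL. A missing State never matches anything, because NULL is not equal to NULL in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("Table_Male", "Table_Female"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(Age INTEGER, Country TEXT, Gender TEXT, treatment TEXT, State TEXT)"
    )

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Table_Male VALUES (?, ?, ?, ?, ?)",
    [
        (44, "United States", "Male", "No", "IN"),
        (32, "Canada", "Male", "No", None),  # missing State
        (31, "United States", "Male", "Yes", "TX"),
    ],
)
conn.executemany(
    "INSERT INTO Table_Female VALUES (?, ?, ?, ?, ?)",
    [
        (37, "United States", "Female", "Yes", "IN"),
        (29, "United States", "Female", "No", "IL"),
    ],
)

# Query string taken from the write-up.
left_join_sql = (
    "select m.age,m.Country,m.Gender, m.treatment,f.Gender,f.treatment "
    "from Table_Male m left join Table_Female f on m.State = f.State"
)
left_joined = conn.execute(left_join_sql).fetchall()

# Male rows with no matching female row carry None in the female columns.
unmatched = [row for row in left_joined if row[4] is None]
print(len(left_joined), len(unmatched))
```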
2) Write a query to fetch the 13th row in the dataset.
- Identifies the content of the 13th row and displays it.
- Output:
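The write-up does not show the query used, so the following is only one possible approach: sort by an ordering column and skip the first 12 rows. The `id` column is a hypothetical addition, needed because a table has no guaranteed row order without one. The sketch uses sqlite3 in place of Spark SQL. Recent Spark versions accept an `OFFSET` clause as well; where it is not available, a `row_number()` window is a common alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (id INTEGER, Age INTEGER)")
# Hypothetical data: 20 rows with ids 1..20.
conn.executemany(
    "INSERT INTO Survey VALUES (?, ?)",
    [(i, 20 + i) for i in range(1, 21)],
)

# Skip the first 12 rows in id order, then take one row: the 13th.
row_13 = conn.execute(
    "SELECT * FROM Survey ORDER BY id LIMIT 1 OFFSET 12"
).fetchone()
print(row_13)
```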