M2 ICP 3 - PavankumarManchala/BigDataProgrammingICPs GitHub Wiki

Submitted By:

Pavankumar Manchala

Class Id: 16

Tasks:

Task 1:

1. Import the dataset and create data frames directly on import.
2. Save data to file.
3. Check for duplicate records in the dataset.
4. Apply Union operation on the dataset and order the output by CountryName alphabetically.
5. Use Groupby Query based on treatment.

---> Imported the Survey.csv dataset and created data frames directly on import, then saved the data to an output file.

---> Checked the dataset for duplicate records and removed them using the dropDuplicates method.
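
The code for this step is not included in the report. As a rough plain-Python illustration of the same logic (not the Spark code used in the ICP), the sketch below reads CSV text into records, keeps one copy of each fully identical record, and writes the result back out. The sample rows, the column names, and the helper names `read_rows`, `drop_duplicates`, and `write_rows` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample standing in for Survey.csv (column names are assumed).
SAMPLE_CSV = """Timestamp,Age,Country,state,treatment
2014-08-27 11:29:31,37,United States,IL,Yes
2014-08-27 11:29:37,44,United States,IN,No
2014-08-27 11:29:37,44,United States,IN,No
2014-08-27 11:29:44,32,Canada,NA,No
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def write_rows(rows, handle):
    """Write records back out as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

rows = read_rows(SAMPLE_CSV)
unique_rows = drop_duplicates(rows)
out = io.StringIO()  # stands in for the output file
write_rows(unique_rows, out)
print(len(rows), len(unique_rows))
```

Comparing the count before and after deduplication, as the last line does, is a quick way to confirm whether any duplicates were present.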

Output of data set:

---> Retrieved some records of the data frame into two separate data frames to perform the Union operation.

---> Counted the total number of records present in the data set.
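
To make the union step concrete, here is a small plain-Python sketch, assuming two hypothetical subsets with a `Country` column. It appends one subset to the other, orders the result alphabetically by country, and counts the records. Like Spark's DataFrame `union`, this keeps every row from both inputs and does not remove duplicates. The values are illustrative only.

```python
# Two hypothetical subsets of the survey data (values are illustrative only).
first_part = [
    {"Country": "United States", "Age": 37},
    {"Country": "Canada", "Age": 32},
]
second_part = [
    {"Country": "Australia", "Age": 29},
    {"Country": "United Kingdom", "Age": 41},
]

# Union: every row from both inputs, duplicates retained.
combined = first_part + second_part

# Order the combined rows alphabetically by country name.
ordered = sorted(combined, key=lambda row: row["Country"])

# Total number of records after the union.
total = len(ordered)

for row in ordered:
    print(row["Country"], row["Age"])
print("total:", total)
```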

The output is here:

---> Displayed the number of records from each state using a GroupBy query on the state field.
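
A group-by count of this kind can be sketched in plain Python as follows. The records and state codes are hypothetical, and `Counter` stands in for the GroupBy query followed by a count.

```python
from collections import Counter

# Hypothetical records; only the state field matters for this query.
records = [
    {"state": "IL"},
    {"state": "IN"},
    {"state": "IL"},
    {"state": "TX"},
]

# Group by state and count the records in each group.
counts = Counter(record["state"] for record in records)

for state, n in sorted(counts.items()):
    print(state, n)
```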

Output:

Task 2:

1. Apply the basic queries related to Joins and aggregate functions (at least 2).
2. Write a query to fetch the 13th row in the dataset.

---> Performed the Join operation by loading the data into two data frames and joining them on the Timestamp field.
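
The shape of such a join can be sketched with Python's built-in `sqlite3` module, used here only as a stand-in SQL engine. The table names `left_part` and `right_part`, their columns, and the rows are assumptions for the example. The query is an inner join, so only timestamps present in both tables appear in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables sharing a Timestamp column.
conn.execute("CREATE TABLE left_part (Timestamp TEXT, Age INTEGER)")
conn.execute("CREATE TABLE right_part (Timestamp TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO left_part VALUES (?, ?)",
    [
        ("2014-08-27 11:29:31", 37),
        ("2014-08-27 11:29:37", 44),
        ("2014-08-27 11:29:44", 32),
    ],
)
conn.executemany(
    "INSERT INTO right_part VALUES (?, ?)",
    [
        ("2014-08-27 11:29:31", "United States"),
        ("2014-08-27 11:29:44", "Canada"),
        ("2014-08-27 11:30:22", "United Kingdom"),
    ],
)

# Inner join on Timestamp: rows without a match on both sides are dropped.
joined = conn.execute(
    "SELECT l.Timestamp, l.Age, r.Country "
    "FROM left_part AS l JOIN right_part AS r ON l.Timestamp = r.Timestamp "
    "ORDER BY l.Timestamp"
).fetchall()

for row in joined:
    print(row)
```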

Output:

---> Displayed the maximum and average values of the Age column using SQL aggregate functions.

---> Displayed the 13th row in the data.
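
The exact queries are not shown in the report. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine to illustrate both steps: `MAX` and `AVG` over an `Age` column, and fetching the 13th row by skipping the first 12. The table name `survey` and the ages are hypothetical. Note that "the 13th row" is only well defined relative to an explicit ordering (here SQLite's insertion `rowid`); in Spark, row order is not guaranteed unless one is imposed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (Age INTEGER, Country TEXT)")

# Hypothetical ages; 15 rows so that a 13th row exists.
ages = [37, 44, 32, 31, 31, 33, 35, 39, 42, 23, 31, 29, 42, 36, 27]
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(age, "United States") for age in ages],
)

# Aggregate functions: maximum and average of the Age column.
max_age, avg_age = conn.execute(
    "SELECT MAX(Age), AVG(Age) FROM survey"
).fetchone()

# 13th row: order by insertion rowid, skip 12 rows, take 1.
row_13 = conn.execute(
    "SELECT rowid, Age, Country FROM survey ORDER BY rowid LIMIT 1 OFFSET 12"
).fetchone()

print(max_age, avg_age)
print(row_13)
```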

Output:

Bonus:

Write a parseLine method to split the comma-delimited row and create a Data frame.

---> Defined a parseLine method that splits each row into fields on the comma separator and converts the integer values to strings using the .toString() method.

The dataset was mapped through this method and converted to a data frame, and the values were plotted.
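
A parse method of this kind might look roughly like the Python sketch below. It is not the parseLine code from the ICP: the name `parse_line`, the `SurveyRow` type, and the choice of three fields (timestamp, age, country) are assumptions. The age is parsed as an integer and then converted back to a string with `str()`, mirroring the `.toString()` step described above.

```python
from typing import NamedTuple

class SurveyRow(NamedTuple):
    """Hypothetical record type holding three sample fields."""
    timestamp: str
    age: str
    country: str

def parse_line(line):
    """Split a comma-delimited row and return the sample fields as strings."""
    fields = line.strip().split(",")
    age = int(fields[1])  # the age field is assumed to be an integer
    return SurveyRow(fields[0], str(age), fields[2])

# Hypothetical input lines, mapped one by one into records.
lines = [
    "2014-08-27 11:29:31,37,United States\n",
    "2014-08-27 11:29:44,32,Canada\n",
]
parsed = [parse_line(line) for line in lines]

for row in parsed:
    print(row.timestamp, row.age, row.country)
```

A plain split on commas breaks on quoted fields that themselves contain commas, so this only suits rows without such fields.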

Output:

In the above method, I displayed the sample fields of the data set.