ICP 10 - manaswinivedula/Big-Data-Programming GitHub Wiki

Updated dependencies

These are the updated dependencies in the build.sbt file
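The dependency listing itself is not reproduced on this page. As a rough sketch only, a build.sbt for a Spark SQL project of this kind might look like the following; the project name and every version number are assumptions, not the values used in this ICP.

```scala
// Hypothetical build.sbt; the name and all versions below are assumed, not taken from this ICP.
name := "icp10"
version := "0.1"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "3.5.1",
  "org.apache.spark" %% "spark-sql"   % "3.5.1", // DataFrame API
  "org.apache.spark" %% "spark-mllib" % "3.5.1"  // only needed for StringIndexer-style encoding
)
```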

Input file

The following is the input file that has been used for this ICP

Task 1

Importing the dataset, creating a DataFrame directly on import, and displaying the first 20 rows.

The following is the output
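A minimal sketch of how such an import might look with Spark's Scala API. The object name, the `load` helper, and the sample columns (`age`, `gender`, `country`) are assumptions for illustration; a small temporary CSV stands in for the real input file.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ImportDataset {
  // Reads a CSV that has a header row straight into a DataFrame, inferring column types.
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ImportDataset").master("local[*]").getOrCreate()

    // Stand-in for the real input file (hypothetical columns and values).
    val sample = Files.createTempFile("survey", ".csv")
    Files.write(sample, "age,gender,country\n37,Female,United States\n44,Male,Canada\n".getBytes("UTF-8"))

    val df = load(spark, sample.toString)
    df.printSchema()
    df.show(20) // show() already defaults to 20 rows; the argument just makes that explicit

    spark.stop()
  }
}
```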

Saving the data to the output directory.

The following is the output
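One way the save step could be written, as a sketch. The CSV format, the overwrite mode, and the sample rows are assumptions; only the directory name `output` comes from the description above.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SaveDataset {
  // Writes the DataFrame as CSV part files under `dir`, replacing any earlier run.
  def save(df: DataFrame, dir: String): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .csv(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SaveDataset").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the imported dataset.
    val df = Seq((37, "Female", "United States"), (44, "Male", "Canada"))
      .toDF("age", "gender", "country")

    save(df, "output")
    spark.stop()
  }
}
```

Note that Spark creates `output` as a directory containing one or more part files rather than a single CSV file.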

Displaying the duplicate records. The total record count is first compared with the distinct record count, and the duplicate records are then displayed if any exist.

The following is the output for the above query
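The duplicate check described above can be sketched as follows. Grouping on every column and keeping groups that occur more than once is one common approach and is an assumption here, as are the sample rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DuplicateRecords {
  // Groups on all columns and keeps only the rows that occur more than once.
  def duplicates(df: DataFrame): DataFrame =
    df.groupBy(df.columns.map(col): _*)
      .count()
      .filter(col("count") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DuplicateRecords").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data containing one repeated row.
    val df = Seq((37, "Female", "US"), (44, "Male", "UK"), (37, "Female", "US"))
      .toDF("age", "gender", "country")

    val total = df.count()
    val distinct = df.distinct().count()
    println(s"total = $total, distinct = $distinct")

    // Duplicates exist only when the two counts differ.
    if (total > distinct) duplicates(df).show()

    spark.stop()
  }
}
```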

Dividing the benefits column into two separate columns, performing a union on them, and ordering the result by the country column.

The following is the output of the above query
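The description leaves some room for interpretation. One plausible reading, sketched here, is that the rows are split into two DataFrames by the value of the benefits column, recombined with `union`, and sorted by country. The `Yes`/`No` values, column names, and sample rows are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object BenefitsUnion {
  // Splits rows by benefits value, unions the two halves, then sorts by country.
  def splitUnionOrder(df: DataFrame): DataFrame = {
    val withBenefits = df.filter(col("benefits") === "Yes")
    val withoutBenefits = df.filter(col("benefits") === "No")
    withBenefits.union(withoutBenefits).orderBy(col("country"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BenefitsUnion").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((37, "United States", "Yes"), (44, "Canada", "No"), (32, "Australia", "Yes"))
      .toDF("age", "country", "benefits")

    splitUnionOrder(df).show()
    spark.stop()
  }
}
```

In this reading, rows whose benefits value is neither `Yes` nor `No` are dropped by the two filters.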

Performing a group-by operation on the treatment column.

The following is the output of the above query
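A group-by on treatment might look like the sketch below. Counting the rows in each group is an assumed choice of aggregate, and the sample values are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByTreatment {
  // Counts how many rows fall under each treatment value.
  def treatmentCounts(df: DataFrame): DataFrame =
    df.groupBy("treatment").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupByTreatment").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((37, "Yes"), (44, "No"), (32, "Yes")).toDF("age", "treatment")

    treatmentCounts(df).show()
    spark.stop()
  }
}
```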

Task 2

Performing an inner join operation between two columns, matching on age.

The following is the output of the above query.
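A sketch of an inner join keyed on age. It assumes the two sides are projections of the same dataset (age with country, and age with gender); the actual inputs to the join are not stated above, so treat the names and data as hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InnerJoinOnAge {
  // Joins (age, country) with (age, gender), keeping only ages present on both sides.
  def joinOnAge(df: DataFrame): DataFrame = {
    val left = df.select("age", "country")
    val right = df.select("age", "gender")
    left.join(right, Seq("age"), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InnerJoinOnAge").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; two respondents share the same age.
    val df = Seq((37, "US", "Female"), (37, "UK", "Male"), (44, "CA", "Male"))
      .toDF("age", "country", "gender")

    joinOnAge(df).show()
    spark.stop()
  }
}
```

Because age is not unique, every pair of rows sharing an age appears in the result, so the joined output can contain more rows than the input.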

1.1 Performing an order-by on country according to age

The following is the output
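The wording of this step is ambiguous. One possible reading, sketched here, is that the country and age columns are selected and the rows sorted by ascending age; the sort direction and sample rows are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object OrderCountryByAge {
  // Lists each row's country alongside its age, youngest first.
  def countriesByAge(df: DataFrame): DataFrame =
    df.select("country", "age").orderBy(col("age"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("OrderCountryByAge").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((44, "Canada"), (32, "Australia"), (37, "United States")).toDF("age", "country")

    countriesByAge(df).show()
    spark.stop()
  }
}
```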

1.2 Performing a count on the country column and computing the average age for each country

The following is the output for performing count on countries

The following is the output for the average age for each country
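Both aggregates can be computed in a single pass, as in this sketch. The output column names `respondents` and `avg_age` and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, count, lit}

object CountryStats {
  // Per country: number of rows and mean age.
  def perCountry(df: DataFrame): DataFrame =
    df.groupBy("country")
      .agg(count(lit(1)).as("respondents"), avg("age").as("avg_age"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CountryStats").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((30, "Canada"), (40, "Canada"), (37, "United States")).toDF("age", "country")

    perCountry(df).show()
    spark.stop()
  }
}
```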

Displaying the 13th row in the dataset

The following is the output
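One simple way to fetch the 13th row is to take the first 13 rows and keep the last one, as sketched below. The helper name and sample data are assumptions. A DataFrame has no guaranteed row order unless it is sorted, so "13th" here means the 13th row in whatever order Spark returns them.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ThirteenthRow {
  // Returns the n-th row (1-based), or None when the DataFrame has fewer than n rows.
  def nthRow(df: DataFrame, n: Int): Option[Row] = {
    require(n >= 1, "n is 1-based")
    val firstN = df.take(n)
    if (firstN.length == n) Some(firstN.last) else None
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ThirteenthRow").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: 20 rows with ids 1 to 20.
    val df = (1 to 20).toDF("id")

    println(nthRow(df, 13))
    spark.stop()
  }
}
```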

Bonus

Parsing the file by splitting each line on the "," delimiter with the ParseLine method, and displaying the output

The following is the output
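The ParseLine method itself is not shown on this page. The sketch below is a hypothetical stand-in that splits each line on "," and maps three assumed fields (age, gender, country) into a case class; the real method may parse different fields.

```scala
import org.apache.spark.sql.SparkSession

object ParseLineDemo {
  // Assumed record layout: age,gender,country
  case class Respondent(age: Int, gender: String, country: String)

  // Splits one CSV line on "," (limit -1 keeps trailing empty fields).
  def parseLine(line: String): Respondent = {
    val fields = line.split(",", -1).map(_.trim)
    Respondent(fields(0).toInt, fields(1), fields(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParseLineDemo").master("local[*]").getOrCreate()

    // Hypothetical header-less lines standing in for the input file.
    val lines = spark.sparkContext.parallelize(Seq("37,Female,United States", "44,Male,Canada"))

    lines.map(parseLine).collect().foreach(println)
    spark.stop()
  }
}
```

A plain split like this does not handle quoted fields that contain commas; for such data, Spark's built-in CSV reader is the safer choice.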

Performing label encoding on the gender column.

The following is the output
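Label encoding can be done in several ways. The sketch below uses Spark ML's `StringIndexer`, which is an assumption about the method rather than what this ICP necessarily used; it also requires the spark-mllib dependency. The output column name `gender_index` is hypothetical.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object GenderLabelEncoding {
  // Maps each distinct gender string to a numeric index (most frequent label gets 0.0 by default).
  def encodeGender(df: DataFrame): DataFrame =
    new StringIndexer()
      .setInputCol("gender")
      .setOutputCol("gender_index")
      .fit(df)
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GenderLabelEncoding").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((37, "Male"), (44, "Male"), (32, "Female")).toDF("age", "gender")

    encodeGender(df).show()
    spark.stop()
  }
}
```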

Computing the correlation between age and the label-encoded gender column

The following is the output

Computing the covariance between age and the label-encoded gender column

The following is the output
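Both statistics are available through `DataFrameStatFunctions`, reached via `df.stat`. The sketch below assumes the gender column has already been encoded into a numeric column named `gender_index` (a hypothetical name) and uses made-up sample values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AgeGenderStats {
  // Pearson correlation coefficient between the two numeric columns.
  def correlation(df: DataFrame): Double = df.stat.corr("age", "gender_index")

  // Sample covariance between the two numeric columns.
  def covariance(df: DataFrame): Double = df.stat.cov("age", "gender_index")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AgeGenderStats").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical rows: age and an already label-encoded gender.
    val df = Seq((25, 0.0), (31, 1.0), (44, 0.0), (52, 1.0)).toDF("age", "gender_index")

    println(s"correlation = ${correlation(df)}")
    println(s"covariance  = ${covariance(df)}")
    spark.stop()
  }
}
```

Since the encoded gender values are arbitrary category indices, both numbers depend on which label was assigned which index and should be interpreted with care.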

References:

1. http://spark.apache.org/docs/latest/sql-programming-guide.html

2. http://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameStatFunctions.html

3. https://spark.apache.org/docs/0.9.1/scala-programming-guide.html