ICP 10 - manaswinivedula/Big-Data-Programming GitHub Wiki
Updated dependencies
- These are the updated dependencies in the build.sbt file
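The dependency list itself is not reproduced on this page. As a rough sketch only, a build.sbt for a Spark DataFrame project usually needs spark-core and spark-sql. The project name and every version number below are assumptions, not values taken from the original file.

```scala
// build.sbt (sketch; name and all versions are assumed, not taken from the original project)
name := "icp10"

scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.2", // RDD API and SparkContext
  "org.apache.spark" %% "spark-sql"  % "3.3.2"  // SparkSession and the DataFrame API
)
```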
Input file
The following is the input file that has been used for this ICP
Task 1
- Importing the dataset, creating a DataFrame directly on import, and displaying the first 20 rows.
The following is the output
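A minimal sketch of how this import step could look is given here. It assumes the input is a CSV file with a header row and that the SparkSession API is available; the object name `LoadSurvey` and the path argument are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadSurvey {
  // Reads a CSV file with a header row straight into a DataFrame.
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark infer column types, e.g. a numeric age
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ICP10").master("local[*]").getOrCreate()
    val df = load(spark, args(0)) // path to the input CSV, supplied by the caller
    df.show()                     // show() prints the first 20 rows by default
    spark.stop()
  }
}
```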
- Saving the data into the output directory.
The following is the output
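The save step could be sketched as follows, assuming the data is written back as CSV. The helper name `SaveSurvey.save` is hypothetical, and the overwrite mode is an assumption added so the sketch can be re-run.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object SaveSurvey {
  // Writes the DataFrame as CSV part files inside the given directory.
  def save(df: DataFrame, dir: String): Unit =
    df.write
      .mode(SaveMode.Overwrite) // assumed: replace the directory if it already exists
      .option("header", "true") // keep the column names in each part file
      .csv(dir)
}
```

Note that Spark creates a directory containing one or more part files rather than a single CSV file.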
- Displaying the duplicate records. First the total record count is compared with the distinct record count, and then the duplicate records are displayed if any exist.
The following is the output for the above query
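The duplicate check described above can be sketched roughly like this. The object and method names are hypothetical, and the sketch assumes the DataFrame has no column already named `count`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Duplicates {
  // Total rows minus distinct rows: how many rows are repeats of an earlier row.
  def duplicateCount(df: DataFrame): Long =
    df.count() - df.distinct().count()

  // One line per distinct row that occurs more than once, with its occurrence count.
  def duplicateRows(df: DataFrame): DataFrame =
    df.groupBy(df.columns.map(col): _*) // group on every column, i.e. on the whole row
      .count()
      .filter(col("count") > 1)
}
```

If `duplicateCount` returns 0, the dataset has no exact duplicate rows and `duplicateRows` is empty.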
- Dividing the benefits column into two separate columns, performing a union on them, and ordering the result by the country column.
The following is the output of the above query
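One possible reading of this step is sketched here: the rows are split into two DataFrames by the value of the benefits column, the two are unioned, and the result is ordered by country. The split values "Yes" and "No", the exact column names, and the object name are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object UnionByBenefits {
  def unionOrdered(df: DataFrame): DataFrame = {
    val withBenefits    = df.filter(col("benefits") === "Yes") // assumed value
    val withoutBenefits = df.filter(col("benefits") === "No")  // assumed value
    // union matches columns by position, so both sides must share the same schema
    withBenefits.union(withoutBenefits).orderBy(col("Country"))
  }
}
```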
- Performing a group-by operation on the treatment column.
The following is the output of the above query
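A group-by on the treatment column could look like the sketch below. Counting the rows in each group is an assumption, since the page does not say which aggregate was applied; the object name is hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object TreatmentGroups {
  // Number of records for each distinct value of the treatment column.
  def countByTreatment(df: DataFrame): DataFrame =
    df.groupBy("treatment").count()
}
```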
Task 2
- Performing an inner join operation on two columns according to their age.
The following is the output of the above query.
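As a hedged sketch, the join could be read as two DataFrames, each holding one column next to the age, joined on the age value. The column name `Age` and the object name are assumptions.

```scala
import org.apache.spark.sql.DataFrame

object AgeJoin {
  // Inner join on the shared Age column; Age appears only once in the result.
  def joinOnAge(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("Age"), "inner")
}
```

For example, `AgeJoin.joinOnAge(df.select("Age", "Country"), df.select("Age", "Gender"))` would pair every country with every gender recorded for the same age.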
1.1 Performing an order by on country according to the age
The following is the output
1.2 Performing a count on the country column and computing the average age for each country
The following is the output for performing count on countries
The following is the output for the average age for each country
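Steps 1.1 and 1.2 can be sketched as below. Ordering by country first and then by age is one interpretation of step 1.1; the column names `Country` and `Age`, the alias `avg_age`, and the object name are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col}

object CountryStats {
  // 1.1: rows sorted by country, then by age within each country (assumed reading).
  def orderedByCountryAndAge(df: DataFrame): DataFrame =
    df.orderBy(col("Country"), col("Age"))

  // 1.2: number of records per country.
  def countPerCountry(df: DataFrame): DataFrame =
    df.groupBy("Country").count()

  // 1.2: average age per country.
  def averageAgePerCountry(df: DataFrame): DataFrame =
    df.groupBy("Country").agg(avg("Age").as("avg_age"))
}
```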
- Displaying the 13th row in the dataset
The following is the output
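Fetching the 13th row could be done along these lines. The helper `nthRow` is hypothetical, and it relies on the row order that Spark returns, which is only stable when the data has not been shuffled.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object RowLookup {
  // Returns the n-th row (1-based), or None if the DataFrame has fewer than n rows.
  def nthRow(df: DataFrame, n: Int): Option[Row] = {
    val firstN = df.take(n) // brings only the first n rows to the driver
    if (n >= 1 && firstN.length == n) Some(firstN(n - 1)) else None
  }
}
```

`RowLookup.nthRow(df, 13)` would then give the 13th row, if the dataset has at least 13 rows.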
Bonus
- Parsing the file by splitting each line on the "," delimiter using the ParseLine method and displaying the output.
The following is the output
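The body of the ParseLine method is not shown on this page, so the following is only a sketch of a comma-based line parser applied to a text file through the RDD API. The names `parseLine` and `parseFile` are hypothetical, and the naive split does not handle commas inside quoted fields.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ParseCsv {
  // Splits one line on "," and trims each field; limit -1 keeps trailing empty fields.
  def parseLine(line: String): Array[String] =
    line.split(",", -1).map(_.trim)

  // Applies parseLine to every line of a text file.
  def parseFile(spark: SparkSession, path: String): RDD[Array[String]] =
    spark.sparkContext.textFile(path).map(parseLine)
}
```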
- Performing label encoding on the gender column.
The following is the output
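Label encoding can be sketched with plain DataFrame functions as shown here. This hand-rolled version assigns integer codes in alphabetical order of the distinct values, which may differ from the encoding used in the original code; the helper name and the `_label` suffix are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

object LabelEncoder {
  // Adds "<column>_label": an integer code per distinct non-null value, null for null input.
  def labelEncode(df: DataFrame, column: String): DataFrame = {
    val labels = df.select(column).na.drop().distinct()
      .collect().map(_.getString(0)).sorted // alphabetical order fixes the codes
    val encoded = labels.zipWithIndex.foldLeft(lit(null).cast("int")) {
      case (acc, (label, idx)) => when(col(column) === label, idx).otherwise(acc)
    }
    df.withColumn(column + "_label", encoded)
  }
}
```

This collects the distinct values to the driver, so it is only suitable for columns with a small number of categories.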
- Computing the correlation between the age column and the label-encoded gender column
The following is the output
- Computing the covariance between the age column and the label-encoded gender column
The following is the output
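Both statistics are available through `DataFrame.stat`. The sketch below assumes both columns are numeric; the helper name and the column names passed to it are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object AgeGenderStats {
  // Returns (Pearson correlation, sample covariance) of two numeric columns.
  def corrAndCov(df: DataFrame, ageCol: String, labelCol: String): (Double, Double) =
    (df.stat.corr(ageCol, labelCol), df.stat.cov(ageCol, labelCol))
}
```

A correlation close to 0 would suggest that age and the encoded gender value are not linearly related.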