ICP2_3 - Hiresh12/Big-Data-Programming GitHub Wiki
DataFrames & SQL in Scala/PySpark
Task:
To write Spark SQL queries to import and export data, and to perform some manipulations on the data.
Features:
- Spark
- Python
- Jupyter Notebook
Tasks:
Part 1:
1. Import the dataset and create a DataFrame directly on import.
2. Save the data to a file.
3. Check for duplicate records in the dataset.
4. Apply a union operation on the dataset and order the output alphabetically by CountryName.
5. Use a group-by query based on treatment.
Part 2:
1. Apply basic queries involving joins and aggregate functions (at least two).
2. Write a query to fetch the 13th row of the dataset.
Part 3 (bonus):
1. Write a parseLine method to split each comma-delimited row and create a DataFrame.
References:
- https://stackoverflow.com/questions/51689460/select-specific-columns-from-spark-dataframe
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-aggregate-functions.html