ICP2_3

DataFrames & SQL in Scala/PySpark

Task:

To write Spark SQL queries that import and export data and perform some manipulations on the data.

Features:

  • Spark
  • Python
  • Jupyter Notebook

Tasks:

Part 1:

1. Import the dataset and create DataFrames directly on import.
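
A minimal PySpark sketch of the import, assuming the survey data sits in a local file named survey.csv with a header row (the actual path may differ):

```python
from pyspark.sql import SparkSession

# Start a Spark session for the notebook.
spark = SparkSession.builder.appName("ICP2_3").getOrCreate()

# Read the CSV straight into a DataFrame; header=True uses the first row
# as column names and inferSchema=True guesses the column types.
survey_df = spark.read.csv("survey.csv", header=True, inferSchema=True)
survey_df.printSchema()
survey_df.show(5)
```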

2. Save the data to a file (a write sketch follows the output link below).

Output File:

https://github.com/Hiresh12/Big-Data-Programming/blob/master/ICP10/Documents/surveycopy1.csv/part-00000-a1359f40-a235-479b-a622-88e0ad6f0528-c000.csv
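
A sketch of the write step that could produce a part file like the one linked above, reusing survey_df from the import sketch:

```python
# Spark writes a directory (here "surveycopy1.csv") containing one
# part-0000* file per partition, plus a _SUCCESS marker.
survey_df.write.csv("surveycopy1.csv", header=True, mode="overwrite")
```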

3. Check for duplicate records in the dataset.
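
One way to check for duplicates is to compare the total row count with the distinct row count; a sketch, again assuming survey_df:

```python
from pyspark.sql.functions import count

# If the two counts match, there are no fully duplicated rows.
total_rows = survey_df.count()
distinct_rows = survey_df.distinct().count()
print("Duplicate rows:", total_rows - distinct_rows)

# To see which rows repeat, group on every column and keep groups
# that occur more than once.
survey_df.groupBy(survey_df.columns).agg(count("*").alias("cnt")) \
         .filter("cnt > 1").show()
```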

4. Apply a union operation on the dataset and order the output by CountryName alphabetically.
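
A sketch of the union and sort, assuming a second DataFrame with the same schema (the DataFrame is unioned with itself here purely to illustrate) and assuming the country column is called Country; adjust the name to match the dataset:

```python
# union() appends rows positionally, so both DataFrames must share a schema.
union_df = survey_df.union(survey_df)

# Sort the combined result alphabetically by the country column.
union_df.orderBy("Country").show()
```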

5. Use a GroupBy query based on treatment.
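
A sketch of the grouping, assuming the dataset has a treatment column:

```python
# Count how many survey responses fall under each treatment value.
survey_df.groupBy("treatment").count().show()
```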

Part 2:

1. Apply basic queries related to joins and aggregate functions (at least 2).
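
A sketch of two such queries through Spark SQL, assuming Country and Age columns exist; the join is a self-join purely to illustrate the syntax:

```python
# Expose the DataFrame to SQL as a temporary view.
survey_df.createOrReplaceTempView("survey")

# Aggregate functions: average and maximum age per country.
spark.sql("""
    SELECT Country, AVG(Age) AS avg_age, MAX(Age) AS max_age
    FROM survey
    GROUP BY Country
""").show()

# Join: self-join on Country, counting matching row pairs per country.
spark.sql("""
    SELECT a.Country, COUNT(*) AS pairs
    FROM survey a JOIN survey b ON a.Country = b.Country
    GROUP BY a.Country
""").show()
```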

2. Write a query to fetch the 13th row in the dataset.
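
DataFrames have no built-in row index, so one approach is to attach a row number and filter on it; a sketch:

```python
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Impose an ordering, number the rows, and keep only row 13.
w = Window.orderBy(monotonically_increasing_id())
survey_df.withColumn("row_num", row_number().over(w)) \
         .filter("row_num = 13").show()
```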

Part 3 (bonus):

1. Write a parseLine method to split the comma-delimited rows and create a DataFrame.
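
A possible shape for parseLine, reading the raw text file as an RDD, splitting on commas, and converting the result to a DataFrame; the field positions and column names below are assumptions about the file layout, and a plain split does not handle quoted commas:

```python
def parseLine(line):
    # Naive comma split; keep two assumed positions, e.g. age and country.
    fields = line.split(",")
    return (fields[1], fields[3])

# Read the file as plain text, drop the header row, parse each line.
lines = spark.sparkContext.textFile("survey.csv")
header = lines.first()
parsed = lines.filter(lambda l: l != header).map(parseLine)

# toDF needs column names; this builds a two-column DataFrame.
parsed_df = parsed.toDF(["Age", "Country"])
parsed_df.show(5)
```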

References

https://stackoverflow.com/questions/51689460/select-specific-columns-from-spark-dataframe

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-aggregate-functions.html