ICP 8

Steps followed for installation and project creation

  1. I downloaded and installed the JDK, Spark, and IntelliJ.
  2. After these installations, I installed the Spark plugin in IntelliJ.
  3. Then I created a new sbt Spark project named ICP 8.
  4. Once the entire project structure was built, I updated all the dependencies in the sbt file (a sketch of this file follows the list).
  5. After that, I created a new Scala object under the src -> main -> scala path.
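
A minimal sketch of the kind of build.sbt described in step 4. The Scala and Spark versions here are assumptions, not necessarily the ones used in the project:

```scala
name := "ICP 8"

version := "0.1"

// Assumed versions; adjust to match the local installation.
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
```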

Task 1: Word Count

  1. The overall map-reduce process for the word count is as follows (a code sketch follows this list).

  • Initially, the input file is read from the given path with the help of the SparkContext.

  • Each line is split into words on the spaces between them with the help of the “flatMap” Spark transformation.

  • Each word is then mapped to a key-value pair, and the pairs are reduced with the “reduceByKey” Spark transformation.

  • Finally, the reduced output is sorted by key in alphabetical order.

  • It is then saved as an output file with the “saveAsTextFile” Spark action.

  • To count the unique words in the file, the “count” Spark action is performed.

  • With the help of the “foreach” Spark action, the word counts are printed on the console.
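
A minimal Scala sketch of this word-count flow. The object name, input path, and output path are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local-mode configuration; the app name and master are assumptions.
    val conf = new SparkConf().setAppName("ICP 8 Word Count").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the input file and split each line into words on spaces.
    val words = sc.textFile("input.txt").flatMap(line => line.split(" "))

    // Map each word to (word, 1) and sum the counts per word with reduceByKey.
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Sort by key so the output is in alphabetical order, then save it.
    val sorted = counts.sortByKey()
    sorted.saveAsTextFile("output")

    // The foreach action prints on the console when running in local mode.
    sorted.foreach(println)

    // The count action returns the number of unique words.
    println("Unique words: " + sorted.count())

    sc.stop()
  }
}
```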

  2. The input text file is shown below.

  3. The output text file is shown below.

  4. The output printed on the console with the help of the “foreach” Spark action is shown below.

  5. The output of the “count” Spark action (the number of unique words) on the console is as follows.

Task 2: Secondary Sorting

  • A user-defined function named “parsing_line” is defined to parse the input data into the required format (a code sketch follows this list).

  • The input file is then read from the given path with the help of the SparkContext.

  • Each line is split on the comma separator between the fields with the help of the “map” Spark transformation, and the results are stored in the required key-value format.

  • The resulting key-value pairs are then partitioned with a hash partitioner.

  • Finally, the output is grouped by date, and the values in each group are sorted in the reducer phase.

  • With the help of the “foreach” Spark action, the final output is printed on the console.

  • It is then saved as an output file with the “saveAsTextFile” Spark action.
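
A minimal Scala sketch of this secondary-sort flow. The input format (a date followed by a comma-separated value), the file paths, the object name, and the partition count are assumptions; only “parsing_line” is named in the description above:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SecondarySort {
  // Parses one input line into a (date, value) pair.
  // Assumed input format: "date,value", e.g. "2020-01-01,42".
  def parsing_line(line: String): (String, Int) = {
    val fields = line.split(",")
    (fields(0), fields(1).trim.toInt)
  }

  def main(args: Array[String]): Unit = {
    // Local-mode configuration; the app name and master are assumptions.
    val conf = new SparkConf().setAppName("ICP 8 Secondary Sort").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Parse each line into a key-value pair with the map transformation.
    val pairs = sc.textFile("input.txt").map(parsing_line)

    // Partition the pairs with a hash partitioner, group the values by date,
    // and sort the values inside each group (the secondary sort).
    val sorted = pairs
      .groupByKey(new HashPartitioner(2))
      .mapValues(values => values.toList.sorted)

    // Print the final grouped output on the console (local mode) and save it.
    sorted.foreach(println)
    sorted.saveAsTextFile("output")

    sc.stop()
  }
}
```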

  2. The input text file is shown below.

  3. The output text file is shown below.

  4. The output printed on the console with the help of the “foreach” Spark action is shown below.

References:

  1. http://allaboutscala.com/tutorials/chapter-1-getting-familiar-intellij-ide/scala-tutorial-first-hello-world-application/

  2. https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/#:~:text=Two%20types%20of%20Apache%20Spark,that%20point%20Action%20is%20performed.

  3. https://umkc.app.box.com/s/ujksg7hnkoz6yg5oxdqy0z5dp83fjbwp