ICP 8 - manaswinivedula/Big-Data-Programming GitHub Wiki
## Steps followed for installation and project creation
- I downloaded and installed the JDK, Spark, and IntelliJ.
- After these installations, I added the Spark plugin in IntelliJ.
- Then I created a new sbt Spark project named ICP 8.
- Once the project structure was built, I updated all the dependencies in the sbt file.
- After that, I created a new Scala object under the src -> main -> scala path.
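The sbt setup described above can be sketched as follows; the project name, Scala version, and Spark version here are assumptions for illustration, not necessarily the exact ones used:

```scala
// build.sbt (a minimal sketch; versions are assumptions)
name := "ICP8"
version := "1.0"
scalaVersion := "2.11.12"

// spark-core provides SparkContext and the RDD API used in both tasks
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
```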
## Task 1: Word Count
- The overall map-reduce process for the word count is as follows.
- Initially, the input file is read from the given path with the help of SparkContext.
- Each line is split on the spaces between words with the `flatMap` Spark transformation.
- After the split, the words are counted with the `reduceByKey` Spark transformation.
- Finally, the reduced output is sorted by key in alphabetical order.
- Then it is saved as an output file with the `saveAsTextFile` Spark action.
- To count the unique words in the file, the `count` Spark action is performed.
- With the help of the `foreach` Spark action, the word counts are printed on the console.
- The input text file is shown below.
- The output text file is shown below.
- The output printed on the console with the help of the `foreach` Spark action is shown below.
- The output of the unique word `count`, which is a Spark action, on the console is as follows.
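The steps above can be sketched in Scala roughly as follows; the application name, input/output paths, and local master setting are placeholder assumptions, not the exact ICP code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the input file from the given path (path is a placeholder)
    val lines = sc.textFile("input.txt")

    // Split each line on spaces with flatMap, pair each word with 1,
    // and sum the counts per word with reduceByKey
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Sort by key in alphabetical order and save as a text file
    val sorted = counts.sortByKey()
    sorted.saveAsTextFile("output")

    // count gives the number of unique words; foreach prints each pair
    println(s"Unique words: ${sorted.count()}")
    sorted.collect().foreach(println)

    sc.stop()
  }
}
```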
## Task 2: Secondary Sorting
- A user-defined function named `parsing_line` is defined to parse the input data into the required format.
- Then the input file is read from the given path with the help of SparkContext.
- Each line is split on the comma separator with the `map` Spark transformation, and the fields are stored in the required key-value format.
- After the split, the key-value pairs are partitioned by a hash partitioner.
- Finally, the output is grouped by date, and the values in each list are sorted in the reducer phase.
- With the help of the `foreach` Spark action, the final output is printed on the console.
- Then it is saved as an output file with the `saveAsTextFile` Spark action.
- The input text file is shown below.
- The output text file is shown below.
- The output printed on the console with the help of the `foreach` Spark action is shown below.
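The secondary-sorting steps above can be sketched roughly as follows; the column layout of the input (date in the first field, an integer value in the second), the paths, and the partition count are assumptions made for illustration:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SecondarySort {
  // Parse one comma-separated line into a (date, value) pair.
  // The exact column layout is an assumption based on the description above.
  def parsing_line(line: String): (String, Int) = {
    val fields = line.split(",")
    (fields(0).trim, fields(1).trim.toInt)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the input file (path is a placeholder)
    val input = sc.textFile("input.txt")

    // map transformation: each line becomes a (date, value) pair
    val pairs = input.map(parsing_line)

    // HashPartitioner sends all values for the same date to one partition;
    // grouping by date then sorting each value list mimics the reducer phase
    val partitioned = pairs.partitionBy(new HashPartitioner(2))
    val grouped = partitioned.groupByKey().mapValues(_.toList.sorted)

    // Print the final output to the console and save it as a text file
    grouped.collect().foreach(println)
    grouped.saveAsTextFile("output")

    sc.stop()
  }
}
```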