ICP 8 - a190884810/Big-Data-Programming GitHub Wiki
SPARK
PROBLEM STATEMENT
- Write a Spark program with an interesting use case using text data as input; the program should have at least two Spark transformations and two Spark actions. Present your use case in the map-reduce paradigm as shown below (for word count).
- Secondary sorting is used to sort the values in the reducer phase. Take any input of your interest and perform secondary sorting on it.
FEATURES
- Using the IntelliJ IDE and the Scala programming language, programs were written to perform word count and secondary sorting.
APPROACH
QUESTION 1
- The following screenshots depict the code written for the word count program.
- In the above code, a text file with a few words is fed as input. Based on the given regular expression, a word-wise count is computed and generated as output. The two transformations used here are map and flatMap, and the two actions used are foreach and take(). The following screenshots depict the input and output files.
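Since the actual code is shown only in screenshots, a minimal sketch of such a word count program is given below. The input/output paths, regular expression, and object name are assumptions, not the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Word count sketch: two transformations (flatMap, map) and two actions (foreach, take).
// "input.txt" and "output" are placeholder paths.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("input.txt")
    // Transformation 1: flatMap splits each line into words on non-word characters.
    val words = lines.flatMap(line => line.split("\\W+").filter(_.nonEmpty))
    // Transformation 2: map pairs each word with an initial count of 1.
    val pairs = words.map(word => (word.toLowerCase, 1))
    // reduceByKey sums the counts per word (the "reduce" side of the paradigm).
    val counts = pairs.reduceByKey(_ + _)

    // Action 1: foreach prints every (word, count) pair.
    counts.foreach(println)
    // Action 2: take(10) pulls a sample of results back to the driver.
    counts.take(10).foreach(println)

    counts.saveAsTextFile("output")
    sc.stop()
  }
}
```

Run locally (e.g. via `spark-submit` or inside IntelliJ with the Spark dependencies on the classpath); the `local[*]` master is set here only for convenience.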
INPUT
OUTPUT
QUESTION 2
- This code performs secondary sorting. Secondary sorting is a technique where the values are sorted along with the keys during the reduce phase of a Map-Reduce job. In our code, each record is split on a ','; the input file is formatted accordingly. After splitting, the data is mapped to key-value pairs, then partitioned and sorted in the following steps. The following screenshots depict the source code and the input and output files.
- Initially, the data is grouped by the composite key; here, the composite key is the animal name. Once the grouping is done, the sorting is carried out.
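The steps above (split on ',', map to pairs, group by the animal-name key, then sort) can be sketched as follows. The file name, record layout ("animal,value"), and object name are assumptions for illustration, not the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Secondary sorting sketch: records like "cat,3" are split on ',',
// grouped by animal name, and each group's values are then sorted.
object SecondarySort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("animals.txt")
    // Split each record on ',' and map it to an (animal, value) pair.
    val pairs = lines.map { line =>
      val fields = line.split(",")
      (fields(0).trim, fields(1).trim.toInt)
    }

    // Group by the animal-name key, then sort the values within each group.
    val grouped = pairs.groupByKey().mapValues(_.toList.sorted)

    // Sort by key as well so output is ordered by animal name, then by value.
    grouped.sortByKey().collect().foreach(println)
    sc.stop()
  }
}
```

Note that `groupByKey` followed by an in-memory sort matches the description here but materializes each group on one executor; for large groups a true composite key with `repartitionAndSortWithinPartitions` is the more scalable pattern.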
INPUT
OUTPUT