SPARK ICP 1 - navyagonug/CS5590-BIG-DATA-PROGRAMMING-USING-HADOOP-AND-SPARK GitHub Wiki

PROBLEM STATEMENT

  1. Write a Spark program with an interesting use case, using text data as the input. The program should have at least two Spark transformations and two Spark actions. Present your use case in the MapReduce paradigm as shown below (for word count).

  2. Secondary sorting is used to sort the values in the reducer phase. Take any input of your interest and perform secondary sorting on it.

  3. (BONUS) Looking at the word count example above, count the frequency of characters. Implement character count on the given input using the MapReduce algorithm.

FEATURES

Using the IntelliJ IDE and the Scala programming language, programs to perform word count and secondary sorting were written. A character count program was also completed as part of the bonus credit.

APPROACH

QUESTION 1

The following screenshots depict the code written for the word count program.
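A minimal sketch of such a word count program is shown below. The input/output paths, the splitting regular expression, and the local master setting are assumptions, since the actual code is in the screenshots.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val input = sc.textFile("input.txt")

    // Transformation 1: flatMap splits each line into words on non-word characters
    val words = input.flatMap(line => line.split("\\W+").filter(_.nonEmpty))

    // Transformation 2: map pairs each word with 1; reduceByKey sums the counts
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Action 1: take() pulls the first few (word, count) pairs to the driver
    counts.take(5).foreach(println)

    // Action 2: foreach runs on the executors (prints to the console in local mode)
    counts.foreach(println)

    counts.saveAsTextFile("output")
    sc.stop()
  }
}
```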

In the above code, a text file with a few words is fed as input. Based on the given regular expression, a word-wise count is computed and written as output. The two transformations used are flatMap and map; the two actions used are foreach and take(). The following screenshots depict the input and output files.

INPUT

OUTPUT

QUESTION 2

This code performs secondary sorting. Secondary sorting is a technique where, in the MapReduce paradigm, the values are sorted in the reduce phase along with the keys. The following screenshots depict the source code and the input and output files. In our code, each line of the input file is split on ',', and the input file is formatted accordingly. After splitting, the data is mapped, then partitioned and sorted in the following steps.

  1. Initially, the data is grouped by the composite key; here the composite key is built from the animal name. Once the grouping is done, sorting is carried out.
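The steps above can be sketched as follows. The input format (lines like `cat,3`), file names, and partition count are assumptions; the pattern of a custom partitioner on the natural key plus a composite-key ordering follows the secondary-sort approach described above.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object SecondarySort {
  // Partition only by the natural key (the animal name), so all records
  // for one animal land in the same partition
  class AnimalPartitioner(partitions: Int) extends Partitioner {
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = key match {
      case (animal: String, _) => math.abs(animal.hashCode) % numPartitions
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SecondarySort").setMaster("local[*]"))

    // Split each line on ',' and build a composite key of (animal, value)
    val pairs = sc.textFile("animals.txt")
      .map(_.split(","))
      .map(fields => ((fields(0), fields(1).toInt), null))

    // Order composite keys by animal name first, then by value, so the
    // shuffle groups by animal and sorts values within each group
    implicit val ordering: Ordering[(String, Int)] =
      Ordering.Tuple2(Ordering.String, Ordering.Int)
    val sorted = pairs.repartitionAndSortWithinPartitions(new AnimalPartitioner(2))

    sorted.map { case ((animal, value), _) => s"$animal,$value" }
      .collect()
      .foreach(println)
    sc.stop()
  }
}
```

Using `repartitionAndSortWithinPartitions` lets Spark's shuffle machinery do the sort, instead of sorting each group in memory after a `groupByKey`.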

INPUT

OUTPUT

QUESTION 3 (BONUS)

The same approach as in Question 1 is used; however, the regular expression is changed so that a character-wise count is done. The following screenshots depict the source code used.
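The change from the word count program can be sketched as below; the file name and the exact splitting expression are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CharCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CharCount").setMaster("local[*]"))

    // Instead of splitting on word boundaries, split each line into
    // individual characters (whitespace removed first)
    val chars = sc.textFile("input.txt")
      .flatMap(line => line.replaceAll("\\s", "").split(""))
      .filter(_.nonEmpty)

    // Same map/reduce pattern as word count, keyed by character
    val charCounts = chars.map(c => (c, 1)).reduceByKey(_ + _)
    charCounts.collect().foreach(println)
    sc.stop()
  }
}
```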

INPUT

OUTPUT

CONFIGURATIONS

The build.sbt file has been configured and the following library dependencies have been added.
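A possible build.sbt along these lines is sketched below; the project name and the Scala and Spark versions are assumptions, since the actual file is shown in the screenshot.

```scala
// Sketch of build.sbt with the Spark core dependency
name := "SparkICP1"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
```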

REFERENCES

  1. https://spark.apache.org/examples.html
  2. http://codingjunkie.net/spark-secondary-sort/
  3. http://timepasstechies.com/spark-secondary-sorting-example-using-rdd-dataframe/