ICP 2 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson 2: Hadoop MapReduce and Hadoop Distributed File System (HDFS)

Use case Description:

1. Counting the frequency of words in the given input (WordCount.txt) with the MapReduce algorithm

Creating a project in Eclipse and importing the required external JAR files

Creating the class file and exporting it as a JAR file

Creating a directory in HDFS for the input files

Copying the WordCount file from the local file system to HDFS

Executing the generated JAR file, i.e. running the MapReduce code

It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.

In the Mapper, we split each input line into tokens and emit each token as a key with a count of 1. The framework then shuffles and sorts these pairs by key, so that all occurrences of the same word are grouped together.

In the Reducer, we aggregate the shuffled values for each word and generate the word count.
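The map, shuffle and reduce steps described above can be sketched in plain Java. This is only an illustration of the data flow, not the class used in this ICP: it deliberately avoids the Hadoop API, and the names `WordCountSketch`, `map`, `shuffle`, `reduce` and `wordCount` are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow (assumption: no Hadoop classes are used).
class WordCountSketch {

    // Map step: split one line into tokens and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group the emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle and reduce over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {big=2, data=2, lake=1}
        System.out.println(wordCount(Arrays.asList("big data big", "data lake")));
    }
}
```

In a real job the map and reduce methods would live in Hadoop `Mapper` and `Reducer` subclasses, and the grouping done here by `shuffle` would be handled by the framework.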

2. Counting the frequency of words in the given text file that start with the letter 'a'

Creating the class file and exporting it as a JAR file

Executing the generated JAR file

It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.

In the Mapper, we split the words into tokens and emit only the tokens that start with 'a'.

To increase efficiency, we filter the tokens starting with 'a' in the Mapper rather than in the Reducer. If the Mapper emitted every token, the complete data would be written to local disk and handed to the Reducer, only to be filtered out there. By filtering in the Mapper phase, only the matching tokens are stored on local disk, and the Reducer then reads this smaller Mapper output.

In the Reducer, we aggregate the tokens and generate the word count.
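A minimal plain-Java sketch of the map-side filter described above follows. It assumes a case-sensitive match on a lowercase 'a' and does not use the Hadoop API; the names `LetterFilterSketch`, `mapWithFilter` and `countWordsStartingWithA` are hypothetical and not taken from the ICP code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of filtering in the map step (assumption: no Hadoop classes are used).
class LetterFilterSketch {

    // Map step with the filter applied: only tokens starting with 'a' are emitted,
    // so non-matching tokens never become intermediate output.
    static List<String> mapWithFilter(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("a")) { // assumption: lowercase 'a' only
                emitted.add(token);
            }
        }
        return emitted;
    }

    // Reduce step: count the occurrences of each emitted token.
    static Map<String, Integer> countWordsStartingWithA(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : mapWithFilter(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {a=1, an=1, and=1, apple=2}
        System.out.println(countWordsStartingWithA(
                Arrays.asList("an apple and a banana", "apple pie")));
    }
}
```

In this example the map step emits 5 of the 7 input tokens; "banana" and "pie" are dropped before any counting happens, which is the saving the paragraph above describes.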

Bonus Question:

Determine the prime numbers in the input and print each number only once

Creating the class file and exporting it as a JAR file, then executing the generated JAR file

It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.

In the Mapper, we split the given input into tokens. If there are any repeated values, the combiner passes only one value per key to the Reducer, so each number appears once.

In the Reducer, we first convert the input to an integer and then apply the prime-number check. If the number is prime, the output value is 0; otherwise it is 1.
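The deduplication and prime check described above can be sketched in plain Java as follows. This is an illustration under assumptions, not the ICP's code: it avoids the Hadoop API, assumes whitespace-separated integers in the input, and uses the hypothetical names `PrimeSketch`, `isPrime` and `primeFlags`. The 0-for-prime, 1-otherwise output convention is the one stated in the text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the prime-number job (assumption: no Hadoop classes are used).
class PrimeSketch {

    // Trial division up to the square root of n; numbers below 2 are not prime.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (long i = 2; i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Tokenizes the input, keeps each number once (the map keys play the role
    // of the combiner's deduplication), and flags primes with 0 and others with 1.
    static Map<Integer, Integer> primeFlags(List<String> lines) {
        Map<Integer, Integer> flags = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                int number = Integer.parseInt(tokens.nextToken());
                flags.put(number, isPrime(number) ? 0 : 1);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        // Prints {2=0, 3=0, 4=1, 9=1, 11=0}
        System.out.println(primeFlags(Arrays.asList("2 3 4 3", "9 2 11")));
    }
}
```

The repeated 2 and 3 in the sample input each appear only once in the result, matching the "print number only once" requirement.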