ICP 2 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson 2: Hadoop MapReduce and Hadoop Distributed File System (HDFS)

Use case Description:

image

1. Counting the frequency of words in the given input (WordCount.txt) with the MapReduce algorithm

We create a project in Eclipse and import the required external JAR files.

image

image

image

image

Creating the class file and exporting it as a JAR file

image

image

image

image

image

image

Creating a directory in HDFS for the input files

image

Copying the WordCount file from the local file system to HDFS

image

image

Executing the generated JAR file, i.e. running the MapReduce code

image

image

The job generates two files in the output folder: _SUCCESS, an empty marker file indicating that the job completed, and part-r-00000, which contains the output.

image

In the Mapper, we split each input line into tokens and emit a (word, 1) pair for every token. The framework then shuffles and sorts these pairs by key, so that all values for the same word arrive at the same Reducer call.

image
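The map step described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the tokenize-and-emit logic, not the code from the screenshot; the class name MapStepSketch and the sample input line are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are used (assumption).
class MapStepSketch {

    // Splits one input line on whitespace and emits a (token, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical input line, not taken from WordCount.txt.
        System.out.println(map("big data big cluster"));
    }
}
```

Repeated words produce repeated pairs at this stage; the counting itself happens later, in the reduce step.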

In the Reducer, we sum the values received for each word after the shuffle to produce its word count.

image
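As a rough sketch of what happens between the Mapper and the final output, the following plain-Java code groups (word, 1) pairs by key and then sums each group. It assumes an in-memory list instead of HDFS data, and the names ReduceStepSketch, pairsOf, shuffle and reduce are hypothetical helpers, not Hadoop API calls.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for shuffle and reduce; no Hadoop classes are used (assumption).
class ReduceStepSketch {

    // Hypothetical helper: builds the (word, 1) pairs a mapper would emit.
    static List<Map.Entry<String, Integer>> pairsOf(String... words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Stand-in for the shuffle: collects the values of identical keys, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sums the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Prints one "word<TAB>count" line per distinct word.
        shuffle(pairsOf("big", "data", "big"))
            .forEach((word, ones) -> System.out.println(word + "\t" + reduce(ones)));
    }
}
```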

2. Counting the frequency of words in the given text file that start with the letter ‘a’

Creating the class file and exporting it as a JAR file

image

Executing the generated JAR file

image

The job again generates two files in the output folder, _SUCCESS and part-r-00000. The output can be found in part-r-00000.

image

In the Mapper, we split the input into tokens, keep only the tokens that start with 'a', and emit a pair for each of them.

To increase efficiency, we filter the tokens starting with 'a' in the Mapper rather than in the Reducer. The Mapper writes its output to local disk, and the Reducer then reads that output. If the filtering were done in the Reducer, every token would first be written to disk and passed on, only for most of them to be discarded. Filtering in the Mapper means that only the matching tokens are written and handed to the Reducer.

image
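A minimal plain-Java sketch of the map-side filter follows. It assumes that only lowercase 'a' counts as a match, which the text does not specify, and the class name FilterMapSketch is hypothetical; it is not the code from the screenshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java stand-in for the filtering mapper; no Hadoop classes are used (assumption).
class FilterMapSketch {

    // Returns only the tokens that start with lowercase 'a' (case handling is an assumption).
    static List<String> map(String line) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("a")) {
                kept.add(token); // a real mapper would emit (token, 1) here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input line; non-matching tokens never reach the next stage.
        System.out.println(map("an apple and Banana at Apex"));
    }
}
```

Because the non-matching tokens are dropped before anything is emitted, the later stages only ever see the words that will appear in the final count.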

In the Reducer, we aggregate the tokens and generate the word count.

image

Bonus Question:

Determine the prime numbers in the input and print each number only once

Creating the class file, exporting it as a JAR file, and executing the generated JAR file

image

The job again generates two files in the output folder, _SUCCESS and part-r-00000. The output can be found in part-r-00000.

image

In the Mapper, we split the given input into tokens. If any value is repeated, the combiner collapses the duplicates so that each unique key is passed to the Reducer only once.

image
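The effect of removing repeated values can be illustrated in plain Java with a sorted set. This is a simplified sketch that assumes whitespace-separated input held in memory; it does not use the Hadoop combiner API, and the class name UniqueTokensSketch is hypothetical.

```java
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

// Plain-Java stand-in for de-duplicating keys; no Hadoop classes are used (assumption).
class UniqueTokensSketch {

    // Splits the input on whitespace and keeps each distinct token once, sorted as text.
    static Set<String> uniqueTokens(String input) {
        Set<String> unique = new TreeSet<>();
        StringTokenizer tokens = new StringTokenizer(input);
        while (tokens.hasMoreTokens()) {
            unique.add(tokens.nextToken());
        }
        return unique;
    }

    public static void main(String[] args) {
        // Hypothetical input with repeated numbers.
        System.out.println(uniqueTokens("7 4 7 11 4"));
    }
}
```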

In the Reducer, we first convert the input to an integer and then apply the prime-number check. The output is 0 if the number is prime and 1 otherwise.

image

image
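The prime check in the Reducer can be sketched in plain Java as below. Trial division up to the square root is an assumption about how the check is done; the text only states the 0/1 output convention. The names PrimeFlagSketch, isPrime and flag are hypothetical.

```java
// Plain-Java stand-in for the reducer's prime check; no Hadoop classes are used (assumption).
class PrimeFlagSketch {

    // Trial division up to the square root of n (assumed method, not confirmed by the text).
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Follows the convention stated in the text: 0 for a prime number, 1 otherwise.
    static int flag(String token) {
        return isPrime(Integer.parseInt(token.trim())) ? 0 : 1;
    }

    public static void main(String[] args) {
        // Hypothetical input values.
        for (String token : new String[] {"2", "4", "7", "9"}) {
            System.out.println(token + "\t" + flag(token));
        }
    }
}
```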