ICP 02 : Hadoop and MapReduce - acikgozmehmet/BigDataProgramming GitHub Wiki

ICP-02 : Hadoop MapReduce and Hadoop Distributed File System (HDFS)

Objectives

  1. Counting the frequency of words in the given input with the MapReduce algorithm
  2. Counting the frequency of words in a given text file that start with the letter ‘a’
  3. Determining the prime numbers in the input and printing each number only once (Bonus)

How Hadoop MapReduce Works

MapReduce is the heart of Hadoop. It is a programming model designed for processing huge volumes of data (both structured and unstructured) in parallel by dividing the work into a set of independent tasks.

MapReduce combines two processing idioms, Map and Reduce, in which we can specify our custom business logic. Map is the first phase of processing, where the complex logic, business rules, and costly code are specified. Reduce is the second phase of processing, where lighter-weight processing such as aggregation or summation is specified.

Below are the steps for MapReduce data flow:

Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify custom business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.

Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. Mapper output is not stored on HDFS because it is temporary data, and writing it to HDFS would create many unnecessary copies.

Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node, but the reduce phase runs here, hence it is called the reducer node). The shuffling/copying is a physical movement of data over the network.

Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, the intermediate output is merged and sorted, and then provided as input to the reduce phase.

Step 5: Reduce is the second phase of processing, where the user can specify custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.
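The five steps above can be sketched in miniature. The following is an illustrative Python simulation only (the actual jobs in this ICP are Java programs submitted with `hadoop jar`); the input blocks are made-up example data.

```python
from collections import defaultdict

def map_phase(block):
    """Step 1: each mapper emits (word, 1) pairs for its own input block."""
    return [(word, 1) for word in block.split()]

def shuffle(mapper_outputs):
    """Steps 2-4: intermediate pairs are grouped by key and sorted
    before being handed to the reducers."""
    groups = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Step 5: the reducer aggregates all values for a key."""
    return key, sum(values)

# Two "blocks" processed by two mappers in parallel.
blocks = ["big data is big", "data is everywhere"]
mapped = [map_phase(b) for b in blocks]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In a real cluster the shuffle step moves data over the network to reducer nodes; here it is just an in-memory grouping.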

MapReduce_Process


1. Counting the frequency of words in the given input with the MapReduce algorithm

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-02/SourceCode/WordCount.zip

hadoop fs -mkdir /user/cloudera/icp2

hadoop fs -mkdir /user/cloudera/icp2/input_wordcount

hadoop fs -put sample.txt /user/cloudera/icp2/input_wordcount

sample.txt

  • Algorithm

    In the map phase;

     A set of data is converted into another set of data, in which individual elements are broken down into key-value pairs.
    

    In the reduce phase;

     For every key:
    
         a variable called sum is initialized to 0.
         sum is incremented for each occurrence of the key.
         finally, sum is written to an output file on HDFS.
    
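The algorithm above can be illustrated with a minimal Python sketch (the real job is the Java WordCount program linked above; the sample sentence is invented for illustration):

```python
from itertools import groupby

def mapper(line):
    # Map phase: each word in the line becomes a (word, 1) key-value pair.
    for word in line.split():
        yield word, 1

def reducer(word, values):
    # Reduce phase: sum starts at 0 and is incremented for every occurrence.
    total = 0
    for v in values:
        total += v
    return word, total

pairs = sorted(mapper("an apple a day keeps an apple"))
# Group values by key, as the shuffle/sort phase would.
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda p: p[0])}
counts = [reducer(w, vs) for w, vs in grouped.items()]
print(counts)  # [('a', 1), ('an', 2), ('apple', 2), ('day', 1), ('keeps', 1)]
```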

hadoop jar WordCount-1.0-SNAPSHOT.jar WordCount /user/cloudera/icp2/input_wordcount /user/cloudera/icp2/output_wordcount

hadoop fs -ls /user/cloudera/icp2/output_wordcount

hadoop fs -cat /user/cloudera/icp2/output_wordcount/part-r-00000

Output-Sample.txt

2. Counting the frequency of words in a given text file that start with the letter ‘a’

  • Algorithm

    In the map phase;

     A set of data is converted into another set of data, in which individual elements are broken down into key-value pairs.
    

    In the reduce phase;

     For every key:
    
         a variable called sum is initialized to 0.
         sum is incremented if the key starts with 'a'.
         finally, sum is written to an output file on HDFS.
    
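As with word count, only the reduce logic changes: the sum is incremented only for keys that start with 'a'. A hedged Python sketch of that reducer (the grouped input below is invented example data, not the actual job output):

```python
def reducer_starts_with_a(word, values):
    # Per the algorithm above: sum is incremented only when the key
    # starts with 'a', so every other word ends up with a count of 0.
    total = 0
    for v in values:
        if word.startswith('a'):
            total += v
    return word, total

# Grouped mapper output, as it would arrive after the shuffle phase.
grouped = {"apple": [1, 1], "banana": [1], "ant": [1]}
results = [(w, t) for w, t in
           (reducer_starts_with_a(w, vs) for w, vs in grouped.items())
           if t > 0]
print(results)  # [('apple', 2), ('ant', 1)]
```

An alternative design would filter non-'a' words in the mapper instead, which reduces the data moved during the shuffle.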

Code#2

hadoop fs -mkdir /user/cloudera/icp2/input_wordstartingwitha

hadoop fs -put inputfile /user/cloudera/icp2/input_wordstartingwitha

Words starting with 'a'

hadoop jar WordCountStartingWithA-1.0-SNAPSHOT.jar WordCount /user/cloudera/icp2/input_wordstartingwitha /user/cloudera/icp2/output_wordstartingwitha

hadoop fs -ls /user/cloudera/icp2/output_wordstartingwitha

hadoop fs -cat /user/cloudera/icp2/output_wordstartingwitha/part-r-00000

output-starting with 'a'

Bonus Question: Determine the prime numbers in the input and print each number only once

  • Algorithm

    In the map phase;

     A set of data is converted into another set of data, in which individual elements are broken down into key-value pairs.
    

    In the reduce phase;

     For every key (number):
    
         a variable called sum is initialized to 0 (assuming the number is prime).
         sum is set to 1 (not prime) if the number is divisible by any integer between 2 and the square root of the number, inclusive.
         finally, sum is written to an output file on HDFS.
    
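The primality test described above can be sketched in Python (illustration only; the actual job is the Java PrimeNumber program run below, and the input list here is made up). Emitting each number as a key means duplicates collapse to a single key after the shuffle, so each prime is printed only once:

```python
import math

def is_prime(number):
    # sum starts at 0 (assume prime) and is set to 1 (not prime) if the
    # number is divisible by any integer from 2 up to its square root.
    if number < 2:
        return False
    flag = 0
    for d in range(2, math.isqrt(number) + 1):
        if number % d == 0:
            flag = 1
            break
    return flag == 0

numbers = [2, 3, 4, 9, 17, 17, 25, 29]   # note the duplicate 17
primes = sorted({n for n in numbers if is_prime(n)})  # a set keeps each prime once
print(primes)  # [2, 3, 17, 29]
```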

Code#Bonus

hadoop fs -mkdir /user/cloudera/icp2/input_numbers

hadoop fs -put numbers /user/cloudera/icp2/input_numbers

numbers

hadoop jar PrimeNumber-1.0-SNAPSHOT.jar WordCount /user/cloudera/icp2/input_numbers /user/cloudera/icp2/output_numbers

hadoop fs -cat /user/cloudera/icp2/output_numbers/part-r-00000

output-numbers
