BDP _ ICP2 - aaackc/Big-data-programing GitHub Wiki
Title
Hadoop Mapreduce and Hadoop Distributed File System (HDFS)
Description
Objectives:
- Counting the frequency of words in the given input with the MapReduce algorithm
- Counting the frequency of words in a given text file that start with the letter ‘a’
- Determining the prime numbers in the input and printing each number only once (Bonus)
How does MapReduce work?
MapReduce is the heart of Hadoop. It is a programming model designed for processing huge volumes of data (both structured and unstructured) in parallel by dividing the work into a set of independent sub-tasks.
MapReduce is the combination of two processing idioms called Map and Reduce, in which we can specify our custom business logic. Map is the first phase of processing, where we specify the complex logic, business rules, and costly code. Reduce is the second phase of processing, where we specify light-weight processing such as aggregation or summation.
Below are the steps for MapReduce data flow:
Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify custom business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.
Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. It is not stored on HDFS because it is temporary data, and writing it to HDFS would create many unnecessary copies.
Step 3: The mapper output is shuffled to the reducer node (a normal slave node, called the reducer node because the reduce phase runs there). The shuffling/copying is a physical movement of data done over the network.
Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
Step 5: Reduce is the second phase of processing, where the user can specify custom business logic as per the requirements. The input to a reducer comes from all the mappers. The output of the reducer is the final output, which is written to HDFS.
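The five steps above can be sketched in plain Java as a small in-memory simulation of the map, shuffle/sort, and reduce phases. This is not the Hadoop API; the class and method names here are illustrative only, and the "shuffle" is just a sorted grouping rather than a network transfer.

```java
import java.util.*;
import java.util.stream.*;

// A minimal in-memory sketch of the map -> shuffle/sort -> reduce flow
// described above (plain Java, no Hadoop dependencies).
public class MapReduceFlow {

    // Map phase: emit an intermediate (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle/sort phase: group intermediate pairs by key, keys kept sorted.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key to get the final output.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : new String[]{"big data big ideas", "data flows"}) {
            intermediate.addAll(map(line));   // one map call per input split
        }
        System.out.println(reduce(shuffle(intermediate)));
    }
}
```

In the real job, each `map` call runs on a separate cluster node against one HDFS block, and the grouping step is the physical shuffle over the network.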
1. Counting the frequency of words in the given input with MapReduce algorithm
Algorithm
In the map phase;
A set of data is converted into another set of data, where individual elements are broken down into key-value pairs.
In the reduce phase;
For every key:
- a variable called sum is initialized to 0.
- sum is incremented for every occurrence of the key.
- finally, the key and its sum are written to an output file on HDFS.
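The per-key reduce logic above can be sketched as a single plain-Java function. This is a simplified stand-in, not the actual reducer class from the jar; in the real job this loop lives inside the reducer's reduce() method, with Hadoop's writable types instead of plain integers.

```java
import java.util.*;

// A minimal sketch of the reduce step described above: for each key,
// initialize sum to 0, add one for every occurrence, then emit (key, sum).
public class WordCountReduce {

    static int reduceKey(Iterable<Integer> counts) {
        int sum = 0;               // step 1: sum starts at 0
        for (int c : counts) {     // step 2: add each count emitted for this key
            sum += c;
        }
        return sum;                // step 3: written out as (key, sum) on HDFS
    }

    public static void main(String[] args) {
        // The shuffle phase would hand the reducer e.g. ("hadoop", [1, 1, 1]).
        System.out.println("hadoop\t" + reduceKey(List.of(1, 1, 1)));
    }
}
```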
Commands and comments:
[cloudera@quickstart ~]$ hadoop fs -mkdir /user/cloudera/icp2/input_wordcount
Creating a directory in Hadoop
[cloudera@quickstart ~]$ cd Downloads
Change Directory to Downloads
[cloudera@quickstart Downloads]$ hadoop fs -put sample.txt /user/cloudera/icp2/input_wordcount
Putting the file from the local system into the Hadoop ICP2 folder.
[cloudera@quickstart Downloads]$ hadoop jar WordFreqCount.jar WordCount /user/cloudera/icp2/input_wordcount /user/cloudera/icp2/output_wordcount
Running the Hadoop jar file to execute the MapReduce job that counts the words in the input file.
[cloudera@quickstart Downloads]$ hadoop fs -ls /user/cloudera/icp2/output_wordcount
Listing the files in the directory; two items are present.
[cloudera@quickstart Downloads]$ hadoop fs -cat /user/cloudera/icp2/output_wordcount/part-r-00000
Viewing the word count output.
Output
Viewing the output file in HUE
2. Counting the frequency of words in given text file that starts with letter ‘a’
Algorithm
In the map phase;
A set of data is converted into another set of data, where individual elements are broken down into key-value pairs.
In the reduce phase;
For every key:
- a variable called sum is initialized to 0.
- sum is incremented for every occurrence of the key if the key starts with 'a'.
- finally, the key and its sum are written to an output file on HDFS.
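The filtering variant above can be sketched in plain Java as word count with one extra check. This is illustrative only, not the code in WordFreqaCount.jar; in the actual job, the starts-with-'a' check could sit in either the mapper or the reducer.

```java
import java.util.*;

// Sketch of the variant described above: identical to word count, except
// a word only contributes to the output when it starts with the letter 'a'.
public class StartsWithACount {

    static Map<String, Integer> countWordsStartingWithA(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.startsWith("a")) {               // keep only words starting with 'a'
                counts.merge(word, 1, Integer::sum);  // increment this word's count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWordsStartingWithA("An apple and a banana are tasty"));
    }
}
```

Placing the check in the mapper is usually cheaper, since filtered-out words are never shuffled across the network.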
Commands and comments:
[cloudera@quickstart ~]$ hadoop fs -mkdir /user/cloudera/icp2/input_wordstartingwitha
Creating a directory in Hadoop
[cloudera@quickstart ~]$ cd Downloads
Change Directory to Downloads
[cloudera@quickstart Downloads]$ hadoop fs -put inputfile.txt /user/cloudera/icp2/input_wordstartingwitha
Putting the file from the local system into the Hadoop ICP2 folder.
[cloudera@quickstart Downloads]$ hadoop jar WordFreqaCount.jar WordCount /user/cloudera/icp2/input_wordstartingwitha /user/cloudera/icp2/output_wordstartingwitha
Running the Hadoop jar file to execute the MapReduce job that counts the words starting with 'a' in the input file.
[cloudera@quickstart Downloads]$ hadoop fs -ls /user/cloudera/icp2/output_wordstartingwitha
Listing the files in the directory; two items are present.
[cloudera@quickstart Downloads]$ hadoop fs -cat /user/cloudera/icp2/output_wordstartingwitha/part-r-00000
Viewing the word count of words starting with the letter 'a'.
Output
Viewing the output file in HUE
3. Determining the prime numbers in the input and printing each number only once
Algorithm
In the map phase;
A set of data is converted into another set of data, where individual elements are broken down into key-value pairs.
In the reduce phase;
For every key (number):
- a variable called sum is initialized to 0 (assuming the number is prime).
- sum is set to 1 (not prime) if the number is divisible by any integer between 2 and the square root of the number, inclusive.
- finally, the result is written to an output file on HDFS.
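The primality test described above can be sketched as follows. This is a plain-Java illustration, not the code in PrimeNumber.jar, and it expresses the prime/not-prime result as a boolean rather than the 0/1 flag; note also that because each prime becomes a key, duplicates collapse during the reduce phase, which is why every prime is printed only once.

```java
import java.util.*;

// Sketch of the check described above: n is treated as prime unless it is
// divisible by some integer between 2 and sqrt(n) inclusive.
public class PrimeCheck {

    static boolean isPrime(long n) {
        if (n < 2) return false;              // 0 and 1 are not prime
        for (long d = 2; d * d <= n; d++) {   // trial division up to sqrt(n)
            if (n % d == 0) return false;     // found a divisor: not prime
        }
        return true;
    }

    public static void main(String[] args) {
        // Duplicate inputs yield each prime once, mirroring the reducer output.
        Set<Long> primes = new TreeSet<>();
        for (long n : new long[]{2, 3, 4, 9, 17, 17, 25, 29}) {
            if (isPrime(n)) primes.add(n);
        }
        System.out.println(primes);
    }
}
```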
Commands and comments:
[cloudera@quickstart ~]$ hadoop fs -mkdir /user/cloudera/icp2/input_numbers
Creating a directory in Hadoop
[cloudera@quickstart ~]$ cd Downloads
Change Directory to Downloads
[cloudera@quickstart Downloads]$ hadoop fs -put numbers.txt /user/cloudera/icp2/input_numbers
Putting the file from the local system into the Hadoop ICP2 folder.
[cloudera@quickstart Downloads]$ hadoop jar PrimeNumber.jar WordCount /user/cloudera/icp2/input_numbers /user/cloudera/icp2/output_numbers
Running the Hadoop jar file to execute the MapReduce job that finds the prime numbers in the input file.
[cloudera@quickstart Downloads]$ hadoop fs -ls /user/cloudera/icp2/output_numbers
Listing the files in the directory; two items are present.
[cloudera@quickstart Downloads]$ hadoop fs -cat /user/cloudera/icp2/output_numbers/part-r-00000
Viewing the prime number output.
Output
Viewing the output file in HUE
Learnings from the Lesson:
We learned how to create Java projects that run MapReduce jobs using Hadoop libraries; we then tested our algorithms on sample data and obtained the results shown above.
Limitations:
There were a few limitations with the bonus task, as there is no standard way to represent numeric data in a text file.
References:
https://data-flair.training/blogs/hadoop-mapreduce-flow/
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html