ICP 2 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki
Lesson 2: Hadoop MapReduce and Hadoop Distributed File System (HDFS)
Use case Description:
1. Counting the frequency of words in the given input (WordCount.txt) with the MapReduce algorithm
We create a project in Eclipse and import the required external JAR files.
Creating Class File and exporting it as JAR file
Creating directory in HDFS for input files
Copying the WordCount file from local to HDFS
Executing the generated JAR file, i.e. running the MapReduce code
It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.
In the Mapper, we split each line into tokens and emit every token as a key with a count of 1. The framework then shuffles the emitted pairs, grouping them by key so that all counts for the same word reach the same reducer.
In the Reducer, we aggregate the shuffled tokens and generate the word count.
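The map, shuffle, and reduce steps described above can be sketched in plain Java. This is only an illustration of the data flow, not the code from the exported JAR: it uses no Hadoop classes, and the class name `WordCountSketch` and its method names are hypothetical. In a real job the `map` and `reduce` logic would sit inside Hadoop's Mapper and Reducer classes, and the grouping would be done by the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the word-count flow (hypothetical names, no Hadoop API).
class WordCountSketch {

    // "Map" step: split one line into tokens and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(Map.entry(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: group the emitted values by key, sorted by key.
    // Hadoop performs this between the map and reduce phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps over all input lines and returns word -> count.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines are made up for illustration; they are not from WordCount.txt.
        System.out.println(countWords(List.of("big data is big", "data is data")));
    }
}
```

The sorted map stands in for the key-ordered output that ends up in part-r-00000.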
2. Counting the frequency of words in the given text file that start with the letter ‘a’
Creating the class file, exporting it as JAR file.
Executing the generated JAR file
It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.
In the Mapper, we split each line into tokens and emit only the tokens that start with 'a'.
To increase efficiency, we filter the tokens starting with 'a' in the Mapper rather than in the Reducer. Otherwise the Mapper would write the complete data to local disk and the Reducer would have to load all of it before filtering. By filtering in the map phase, only the matching tokens are written to local disk, and the Reducer reads just that smaller Mapper output.
In the Reducer, we aggregate the tokens and generate the word count.
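A minimal plain-Java sketch of filtering in the map step, under some assumptions: the match is case-sensitive on a lowercase 'a', tokens are separated by whitespace, and the names `LetterAFilterSketch`, `mapWithFilter`, and `reduce` are hypothetical. It also prints how many intermediate records survive the filter, which is the quantity the efficiency argument is about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the "starts with 'a'" count (hypothetical names, no Hadoop API).
class LetterAFilterSketch {

    // "Map" step with the filter applied early: only tokens starting with
    // a lowercase 'a' are emitted (case sensitivity is an assumption).
    static List<String> mapWithFilter(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("a")) {
                emitted.add(token);
            }
        }
        return emitted;
    }

    // "Reduce" step: count the occurrences of each emitted token.
    static Map<String, Integer> reduce(List<String> emitted) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : emitted) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines are made up for illustration.
        List<String> lines = List.of("an apple and a banana", "a cat ate an apple");
        List<String> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(mapWithFilter(line));
        }
        // Only the filtered tokens would be written to disk and shuffled.
        System.out.println("intermediate records: " + intermediate.size());
        System.out.println(reduce(intermediate));
    }
}
```

With the sample lines, 8 of the 10 tokens pass the filter; on real text the reduction in intermediate data would usually be much larger.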
Bonus Question:
Determine the prime numbers in the input and print each number only once
Creating the class file, exporting it as JAR file. Executing the generated JAR file
It generates two files in the output folder, one is _SUCCESS and the other is part-r-00000. The output can be found in part-r-00000.
In the Mapper, we split the given input into tokens. If any values are repeated, the combiner collapses them so that each unique key is passed to the reducer once.
In the Reducer, we first convert the input to an integer and then apply the prime-number check. If the number is prime, the output value is 0; otherwise it is 1.
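The prime check and the de-duplication can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the job's actual code: the input is taken to be whitespace-separated integers, trial division is one possible form of the "logic of prime number", a sorted map stands in for the combiner and key grouping, and the names `PrimeFlagSketch`, `isPrime`, and `flagPrimes` are hypothetical. The 0 and 1 values follow the convention stated above (0 for prime, 1 otherwise).

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the prime-number exercise (hypothetical names, no Hadoop API).
class PrimeFlagSketch {

    // Trial division up to sqrt(n); numbers below 2 are not prime.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Tokenizes the input, keeps each distinct number once (map keys are unique),
    // and attaches 0 if the number is prime, 1 otherwise.
    static Map<Integer, Integer> flagPrimes(String input) {
        Map<Integer, Integer> flags = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(input);
        while (tokens.hasMoreTokens()) {
            int n = Integer.parseInt(tokens.nextToken());
            flags.put(n, isPrime(n) ? 0 : 1);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration; 7 and 4 repeat on purpose.
        System.out.println(flagPrimes("7 4 7 13 1 4 2"));
    }
}
```

Repeated values such as 7 and 4 appear once in the result because the map key is the number itself, which mirrors how each unique key reaches the reducer once.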