ICP 2 - Gnkhakimova/CS5590-BigData GitHub Wiki

ICP 2

Hadoop MapReduce and Hadoop Distributed File System (HDFS)

Source Code
Bonus Source Code
Output - Part 1
Output - Part 2
Bonus Output

Tasks

  1. Count the frequency of each word in a given text file using MapReduce.
  2. Count the frequency of each word that starts with the letter "a" using MapReduce.

Configuration

  1. Oracle VirtualBox
  2. Cloudera
  3. Eclipse IDE

Features

For this task we used MapReduce to perform a word count. The code is written in Java using the Eclipse IDE. The first task outputs the count of every word, and the second task counts only the words that start with the letter "a". The input is first stored in HDFS, then fed to the MapReduce job, and the output is written back to HDFS.
1. Input files
Download the two input files locally. Using the Cloudera terminal, create a folder in HDFS and store the input files in it.

2. Implementation - Part 1
We used the given code to perform the word count. The WordCount class takes the input file and calls the "map" function first, then the "reduce" function.
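
The map-then-reduce flow described above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the logic, not the given WordCount code: the class `WordCountSketch` and its methods `map` and `reduce` are hypothetical names, and splitting lines on whitespace is an assumption about how the input is tokenized.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the word-count logic (hypothetical names, no Hadoop API).
class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"an apple a day", "a pear a day"}) {
            pairs.addAll(map(line));
        }
        // Prints {a=3, an=1, apple=1, day=2, pear=1}
        System.out.println(reduce(pairs));
    }
}
```

In a real Hadoop job the framework groups the pairs by key between the two phases; the sketch folds that grouping into `reduce` to stay short.
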
Run the WordCount class from the terminal using the following command.

3. Implementation - Part 2
We changed the map function by adding logic so that it only maps words that start with the letter "a".
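
The extra map-side check can be sketched in plain Java as follows. The class `StartsWithAFilter` and the method `mapWordsStartingWithA` are hypothetical names, and matching only a lowercase "a" is an assumption: the description above does not say whether an uppercase "A" should count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java illustration of the filtering step (hypothetical names, no Hadoop API).
class StartsWithAFilter {

    // Return only the tokens a mapper would emit: those beginning with "a".
    static List<String> mapWordsStartingWithA(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            if (word.startsWith("a")) { // assumption: case-sensitive match
                emitted.add(word);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Prints [an, apple, and, a]
        System.out.println(mapWordsStartingWithA("an apple and a Banana"));
    }
}
```

To count uppercase "A" as well, the check could compare against `word.toLowerCase()` instead.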

4. Output
The final result is written to a file in HDFS and then copied from HDFS to the local system.

Bonus

For the bonus part we had to check whether a given number is prime. We updated the reduce function to perform the prime check within the MapReduce job. The output is "1" for a prime number and "0" for a non-prime number.
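
A minimal plain-Java sketch of the prime check and its 1/0 output convention is given here. `PrimeCheckSketch`, `isPrime`, and `primeFlag` are hypothetical names, and trial division is an assumed technique: the description above does not say which primality test the reduce function uses.

```java
// Plain-Java illustration of the prime check (hypothetical names, no Hadoop API).
class PrimeCheckSketch {

    // Trial division by odd numbers up to sqrt(n); numbers below 2 are not prime.
    static boolean isPrime(long n) {
        if (n < 2) {
            return false;
        }
        if (n % 2 == 0) {
            return n == 2;
        }
        // d <= n / d is equivalent to d * d <= n but cannot overflow.
        for (long d = 3; d <= n / d; d += 2) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Output convention from the text: 1 for a prime number, 0 otherwise.
    static int primeFlag(long n) {
        return isPrime(n) ? 1 : 0;
    }

    public static void main(String[] args) {
        for (long n : new long[] {1, 2, 9, 17}) {
            System.out.println(n + "\t" + primeFlag(n));
        }
    }
}
```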

Limitation

  1. We had to spend some time configuring the environment.

Reference

  1. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
  2. http://www.shabdar.org/hadoop-java/138-how-to-create-and-run-eclipse-project-with-a-mapreduce-sample.html