ICP 03 : MapReduce - acikgozmehmet/BigDataProgramming GitHub Wiki
ICP 03: Hadoop Distributed File System (HDFS)/ MapReduce and Big Data Applications
Objectives
- Create a Map-Reduce Program to perform the task of matrix multiplication
1. Matrix Multiplication in Map Reduce
Suppose we have a i x j matrix M, whose element in row i and column j will be denoted mij and a j x k matrix N whose element in row j and column k is donated by njk then the product P = MN will be i x k matrix P whose element in row i and column k will be donated by pik , where p(i,k) = mij * njk
hadoop fs -mkdir /user/cloudera/icp3
hadoop fs -mkdir /user/cloudera/icp3/input
hadoop fs -put ./input/Matrix*.* /user/cloudera/icp3/input
Please find the code here: https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-03/SourceCode/MatrixMultiply.java
Map Function
Map function will create (key, value) pairs from the input data as it is described in the following algorithm.
Algorithm 1: The Map Function
-
For each element (mij) of matrix_M, it will create (key, value) pairs as ((i,k), (M,j,mij)) for k =1, 2 , ... up to the number of columns of N.
-
For each element (njk) of matrix_N, it will create (key, value) pairs as ((i,k), (N,j,njk)) for k =1, 2 , ... up to the number of columns of M.
- We will have a set of (key, value) pairs that each key (i,k) has a list with values (M,j, mij) and (N,j,njk)) for all possible values of j.
Reduce Function
Reduce function will use the output of the Map function and perform the calculations and produces key,value pairs as described in the following algorithm. Please note that all outputs are written to HDFS.
Algorithm 2: The Reduce Function
-
For each key (i,k);
it will sort values begin with M by j in the list List-M it will sort values begin with N by j in the list List-N it will multiply mij abd njk for the jth value of each list. it will sum up mij*njk
-
It will return the result.
To run the program:
hadoop jar MatrixMultiply-1.0.jar MatrixMultiply /user/cloudera/icp3/input /user/cloudera/icp3/output
References :