Hadoop Streaming Tips


The normal command syntax for submitting a streaming Hadoop job is long, and it is time consuming to set up all the parameters before each run.

Listed below are some tips you can use to speed up MapReduce job submission when using the streaming jar, i.e. when your mappers and reducers are written in a language other than Java.

Command to submit a job:

hadoop jar <streaming jar path> -mapper <mapper script> -reducer <reducer script> -file <mapper script> -file <reducer script> -input <HDFS input path> -output <HDFS output path>

Example:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input myinput -output joboutput

If you are using Linux or macOS, you can follow the instructions below to speed up the process of job submission.

1. Open your .bashrc file in an editor:

vim ~/.bashrc

The user-specific file (hidden by default) is:

~/.bashrc

The system-wide equivalent is:

/etc/bash.bashrc

If the user-specific file does not exist, simply create it.
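
An empty file is enough to start with; for example:

touch ~/.bashrc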

2. Add the lines below to your .bashrc (or .bash_profile).


# Normal job with no combiner
# Usage: run_mapreduce <mapper> <reducer> <input dir> <output dir>
# The -file flags ship the local scripts to the cluster along with the job.
run_mapreduce() {
    hadoop jar <add your streaming jar file path here> -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}

# Job in which the reducer also acts as the combiner
# Usage: run_mapreduce_with_combiner <mapper> <reducer/combiner> <input dir> <output dir>
run_mapreduce_with_combiner() {
    hadoop jar <add your streaming jar file path here> -mapper "$1" -reducer "$2" -combiner "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}

# Job with a separately written combiner file
# Usage: run_mapreduce_with_combiner2 <mapper> <combiner> <reducer> <input dir> <output dir>
run_mapreduce_with_combiner2() {
    hadoop jar <streaming jar file path here> -mapper "$1" -combiner "$2" -reducer "$3" -file "$1" -file "$2" -file "$3" -input "$4" -output "$5"
}

alias hs=run_mapreduce
alias hsc=run_mapreduce_with_combiner
alias hsc2=run_mapreduce_with_combiner2

Sample, with an actual jar path filled in (here a Homebrew install on macOS):

run_mapreduce() {
    hadoop jar /usr/local/Cellar/hadoop/2.7.3/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -mapper "$1" \
    -reducer "$2" \
    -file "$1" \
    -file "$2" \
    -input "$3" \
    -output "$4"
}

3. After adding the above lines to your .bashrc, save the file.

4. Source the file using one of the commands below:

source ~/.bashrc

or

source /etc/bash.bashrc
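
To confirm that the shortcuts are now available in your current shell, you can check them with the bash builtin type (the exact output wording varies slightly between shells):

type hs hsc hsc2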

5. Running job commands after adding the aliases:

# Normal MapReduce job
hs mapper.py reducer.py myinput joboutput

# Job in which the reducer also acts as the combiner
hsc mapper.py reducer.py myinput joboutput

# Job with a separate combiner file
hsc2 mapper.py combiner.py reducer.py myinput joboutput
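
Once a job finishes, its results are written to the HDFS output directory passed as the last argument (joboutput in these examples), and they can be inspected with the standard HDFS shell commands:

# List the output files and print their contents
hadoop fs -ls joboutput
hadoop fs -cat joboutput/part-*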

Running the tests locally

Hadoop Streaming mappers read input lines from stdin and emit tab-separated key/value pairs on stdout, and reducers read the sorted pairs from stdin, so the same scripts can be tested with a simple shell pipeline. Assuming the mapper and reducer are written in Python, the syntax is:

 cat <test data path> | python <mapper name> | sort | python <reducer name>

Example:

 cat test_posts.csv | python mapper.py | sort | python reducer.py
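
If you also want to compare the local result with the cluster output, one possible approach (assuming the local test file matches the myinput data on HDFS; local_output.txt and cluster_output.txt are just example file names) is to sort both outputs and diff them:

 cat test_posts.csv | python mapper.py | sort | python reducer.py | sort > local_output.txt
 hadoop fs -cat joboutput/part-* | sort > cluster_output.txt
 diff local_output.txt cluster_output.txt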