Hadoop Streaming Tips - lalet/big-data-analytics-course GitHub Wiki
Welcome to the big-data-analytics-course wiki!
The normal command syntax for submitting a streaming Hadoop job is long, and it is time-consuming to set up the parameters before each run.
Listed below are some tips you can use to speed up MapReduce job submission when you are using the streaming jar, i.e., when you run a MapReduce job with mappers and reducers written in a language other than Java.
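Submit a job command:
hadoop jar <streaming jar path> -mapper <mapper file> -reducer <reducer file> -file <mapper file> -file <reducer file> -input <input path> -output <output path>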
Example :
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input myinput -output joboutput
If you are using Linux or macOS, you can follow the instructions below to speed up job submission.
1. Open your .bashrc file in edit mode.
vim ~/.bashrc
The user-specific file is hidden by default:
~/.bashrc
If it does not exist, simply create it.
The system-wide file is:
/etc/bash.bashrc
2. Add the lines below to your .bashrc or .bash_profile.
# Normal job with no combiner
run_mapreduce() {
    hadoop jar <add your streaming jar file path here> -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
# Job with a combiner, where the reducer script also acts as the combiner
run_mapreduce_with_combiner() {
    hadoop jar <add your streaming jar file path here> -mapper "$1" -reducer "$2" -combiner "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
# Job with a combiner written as a separate file
run_mapreduce_with_combiner2() {
    hadoop jar <add your streaming jar file path here> -mapper "$1" -combiner "$2" -reducer "$3" -file "$1" -file "$2" -file "$3" -input "$4" -output "$5"
}
alias hs=run_mapreduce
alias hsc=run_mapreduce_with_combiner
alias hsc2=run_mapreduce_with_combiner2
Sample:
run_mapreduce() {
    # Each trailing backslash continues the command on the next line
    hadoop jar /usr/local/Cellar/hadoop/2.7.3/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
        -mapper "$1" \
        -reducer "$2" \
        -file "$1" \
        -file "$2" \
        -input "$3" \
        -output "$4"
}
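The functions above pass their arguments straight to Hadoop, so a typo in a file name only shows up once the job has been submitted. A minimal sketch of a variant that checks its arguments first is shown below. The names `run_mapreduce_checked` and `STREAMING_JAR` are hypothetical and not part of Hadoop, and the default jar path is a placeholder you would replace with your own.

```shell
# Assumed variable: point this at your own streaming jar.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"

# Hypothetical variant of run_mapreduce that validates its arguments.
# Usage: run_mapreduce_checked <mapper> <reducer> <input> <output>
run_mapreduce_checked() {
    local script
    if [ "$#" -ne 4 ]; then
        echo "usage: run_mapreduce_checked <mapper> <reducer> <input> <output>" >&2
        return 1
    fi
    # Fail early if either script is missing on the local file system
    for script in "$1" "$2"; do
        if [ ! -f "$script" ]; then
            echo "error: $script not found" >&2
            return 1
        fi
    done
    hadoop jar "$STREAMING_JAR" \
        -mapper "$1" -reducer "$2" \
        -file "$1" -file "$2" \
        -input "$3" -output "$4"
}
```

Keeping the jar path in one variable means a Hadoop upgrade only requires changing a single line in the profile.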
3. After adding the above lines to your .bashrc, save the file.
4. Source the file using one of the commands below:
source ~/.bashrc
or
source /etc/bash.bashrc
5. Run jobs using the aliases:
# Normal MapReduce job
hs mapper.py reducer.py myinput joboutput
#Job command with reducer doing the function of a combiner
hsc mapper.py reducer.py myinput joboutput
#Job command with a separate combiner file
hsc2 mapper.py combiner.py reducer.py myinput joboutput
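When a job is rerun with the same output path, Hadoop refuses to start because the output directory already exists. A small helper along the following lines could clear it first. This is only a sketch: the name `hs_rmout` is hypothetical, and it relies on the `hadoop fs -test -d` and `hadoop fs -rm -r` file system commands.

```shell
# Hypothetical helper: delete a leftover HDFS output directory, if any.
# Usage: hs_rmout <output directory>
hs_rmout() {
    if [ "$#" -ne 1 ]; then
        echo "usage: hs_rmout <output directory>" >&2
        return 1
    fi
    # "hadoop fs -test -d" exits with status 0 when the directory exists
    if hadoop fs -test -d "$1"; then
        hadoop fs -rm -r "$1"
    fi
}
```

It could then be chained with the alias, for example `hs_rmout joboutput && hs mapper.py reducer.py myinput joboutput`. Note that this permanently deletes the previous results, so use it only for output you no longer need.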
Before submitting a job, you can test the mapper and reducer locally with a shell pipeline. Assuming both are written in Python, the syntax is:
cat <test data path> | python <mapper name> | sort | python <reducer name>
Example:
cat test_posts.csv | python mapper.py | sort | python reducer.py
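The local pipeline can be wrapped in a function in the same way as the job submission commands. The sketch below takes its arguments in the same order as `hs`, with `sort` standing in for the shuffle phase. The function name `hs_local` and the `HS_INTERPRETER` variable are assumptions made for illustration.

```shell
# Hypothetical helper: run mapper | sort | reducer locally, without Hadoop.
# Usage: hs_local <mapper> <reducer> <input file>
# HS_INTERPRETER is an assumed variable naming the interpreter (default: python).
hs_local() {
    local interpreter
    if [ "$#" -ne 3 ]; then
        echo "usage: hs_local <mapper> <reducer> <input file>" >&2
        return 1
    fi
    interpreter="${HS_INTERPRETER:-python}"
    # sort groups identical keys together, as the shuffle phase would
    "$interpreter" "$1" < "$3" | sort | "$interpreter" "$2"
}
```

With this in your profile, the example above becomes `hs_local mapper.py reducer.py test_posts.csv`.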