Hadoop QuickStart Guide
Add user dong to group hadoop:
[root@hyades ~]# usermod -a -G hadoop dong
Start Hadoop daemons on the GPU nodes:
[root@gpu-1 ~]# su - hduser
[hduser@gpu-1 ~]$ start-dfs.sh
[hduser@gpu-1 ~]$ start-yarn.sh
[hduser@gpu-1 ~]$ mr-jobhistory-daemon.sh start historyserver
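Optionally, verify that the daemons came up with the jps tool that ships with the JDK; depending on which daemons are assigned to gpu-1, the listing should include processes such as NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer:
[hduser@gpu-1 ~]$ jps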
Create a directory /user in HDFS:
[hduser@gpu-1 ~]$ hdfs dfs -mkdir /user
[hduser@gpu-1 ~]$ hdfs dfs -chmod 1777 /user
Create user dong's home directory in HDFS:
[dong@gpu-1 ~]$ module load hadoop
[dong@gpu-1 ~]$ hdfs dfs -mkdir -p /user/dong
The following commands are equivalent:
[dong@gpu-1 ~]$ hdfs dfs -ls
[dong@gpu-1 ~]$ hdfs dfs -ls .
[dong@gpu-1 ~]$ hdfs dfs -ls /user/dong
[dong@gpu-1 ~]$ hdfs dfs -ls hdfs://gpu-1/user/dong
[dong@gpu-1 ~]$ hdfs dfs -ls hdfs://gpu-1:8020/user/dong
[dong@gpu-1 ~]$ hadoop fs -ls
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks[1].
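To see the three phases in miniature before touching the cluster, here is a conceptual sketch in plain Python (an illustration only, not Hadoop code and not tied to the examples below): the map step emits a (word, 1) pair for every word, the framework sorts the pairs by key, and the reduce step sums the counts for each key.

from itertools import groupby
from operator import itemgetter

lines = ["call me ishmael", "call me ahab"]                 # one input split
mapped = [(w, 1) for line in lines for w in line.split()]   # map: emit (word, 1)
mapped.sort(key=itemgetter(0))                              # shuffle/sort by key
reduced = {word: sum(n for _, n in group)                   # reduce: sum per key
           for word, group in groupby(mapped, key=itemgetter(0))}
print(reduced)    # {'ahab': 1, 'call': 2, 'ishmael': 1, 'me': 2}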
Here we demonstrate how to run Hadoop MapReduce applications on the Hadoop cluster running on the 8 GPU nodes.
Make sure the hadoop module is loaded:
[dong@gpu-1 ~]$ module load hadoop
which, among other things, defines the environment variable HADOOP_CLASSPATH:
HADOOP_CLASSPATH=/usr/java/latest/lib/tools.jar
which is required to compile Java Hadoop programs.
Create the input directory in HDFS:
$ hdfs dfs -mkdir -p wordcount/input
For this guide, we'll use Herman Melville's classic novel Moby Dick as the raw input text. We'll run the WordCount example applications from the MapReduce Tutorial to count the number of occurrences of each word in the text file.
$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
$ hdfs dfs -put pg2701.txt wordcount/input/MobyDick.txt
NOTE: the copy stored in HDFS is renamed to MobyDick.txt.
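A quick sanity check that the file landed in HDFS:
$ hdfs dfs -ls wordcount/input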
Compile WordCount v1.0:
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
Run the application:
$ hadoop jar wc.jar WordCount wordcount/input wordcount/output
or
$ hadoop jar wc.jar WordCount wordcount/input/MobyDick.txt wordcount/output
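The results are written to the wordcount/output directory in HDFS; with a single reducer the output typically lands in a file named part-r-00000, which you can inspect with:
$ hdfs dfs -cat wordcount/output/part-r-00000 | head
Note that MapReduce will not overwrite an existing output directory, so remove it first (hdfs dfs -rm -r wordcount/output) if you want to re-run the job.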
WordCount v2.0 is a more complete version which uses many features provided by the MapReduce framework. However, there is a bug in the code[2].
Replace line 45 of WordCount2.java:
if (conf.getBoolean("wordcount.skip.patterns", true)) {
with
if (conf.getBoolean("wordcount.skip.patterns", false)) {
Compile WordCount v2.0:
$ hadoop com.sun.tools.javac.Main WordCount2.java
$ jar cf wc2.jar WordCount2*.class
Run the application:
$ hadoop jar wc2.jar WordCount2 wordcount/input wordcount/output2
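With the fix in place, the extra features of WordCount v2.0 are only activated when requested on the command line. For example, following the sample run in the MapReduce Tutorial, you can count case-insensitively and skip unwanted patterns (the patterns.txt name here is only an illustration; it must first be uploaded to HDFS, and the earlier wordcount/output2 directory removed before re-running):
$ hadoop jar wc2.jar WordCount2 -Dwordcount.case.sensitive=false wordcount/input wordcount/output2 -skip patterns.txt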
If Java is not your cup of tea, you can use Hadoop Streaming to create and run Map/Reduce jobs with any executable (written in any language) or script as the mapper and/or the reducer[3].
$ hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar -help
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName> Optional. The output format class.
  -partitioner    <JavaClassName> Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. Input recordreader spec.
  -cmdenv         <n>=<v> Optional. Pass env.var to streaming commands.
  -mapdebug       <cmd> Optional. To run this script when a map task fails.
  -reducedebug    <cmd> Optional. To run this script when a reduce task fails.
  -io             <identifier> Optional. Format to use for input to and output from mapper/reducer commands
  -lazyOutput     Optional. Lazily create Output.
  -background     Optional. Submit the job and don't wait till it completes.
  -verbose        Optional. Print verbose output.
  -info           Optional. Print detailed usage.
  -help           Optional. Print help message.

Generic options supported are
  -conf <configuration file>                     specify an application configuration file
  -D <property=value>                            use value for given property
  -fs <local|namenode:port>                      specify a namenode
  -jt <local|jobtracker:port>                    specify a job tracker
  -files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info
Dr. Glenn K. Lockwood has written an excellent article demonstrating how to write Hadoop applications in Python with Hadoop Streaming[4]. Here we'll use his WordCount programs in Python as examples.
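His mapper.py and reducer.py are not reproduced here; the sketch below shows what a minimal streaming word-count mapper and reducer typically look like (an illustration, not Dr. Lockwood's exact code). The mapper emits one tab-separated "word 1" pair per word on stdin; the reducer relies on its input arriving sorted by key (which Hadoop Streaming, or sort in the serial test below, provides) and sums the counts of consecutive identical words.

#!/usr/bin/env python
# mapper.py -- emit "<word>\t1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- sum the counts of consecutive identical words on stdin
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))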
To test the mapper and reducer serially:
$ cat pg2701.txt | ./mapper.py | sort | ./reducer.py > output.txt
To launch the Hadoop job:
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -mapper "python $PWD/mapper.py" \
    -reducer "python $PWD/reducer.py" \
    -input wordcount/input \
    -output wordcount/output
or
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -mapper "$PWD/mapper.py" \
    -reducer "$PWD/reducer.py" \
    -input wordcount/input \
    -output wordcount/output
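Both invocations above assume mapper.py and reducer.py are visible at the same path ($PWD) on every node, e.g. on a shared filesystem. If that is not the case, the generic -files option listed in the help output above can ship the scripts to the compute nodes (wordcount/output_streaming below is simply a fresh output directory name):
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input wordcount/input \
    -output wordcount/output_streaming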