Hadoop QuickStart Guide
Add user dong to group hadoop:
[root@hyades ~]# usermod -a -G hadoop dong
Start Hadoop daemons on the GPU nodes:
[root@gpu-1 ~]# su - hduser
[hduser@gpu-1 ~]$ start-dfs.sh
[hduser@gpu-1 ~]$ start-yarn.sh
[hduser@gpu-1 ~]$ mr-jobhistory-daemon.sh start historyserver
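Optionally, verify that the daemons came up with the jps tool that ships with the JDK; depending on which daemons are assigned to gpu-1, the listing should include processes such as NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer:
[hduser@gpu-1 ~]$ jps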
Create a directory /user in HDFS:
[hduser@gpu-1 ~]$ hdfs dfs -mkdir /user
[hduser@gpu-1 ~]$ hdfs dfs -chmod 1777 /user
Create user dong's home directory in HDFS:
[dong@gpu-1 ~]$ module load hadoop
[dong@gpu-1 ~]$ hdfs dfs -mkdir -p /user/dong
The following commands are equivalent:
[dong@gpu-1 ~]$ hdfs dfs -ls
[dong@gpu-1 ~]$ hdfs dfs -ls .
[dong@gpu-1 ~]$ hdfs dfs -ls /user/dong
[dong@gpu-1 ~]$ hdfs dfs -ls hdfs://gpu-1/user/dong
[dong@gpu-1 ~]$ hdfs dfs -ls hdfs://gpu-1:8020/user/dong
[dong@gpu-1 ~]$ hadoop fs -ls
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks[1].
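To see the three phases in miniature before touching the cluster, here is a conceptual sketch in plain Python (an illustration only, not Hadoop code and not tied to the examples below): the map step emits a (word, 1) pair for every word, the framework sorts the pairs by key, and the reduce step sums the counts for each key.

from itertools import groupby
from operator import itemgetter

lines = ["call me ishmael", "call me ahab"]                 # one input split
mapped = [(w, 1) for line in lines for w in line.split()]   # map: emit (word, 1)
mapped.sort(key=itemgetter(0))                              # shuffle/sort by key
reduced = {word: sum(n for _, n in group)                   # reduce: sum per key
           for word, group in groupby(mapped, key=itemgetter(0))}
print(reduced)    # {'ahab': 1, 'call': 2, 'ishmael': 1, 'me': 2}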
Here we demonstrate how to run Hadoop MapReduce applications on the Hadoop cluster running on the 8 GPU nodes.
Make sure the hadoop module is loaded:
[dong@gpu-1 ~]$ module load hadoop
which, among other things, defines the environment variable HADOOP_CLASSPATH:
HADOOP_CLASSPATH=/usr/java/latest/lib/tools.jar
which is required to compile Java Hadoop programs.
Create the input directory in HDFS:
$ hdfs dfs -mkdir -p wordcount/input
For this guide, we'll use Herman Melville's classic novel Moby Dick as the raw input text. We'll run the WordCount example applications from the MapReduce Tutorial to count the number of occurrences of each word in the text file.
$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
$ hdfs dfs -put pg2701.txt wordcount/input/MobyDick.txt
NOTE: the copy stored in HDFS is renamed to MobyDick.txt.
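A quick sanity check that the file landed in HDFS:
$ hdfs dfs -ls wordcount/input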
Compile WordCount v1.0:
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
Run the application:
$ hadoop jar wc.jar WordCount wordcount/input wordcount/output
or
$ hadoop jar wc.jar WordCount wordcount/input/MobyDick.txt wordcount/output
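The results are written to the wordcount/output directory in HDFS; with a single reducer the output typically lands in a file named part-r-00000, which you can inspect with:
$ hdfs dfs -cat wordcount/output/part-r-00000 | head
Note that MapReduce will not overwrite an existing output directory, so remove it first (hdfs dfs -rm -r wordcount/output) if you want to re-run the job.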
WordCount v2.0 is a more complete version which uses many features provided by the MapReduce framework. However, there is a bug in the code[2].
Replace line 45 of WordCount2.java:
if (conf.getBoolean("wordcount.skip.patterns", true)) {
with
if (conf.getBoolean("wordcount.skip.patterns", false)) {
Compile WordCount v2.0:
$ hadoop com.sun.tools.javac.Main WordCount2.java
$ jar cf wc2.jar WordCount2*.class
Run the application:
$ hadoop jar wc2.jar WordCount2 wordcount/input wordcount/output2
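With the fix in place, the extra features of WordCount v2.0 are only activated when requested on the command line. For example, following the sample run in the MapReduce Tutorial, you can count case-insensitively and skip unwanted patterns (the patterns.txt name here is only an illustration; it must first be uploaded to HDFS, and the earlier wordcount/output2 directory removed before re-running):
$ hadoop jar wc2.jar WordCount2 -Dwordcount.case.sensitive=false wordcount/input wordcount/output2 -skip patterns.txt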
If Java is not your cup of tea, you can use Hadoop Streaming to create and run Map/Reduce jobs with any executable (written in any language) or script as the mapper and/or the reducer[3].
$ hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar -help
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName> Optional. The output format class.
  -partitioner    <JavaClassName> Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. Input recordreader spec.
  -cmdenv         <n>=<v> Optional. Pass env.var to streaming commands.
  -mapdebug       <cmd> Optional. To run this script when a map task fails.
  -reducedebug    <cmd> Optional. To run this script when a reduce task fails.
  -io             <identifier> Optional. Format to use for input to and output from mapper/reducer commands
  -lazyOutput     Optional. Lazily create Output.
  -background     Optional. Submit the job and don't wait till it completes.
  -verbose        Optional. Print verbose output.
  -info           Optional. Print detailed usage.
  -help           Optional. Print help message.

Generic options supported are
  -conf <configuration file>                     specify an application configuration file
  -D <property=value>                            use value for given property
  -fs <local|namenode:port>                      specify a namenode
  -jt <local|jobtracker:port>                    specify a job tracker
  -files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info
Dr. Glenn K. Lockwood has written an excellent article demonstrating how to write Hadoop applications in Python with Hadoop Streaming[4]. Here we'll use his WordCount programs in Python as examples.
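His mapper.py and reducer.py are not reproduced here; the sketch below shows what a minimal streaming word-count mapper and reducer typically look like (an illustration, not Dr. Lockwood's exact code). The mapper emits one tab-separated "word 1" pair per word on stdin; the reducer relies on its input arriving sorted by key (which Hadoop Streaming, or sort in the serial test below, provides) and sums the counts of consecutive identical words.

#!/usr/bin/env python
# mapper.py -- emit "<word>\t1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- sum the counts of consecutive identical words on stdin
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))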
To test the mapper and reducer serially:
$ cat pg2701.txt | ./mapper.py | sort | ./reducer.py > output.txt
To launch the Hadoop job:
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -mapper "python $PWD/mapper.py" \
    -reducer "python $PWD/reducer.py" \
    -input wordcount/input \
    -output wordcount/output
or
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -mapper "$PWD/mapper.py" \
    -reducer "$PWD/reducer.py" \
    -input wordcount/input \
    -output wordcount/output
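Both invocations above assume mapper.py and reducer.py are visible at the same path ($PWD) on every node, e.g. on a shared filesystem. If that is not the case, the generic -files option listed in the help output above can ship the scripts to the compute nodes (wordcount/output_streaming below is simply a fresh output directory name):
$ hadoop \
    jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-${HADOOP_VERSION}.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input wordcount/input \
    -output wordcount/output_streaming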