Hadoop Wordcount Java
source: http://hanishblogger.blogspot.com/2012/12/how-to-run-examples-of-hadoop.html
Every Hadoop installation directory contains a jar with all the example programs; we can run them directly by passing the required arguments. With hadoop-3.2.1, the examples jar is /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar (Cloudera distributions ship an equivalent examples jar as well).
Download any text file to give as input to the wordcount program, or use:
wget ftp://hadoop_ftp:[email protected]/bigdata/4300.txt
Copy this file into a working directory, say /home/hadoop/dft/4300.txt (or fetch it there directly from the repository):
cd /home/hadoop
mkdir dft
cd dft
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/refs/heads/master/wordcount/4300.txt
cp /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar .
This data is still on your local machine, so copy the data files into HDFS. Run the following command:
hdfs dfs -copyFromLocal /home/hadoop/dft/4300.txt /dft
** Make sure you have created /dft in the Hadoop file system first:
hdfs dfs -mkdir /dft
Check the contents of your directory:
hdfs dfs -ls /dft
hdfs dfs -cat /dft/4300.txt
Edit /home/hadoop/hadoop/etc/hadoop/mapred-site.xml
so that it contains the following properties (in particular the MapReduce classpath):
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
Now run your program:
cd /home/hadoop/dft
hadoop jar ./hadoop-mapreduce-examples-3.2.1.jar wordcount /dft /dft-output
You will see output similar to the following:
15/04/25 17:34:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/25 17:34:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/04/25 17:34:55 INFO input.FileInputFormat: Total input paths to process : 1
15/04/25 17:34:56 INFO mapreduce.JobSubmitter: number of splits:1
15/04/25 17:34:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1429946598372_0002
15/04/25 17:34:57 INFO impl.YarnClientImpl: Submitted application application_1429946598372_0002
15/04/25 17:34:57 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1429946598372_0002/
15/04/25 17:34:57 INFO mapreduce.Job: Running job: job_1429946598372_0002
15/04/25 17:35:59 INFO mapreduce.Job: Job job_1429946598372_0002 running in uber mode : false
15/04/25 17:35:59 INFO mapreduce.Job: map 0% reduce 0%
15/04/25 17:36:12 INFO mapreduce.Job: map 100% reduce 0%
15/04/25 17:36:27 INFO mapreduce.Job: map 100% reduce 100%
15/04/25 17:36:28 INFO mapreduce.Job: Job job_1429946598372_0002 completed successfully
15/04/25 17:36:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=725025
FILE: Number of bytes written=1661195
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1573143
HDFS: Number of bytes written=527522
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=10875
Total time spent by all reduces in occupied slots (ms)=11743
Total time spent by all map tasks (ms)=10875
Total time spent by all reduce tasks (ms)=11743
Total vcore-seconds taken by all map tasks=10875
Total vcore-seconds taken by all reduce tasks=11743
Total megabyte-seconds taken by all map tasks=11136000
Total megabyte-seconds taken by all reduce tasks=12024832
Map-Reduce Framework
Map input records=33055
Map output records=267975
Map output bytes=2601773
Map output materialized bytes=725025
Input split bytes=99
Combine input records=267975
Combine output records=50091
Reduce input groups=50091
Reduce shuffle bytes=725025
Reduce input records=50091
Reduce output records=50091
Spilled Records=100182
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=323
CPU time spent (ms)=4940
Physical memory (bytes) snapshot=269115392
Virtual memory (bytes) snapshot=4150345728
Total committed heap usage (bytes)=137498624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1573044
File Output Format Counters
Bytes Written=527522
See the output directory in HDFS:
hdfs dfs -ls /
hdfs dfs -ls /dft-output
Finally, let's take a look at the output:
hdfs dfs -cat /dft-output/part-r-00000 | less
If we want to copy the output file to our local storage (remember, the output is created in HDFS, so we have to copy the data from there to our local file system to work on it):
hdfs dfs -copyToLocal /dft-output/part-r-00000 .
To remove the output directory from HDFS:
hdfs dfs -rm -r /dft-output
Note: The Hadoop WordCount program will not run a second time if the output directory already exists. It always wants to create a new one, so we have to remove the output directory after saving the output of each job.
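The examples jar does not clean up for you, but if you write your own driver (as in the Compile WordCount section below), you can delete the output path programmatically before each run. Here is a minimal sketch using the Hadoop FileSystem API; the class name CleanOutputDir and the hard-coded /dft-output path are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: delete an existing output directory so the next job run
// does not fail with "Output directory ... already exists".
public class CleanOutputDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // the file system named in fs.defaultFS (HDFS here)
    Path out = new Path("/dft-output");         // output path used in the runs above (illustrative)
    if (fs.exists(out)) {
      fs.delete(out, true);                     // true = delete recursively
    }
  }
}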
You may want to start the job history server:
mr-jobhistory-daemon.sh start historyserver
Compile WordCount
ref: (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code)
Make sure you have set up the following in .bashrc (add these lines if they are not already there):
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=./
export CLASSPATH=$CLASSPATH:`hadoop classpath`:.:
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
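Next you need the WordCount.java source. The reference above points to the older mapred-API tutorial; a minimal version using the current org.apache.hadoop.mapreduce API, adapted from the Apache Hadoop MapReduce tutorial with comments added, is sketched below. The class name WordCount must match the class name used in the run command that follows.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; args[0] = input path, args[1] = output path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}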
Then compile WordCount.java
mkdir wordcount_classes
javac -d wordcount_classes WordCount.java
jar cvf wc.jar -C wordcount_classes/ .
Run the program
hadoop jar wc.jar WordCount /dft /dft-output2
Try other examples: (/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar)
For running Hadoop's pi example (source reference: https://code.google.com/p/mrs-mapreduce/source/browse/examples/pi/Pi.java), enter the command below:
hadoop jar hadoop-mapreduce-examples-3.2.1.jar pi 10 10
You can also check the web resource manager while the job is running to see what is going on:
http://<ip>:8088/cluster/nodes

Note: the last two arguments specify the number of map tasks and the number of samples per map, respectively; we can pick any values here. For running Hadoop's sudoku example, enter the command below:
bin/hadoop jar hadoop-mapreduce-examples-3.2.1.jar sudoku puzzle1.dta