Hadoop Wordcount Java
source: http://hanishblogger.blogspot.com/2012/12/how-to-run-examples-of-hadoop.html
Every Hadoop installation directory contains a jar with all the example programs; we can run them directly by passing the required arguments. With hadoop-3.2.1, the examples jar is /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar (Cloudera distributions ship an equivalent examples jar as well).
Download any text file to give as input to the wordcount program, or use:
wget ftp://hadoop_ftp:[email protected]/bigdata/4300.txt
Copy this file into a working directory, say /home/hadoop/dft/4300.txt (or fetch it there directly from the repository):
cd /home/hadoop
mkdir dft
cd dft
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/refs/heads/master/wordcount/4300.txt
cp /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar .
This data is still on your local machine, so copy the data files into HDFS. Run the following command:
hdfs dfs -copyFromLocal /home/hadoop/dft/4300.txt /dft
** Make sure you have created /dft in the Hadoop file system first:
hdfs dfs -mkdir /dft
Check the contents of your directory:
hdfs dfs -ls /dft
hdfs dfs -cat /dft/4300.txt
Edit /home/hadoop/hadoop/etc/hadoop/mapred-site.xml
so that it contains the following properties (in particular the MapReduce classpath):
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
Now run your program:
cd /home/hadoop/dft
hadoop jar ./hadoop-mapreduce-examples-3.2.1.jar wordcount /dft /dft-output
You will see output similar to the following:
15/04/25 17:34:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/25 17:34:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/04/25 17:34:55 INFO input.FileInputFormat: Total input paths to process : 1
15/04/25 17:34:56 INFO mapreduce.JobSubmitter: number of splits:1
15/04/25 17:34:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1429946598372_0002
15/04/25 17:34:57 INFO impl.YarnClientImpl: Submitted application application_1429946598372_0002
15/04/25 17:34:57 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1429946598372_0002/
15/04/25 17:34:57 INFO mapreduce.Job: Running job: job_1429946598372_0002
15/04/25 17:35:59 INFO mapreduce.Job: Job job_1429946598372_0002 running in uber mode : false
15/04/25 17:35:59 INFO mapreduce.Job: map 0% reduce 0%
15/04/25 17:36:12 INFO mapreduce.Job: map 100% reduce 0%
15/04/25 17:36:27 INFO mapreduce.Job: map 100% reduce 100%
15/04/25 17:36:28 INFO mapreduce.Job: Job job_1429946598372_0002 completed successfully
15/04/25 17:36:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=725025
FILE: Number of bytes written=1661195
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1573143
HDFS: Number of bytes written=527522
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=10875
Total time spent by all reduces in occupied slots (ms)=11743
Total time spent by all map tasks (ms)=10875
Total time spent by all reduce tasks (ms)=11743
Total vcore-seconds taken by all map tasks=10875
Total vcore-seconds taken by all reduce tasks=11743
Total megabyte-seconds taken by all map tasks=11136000
Total megabyte-seconds taken by all reduce tasks=12024832
Map-Reduce Framework
Map input records=33055
Map output records=267975
Map output bytes=2601773
Map output materialized bytes=725025
Input split bytes=99
Combine input records=267975
Combine output records=50091
Reduce input groups=50091
Reduce shuffle bytes=725025
Reduce input records=50091
Reduce output records=50091
Spilled Records=100182
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=323
CPU time spent (ms)=4940
Physical memory (bytes) snapshot=269115392
Virtual memory (bytes) snapshot=4150345728
Total committed heap usage (bytes)=137498624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1573044
File Output Format Counters
Bytes Written=527522
See the output directory in HDFS:
hdfs dfs -ls /
hdfs dfs -ls /dft-output
Finally, let's take a look at the output:
hdfs dfs -cat /dft-output/part-r-00000 | less
If we want to copy the output file to our local storage (remember, the output is created in HDFS, so we have to copy the data from there to our local file system to work on it):
hdfs dfs -copyToLocal /dft-output/part-r-00000 .
To remove the output directory from HDFS:
hdfs dfs -rm -r /dft-output
Note: The Hadoop WordCount program will not run a second time if the output directory already exists. It always wants to create a new one, so we have to remove the output directory after saving the output of each job.
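The examples jar does not clean up for you, but if you write your own driver (as in the Compile WordCount section below), you can delete the output path programmatically before each run. Here is a minimal sketch using the Hadoop FileSystem API; the class name CleanOutputDir and the hard-coded /dft-output path are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: delete an existing output directory so the next job run
// does not fail with "Output directory ... already exists".
public class CleanOutputDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // the file system named in fs.defaultFS (HDFS here)
    Path out = new Path("/dft-output");         // output path used in the runs above (illustrative)
    if (fs.exists(out)) {
      fs.delete(out, true);                     // true = delete recursively
    }
  }
}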
You may want to start the job history server:
mr-jobhistory-daemon.sh start historyserver
Compile WordCount
ref: (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code)
Make sure you have set up the following in .bashrc (add these lines if they are not already there):
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=./
export CLASSPATH=$CLASSPATH:`hadoop classpath`:.:
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
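Next you need the WordCount.java source. The reference above points to the older mapred-API tutorial; a minimal version using the current org.apache.hadoop.mapreduce API, adapted from the Apache Hadoop MapReduce tutorial with comments added, is sketched below. The class name WordCount must match the class name used in the run command that follows.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; args[0] = input path, args[1] = output path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}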
Then compile WordCount.java
mkdir wordcount_classes
javac -d wordcount_classes WordCount.java
jar cvf wc.jar -C wordcount_classes/ .
Run the program
hadoop jar wc.jar WordCount /dft /dft-output2
Try other examples: (/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar)
For running Hadoop's pi example (source reference: https://code.google.com/p/mrs-mapreduce/source/browse/examples/pi/Pi.java), enter the command below:
hadoop jar hadoop-mapreduce-examples-3.2.1.jar pi 10 10
You can also check the web resource manager while the job is running to see what is going on:
http://<ip>:8088/cluster/nodes

Note: the last two arguments specify the number of map tasks and the number of samples per map, respectively; we can pick any values here. For running Hadoop's sudoku example, enter the command below:
bin/hadoop jar hadoop-mapreduce-examples-3.2.1.jar sudoku puzzle1.dta