# Hadoop: Setting up multiple node cluster
- Prerequisite
- Network setup
- Set HDFS data directory
- Rebuild HDFS
- Configure Hadoop for multiple nodes
- Run Hadoop in multi node mode
- Stop Hadoop
- Configuration variations
## Prerequisite

- Hadoop 1.2.1 is installed
- The sample code has run successfully
## Network setup

All nodes must have a unique host name. Otherwise, the namenode cannot recognize the datanodes.

If you are using Ubuntu on a memory stick, run

```
$ hostnamectl set-hostname <new-hostname>
```

to change your host name. You need to do this every time you start up Ubuntu because the change is not saved permanently.
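For example, to rename a node and confirm the change took effect (the hostname `mikea` here is just an illustration):

```bash
# Set a hypothetical hostname, then confirm the system picked it up
$ sudo hostnamectl set-hostname mikea
$ hostname
mikea
```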
We assign a unique IP address to each node. Edit /etc/hosts and add the IP addresses:

```
$ sudo gedit /etc/hosts
```

- All slave nodes must have the IP address of the master node:

```
192.168.1.127 suizei
```

- By default, each machine has an entry `127.0.1.1 its_own_hostname`. Comment that out and add the assigned IP address. For example:

```
#127.0.1.1 mikea
192.168.1.108 mikea
```
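As a quick sanity check (a minimal sketch; `suizei` and `mikea` are the example hostnames above), every node should be able to resolve and reach every other node:

```bash
# Run on each node: the hostnames should resolve to the addresses
# from /etc/hosts, not to 127.0.1.1
$ ping -c 1 suizei
$ ping -c 1 mikea
```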
## Set HDFS data directory

Move the HDFS data directory and the Hadoop temp directory to the home directory so that you can reset HDFS easily.

- Add this property to conf/core-site.xml:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/dfstemp</value>
</property>
```

- Add this property to conf/hdfs-site.xml:

```xml
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs</value>
</property>
```
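Once the cluster is running (see below), you can confirm the datanode is actually using the new location; a minimal check, assuming the `hadoop` user's home directory as configured above:

```bash
# The datanode keeps its block data under ${dfs.data.dir}/current,
# so this directory should be populated once data is written to HDFS
$ ls ~/dfs/current
```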
## Rebuild HDFS

Delete the content of HDFS completely and reformat it. (Don't do this for a production deployment!) Each time we change the Hadoop configuration, we must reset HDFS on all nodes: run the directory cleanup on every node, but run `bin/hadoop namenode -format` on the master only. (A scripted version is sketched after the commands.)

```
$ rm -rf ~/dfs
$ rm -rf ~/dfstemp
$ mkdir ~/dfs
$ mkdir ~/dfstemp
$ chmod 755 ~/dfs
$ chmod 755 ~/dfstemp
$ bin/hadoop namenode -format
```
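To avoid repeating this by hand on every machine, the reset can be scripted; a sketch assuming passwordless ssh as the `hadoop` user and hypothetical node names `suizei`, `mikea`, and `mikeb`:

```bash
#!/bin/bash
# Reset the HDFS directories on the master and every slave,
# then reformat the namenode from the master.
for host in suizei mikea mikeb; do
  ssh hadoop@"$host" 'rm -rf ~/dfs ~/dfstemp && mkdir ~/dfs ~/dfstemp && chmod 755 ~/dfs ~/dfstemp'
done
bin/hadoop namenode -format   # master only
```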
## Configure Hadoop for multiple nodes

- Choose the fastest machine as the master node. Here, assume its name is suizei.
- Change this property in conf/hdfs-site.xml. For the single node configuration, we set the number of replicas to 1 because we had only one node. Now that we have multiple nodes, set the number of replicas to the default (3).

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

- Edit conf/core-site.xml. Set the namenode (HDFS master) to the master node (suizei). For the single node configuration, it is set to localhost, so replace it with suizei. We can use any port number that doesn't conflict with other services, but all nodes must use the same port number.

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://suizei:54310</value>
</property>
```

- Edit conf/mapred-site.xml. Set the jobtracker (MapReduce master) to the master node (suizei). For the single node configuration, it is set to localhost, so replace it with suizei. We can use any port number that doesn't conflict with other services, but all nodes must use the same port number. (A quick way to check these settings from any node is shown after this list.)

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>suizei:54311</value>
</property>
```
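Once these files are copied to every node and the cluster is restarted, each node should reach HDFS through the master; a minimal check, assuming the `fs.default.name` setting above:

```bash
# Run on any slave: this goes through hdfs://suizei:54310, so a
# listing (even an empty one) proves the namenode is reachable
$ bin/hadoop dfs -ls /
```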
## Run Hadoop in multi node mode

| Master node | Slave node |
|---|---|
| Start the master node:<br>`$ bin/start-all.sh` | |
| Copy data to HDFS:<br>`$ bin/hadoop dfs -mkdir input`<br>`$ bin/hadoop dfs -put <your input files> input` | |
| See what's going on:<br>NameNode status: http://suizei:50070/ | |
| | Start the slave node:<br>`$ bin/hadoop-daemon.sh start datanode`<br>`$ bin/hadoop-daemon.sh start tasktracker` |
| See what's going on:<br>NameNode: http://suizei:50070/ | |
| Run the sample program:<br>`$ bin/hadoop jar playground/WeatherJob.jar WeatherJob input output` | |
| See what's going on:<br>JobTracker: http://suizei:50030/ | |
| Get the result:<br>`$ bin/hadoop dfs -getmerge output/ weatherJobOutput.txt` | |
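Besides the web UIs, progress can be watched from the command line; a small sketch using standard Hadoop 1.x commands (the paths match the sample run above):

```bash
# List the jobs currently known to the jobtracker
$ bin/hadoop job -list

# Inspect the reducer output files directly instead of using -getmerge
$ bin/hadoop dfs -cat output/part-*
```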
## Stop Hadoop

- On the master node:

```
$ bin/stop-all.sh
```

- On each slave node:

```
$ bin/hadoop-daemon.sh stop tasktracker
$ bin/hadoop-daemon.sh stop datanode
```
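If the slaves are reachable over ssh, this can be done from the master in one pass; a sketch with the same hypothetical hostnames as before, assuming Hadoop lives in ~/hadoop-1.2.1 on every node:

```bash
# Stop the worker daemons on each slave, then the master daemons
for host in mikea mikeb; do
  ssh hadoop@"$host" 'cd ~/hadoop-1.2.1 && bin/hadoop-daemon.sh stop tasktracker && bin/hadoop-daemon.sh stop datanode'
done
bin/stop-all.sh
```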
## Configuration variations

Tweak the Hadoop configuration and see how it affects performance.

### Block size

See how the HDFS block size affects performance.

- Stop Hadoop
- Add this property to conf/hdfs-site.xml on each node. It changes the HDFS block size from the default (64MiB) to 4MiB. (A quick way to verify the change is sketched after this list.)

```xml
<property>
  <name>dfs.block.size</name>
  <value>4194304</value>
</property>
```

- Rebuild HDFS
- Run Hadoop in multi node mode again
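To confirm the new block size applies to newly written files, ask HDFS how a file is split into blocks; a sketch assuming the data was copied to `input` under the `hadoop` user:

```bash
# 4 MiB expressed in bytes, as used for dfs.block.size above
$ echo $((4 * 1024 * 1024))
4194304

# Show the files and the blocks they occupy; with a 4MiB block size,
# files larger than 4MiB should now span multiple blocks
$ bin/hadoop fsck /user/hadoop/input -files -blocks
```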
### Number of concurrent tasks

A CPU usually has multiple cores. If a machine has 4 cores, for example, it is a good idea to run 4 tasks concurrently.

- Stop Hadoop
- Remove the property in conf/hdfs-site.xml which was added in the previous section:

```xml
<property>
  <name>dfs.block.size</name>
  <value>4194304</value>
</property>
```

- Add these properties to conf/mapred-site.xml on each node. They set the number of map tasks / reduce tasks that run simultaneously on the node. Each node can have different values. For example, an 8-core machine can run 8 tasks at the same time; a 2-core machine can run only 2. (A way to pick a value per node is sketched after this list.)

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```

- Rebuild HDFS
- Run Hadoop in multi node mode again
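Each node can derive its own value from its hardware; a minimal sketch for choosing the slot counts (Linux-specific, assuming the `nproc` utility is available):

```bash
# Number of cores on this node; a simple starting point is to use
# this value for mapred.tasktracker.map.tasks.maximum and
# mapred.tasktracker.reduce.tasks.maximum, then tune from there
$ nproc
```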