# Hadoop: Setting up multiple node cluster
- Prerequisite
- Network setup
- Set HDFS data directory
- Rebuild HDFS
- Configure Hadoop for multiple nodes
- Run Hadoop in multi node mode
- Stop Hadoop
- Configuration variations
## Prerequisite

- Hadoop 1.2.1 is installed
- The sample code has run successfully
## Network setup

All nodes must have a unique host name. Otherwise, the namenode cannot recognize the datanodes.

If you are using Ubuntu on a memory stick, run

```
$ hostnamectl set-hostname <new-hostname>
```

to change your host name. You need to do this every time you start up Ubuntu because the change is not saved permanently.
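For example, to rename a node and confirm the change took effect (the hostname `mikea` here is just an illustration):

```bash
# Set a hypothetical hostname, then confirm the system picked it up
$ sudo hostnamectl set-hostname mikea
$ hostname
mikea
```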
We assign a unique IP address to each node. Edit /etc/hosts and add the IP addresses:

```
$ sudo gedit /etc/hosts
```

- All slave nodes must have the IP address of the master node:

```
192.168.1.127 suizei
```

- By default, each machine has an entry `127.0.1.1 its_own_hostname`. Comment that out and add the assigned IP address. For example:

```
#127.0.1.1 mikea
192.168.1.108 mikea
```
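As a quick sanity check (a minimal sketch; `suizei` and `mikea` are the example hostnames above), every node should be able to resolve and reach every other node:

```bash
# Run on each node: the hostnames should resolve to the addresses
# from /etc/hosts, not to 127.0.1.1
$ ping -c 1 suizei
$ ping -c 1 mikea
```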
## Set HDFS data directory

Move the HDFS data directory and the Hadoop temp directory to the home directory so that you can reset HDFS easily.

- Add this property to conf/core-site.xml:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/dfstemp</value>
</property>
```

- Add this property to conf/hdfs-site.xml:

```xml
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs</value>
</property>
```
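Once the cluster is running (see below), you can confirm the datanode is actually using the new location; a minimal check, assuming the `hadoop` user's home directory as configured above:

```bash
# The datanode keeps its block data under ${dfs.data.dir}/current,
# so this directory should be populated once data is written to HDFS
$ ls ~/dfs/current
```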
## Rebuild HDFS

Delete the content of HDFS completely and reformat it. (Don't do this for a production deployment!) Each time we change the Hadoop configuration, we must reset HDFS on all nodes: run the directory cleanup on every node, but run `bin/hadoop namenode -format` on the master only. (A scripted version is sketched after the commands.)

```
$ rm -rf ~/dfs
$ rm -rf ~/dfstemp
$ mkdir ~/dfs
$ mkdir ~/dfstemp
$ chmod 755 ~/dfs
$ chmod 755 ~/dfstemp
$ bin/hadoop namenode -format
```
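To avoid repeating this by hand on every machine, the reset can be scripted; a sketch assuming passwordless ssh as the `hadoop` user and hypothetical node names `suizei`, `mikea`, and `mikeb`:

```bash
#!/bin/bash
# Reset the HDFS directories on the master and every slave,
# then reformat the namenode from the master.
for host in suizei mikea mikeb; do
  ssh hadoop@"$host" 'rm -rf ~/dfs ~/dfstemp && mkdir ~/dfs ~/dfstemp && chmod 755 ~/dfs ~/dfstemp'
done
bin/hadoop namenode -format   # master only
```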
## Configure Hadoop for multiple nodes

- Choose the fastest machine as the master node. Here, assume its name is suizei.
- Change this property in conf/hdfs-site.xml. For the single node configuration, we set the number of replicas to 1 because we had only one node. Now that we have multiple nodes, set the number of replicas to the default (3).

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

- Edit conf/core-site.xml. Set the namenode (HDFS master) to the master node (suizei). For the single node configuration, it is set to localhost, so replace it with suizei. We can use any port number that doesn't conflict with other services, but all nodes must use the same port number.

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://suizei:54310</value>
</property>
```

- Edit conf/mapred-site.xml. Set the jobtracker (MapReduce master) to the master node (suizei). For the single node configuration, it is set to localhost, so replace it with suizei. We can use any port number that doesn't conflict with other services, but all nodes must use the same port number. (A quick way to check these settings from any node is shown after this list.)

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>suizei:54311</value>
</property>
```
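Once these files are copied to every node and the cluster is restarted, each node should reach HDFS through the master; a minimal check, assuming the `fs.default.name` setting above:

```bash
# Run on any slave: this goes through hdfs://suizei:54310, so a
# listing (even an empty one) proves the namenode is reachable
$ bin/hadoop dfs -ls /
```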
## Run Hadoop in multi node mode

| Master node | Slave node |
|---|---|
| Start the master node:<br>`$ bin/start-all.sh` | |
| Copy data to HDFS:<br>`$ bin/hadoop dfs -mkdir input`<br>`$ bin/hadoop dfs -put <your input files> input` | |
| See what's going on:<br>NameNode status: http://suizei:50070/ | |
| | Start the slave node:<br>`$ bin/hadoop-daemon.sh start datanode`<br>`$ bin/hadoop-daemon.sh start tasktracker` |
| See what's going on:<br>NameNode: http://suizei:50070/ | |
| Run the sample program:<br>`$ bin/hadoop jar playground/WeatherJob.jar WeatherJob input output` | |
| See what's going on:<br>JobTracker: http://suizei:50030/ | |
| Get the result:<br>`$ bin/hadoop dfs -getmerge output/ weatherJobOutput.txt` | |
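Besides the web UIs, progress can be watched from the command line; a small sketch using standard Hadoop 1.x commands (the paths match the sample run above):

```bash
# List the jobs currently known to the jobtracker
$ bin/hadoop job -list

# Inspect the reducer output files directly instead of using -getmerge
$ bin/hadoop dfs -cat output/part-*
```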
## Stop Hadoop

- On the master node:

```
$ bin/stop-all.sh
```

- On each slave node:

```
$ bin/hadoop-daemon.sh stop tasktracker
$ bin/hadoop-daemon.sh stop datanode
```
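If the slaves are reachable over ssh, this can be done from the master in one pass; a sketch with the same hypothetical hostnames as before, assuming Hadoop lives in ~/hadoop-1.2.1 on every node:

```bash
# Stop the worker daemons on each slave, then the master daemons
for host in mikea mikeb; do
  ssh hadoop@"$host" 'cd ~/hadoop-1.2.1 && bin/hadoop-daemon.sh stop tasktracker && bin/hadoop-daemon.sh stop datanode'
done
bin/stop-all.sh
```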
## Configuration variations

Tweak the Hadoop configuration and see how it affects performance.

### Block size

See how the HDFS block size affects performance.

- Stop Hadoop
- Add this property to conf/hdfs-site.xml on each node. It changes the HDFS block size from the default (64MiB) to 4MiB. (A quick way to verify the change is sketched after this list.)

```xml
<property>
  <name>dfs.block.size</name>
  <value>4194304</value>
</property>
```

- Rebuild HDFS
- Run Hadoop in multi node mode again
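To confirm the new block size applies to newly written files, ask HDFS how a file is split into blocks; a sketch assuming the data was copied to `input` under the `hadoop` user:

```bash
# 4 MiB expressed in bytes, as used for dfs.block.size above
$ echo $((4 * 1024 * 1024))
4194304

# Show the files and the blocks they occupy; with a 4MiB block size,
# files larger than 4MiB should now span multiple blocks
$ bin/hadoop fsck /user/hadoop/input -files -blocks
```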
### Number of concurrent tasks

A CPU usually has multiple cores. If a machine has 4 cores, for example, it is a good idea to run 4 tasks concurrently.

- Stop Hadoop
- Remove the property in conf/hdfs-site.xml which was added in the previous section:

```xml
<property>
  <name>dfs.block.size</name>
  <value>4194304</value>
</property>
```

- Add these properties to conf/mapred-site.xml on each node. They set the number of map tasks / reduce tasks that run simultaneously on the node. Each node can have different values. For example, an 8-core machine can run 8 tasks at the same time; a 2-core machine can run only 2. (A way to pick a value per node is sketched after this list.)

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```

- Rebuild HDFS
- Run Hadoop in multi node mode again
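Each node can derive its own value from its hardware; a minimal sketch for choosing the slot counts (Linux-specific, assuming the `nproc` utility is available):

```bash
# Number of cores on this node; a simple starting point is to use
# this value for mapred.tasktracker.map.tasks.maximum and
# mapred.tasktracker.reduce.tasks.maximum, then tune from there
$ nproc
```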