
Setting up multiple node cluster


Prerequisite

  • Hadoop 1.2.1 is installed
  • The sample code has run successfully


Network setup

All nodes must have a unique host name; otherwise, the namenode cannot recognize the datanodes.
If you are running Ubuntu from a memory stick, run
$ hostnamectl set-hostname _new-hostname_
to change the host name. You need to do this every time you boot Ubuntu because the change is not saved permanently.

We assign a unique IP address to each node. Edit /etc/hosts and add the entries below.
$ sudo gedit /etc/hosts

  1. All slave nodes must have an entry for the master node's IP address, for example:
    192.168.1.127 suizei
  2. By default, each machine has an entry 127.0.1.1 its_own_hostname. Comment that line out and add the assigned IP address (a complete /etc/hosts is sketched after this list), for example:
#127.0.1.1    mikea
192.168.1.108 mikea
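
Putting both changes together, /etc/hosts on the slave mikea might look like the sketch below. The IP addresses and hostnames (suizei, mikea) are only the examples used above; substitute your own.

127.0.0.1     localhost
#127.0.1.1    mikea
192.168.1.108 mikea
192.168.1.127 suizei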


Set HDFS data directory

Move the HDFS data directory and the Hadoop temporary directory to your home directory so that you can reset HDFS easily.

  • Add the following property to conf/core-site.xml
<property>
   <name>hadoop.tmp.dir</name>
   <value>/home/hadoop/dfstemp</value>
</property>
  • Add the following property to conf/hdfs-site.xml
<property>
   <name>dfs.data.dir</name>
   <value>/home/hadoop/dfs</value>
</property>


Rebuild HDFS

Delete the contents of HDFS completely and reformat it. (Don't do this for a production deployment!)
Each time we change the Hadoop configuration, we must reset HDFS on all nodes.

$ rm -rf ~/dfs
$ rm -rf ~/dfstemp
$ mkdir ~/dfs
$ mkdir ~/dfstemp
$ chmod 755 ~/dfs
$ chmod 755 ~/dfstemp
$ bin/hadoop namenode -format
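
Because this reset must be repeated on every node after each configuration change, it can be convenient to wrap the steps above in a small script. The following is only a sketch: it assumes the directories from the previous section (~/dfs and ~/dfstemp) and that it is run from the Hadoop installation directory; the namenode -format step is needed on the master node only.

#!/bin/sh
# reset-hdfs.sh - wipe and recreate the local HDFS directories (sketch)
rm -rf ~/dfs ~/dfstemp
mkdir ~/dfs ~/dfstemp
chmod 755 ~/dfs ~/dfstemp
# Reformat the namenode (run this part on the master node only)
bin/hadoop namenode -format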


Configure Hadoop for multiple nodes

  • Choose the fastest machine as the master node. Here, we assume its hostname is suizei.
  • Change conf/hdfs-site.xml
    For the single node configuration, we set the number of replicas to 1 because we had only one node. Now that we have multiple nodes, set the number of replicas back to the default (3).
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>
  • Edit conf/core-site.xml
    Set the namenode (HDFS master) to the master node (suizei). For the single node configuration it is set to localhost, so replace that with suizei (the resulting file is sketched after this list).
    Any port number that does not conflict with other services can be used, but all nodes must use the same port number.
<property>
   <name>fs.default.name</name>
   <value>hdfs://suizei:54310</value>
</property>
  • Edit conf/mapred-site.xml
    Set the jobtracker (MapReduce master) to the master node (suizei). For the single node configuration it is set to localhost, so replace that with suizei.
    Any port number that does not conflict with other services can be used, but all nodes must use the same port number.
<property>
   <name>mapred.job.tracker</name>
   <value>suizei:54311</value>
</property>
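
For reference, after the edits above, conf/core-site.xml on every node contains both the temp directory from the earlier section and the namenode address. A sketch, using the example hostname suizei and port 54310:

<?xml version="1.0"?>
<configuration>
   <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/dfstemp</value>
   </property>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://suizei:54310</value>
   </property>
</configuration>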


Run Hadoop in multi-node mode

On the master node (suizei):

  1. Start the master node
$ bin/start-all.sh
  2. Copy data to HDFS
$ bin/hadoop dfs -mkdir input
$ bin/hadoop dfs -copyFromLocal weather_data input
  3. See what's going on
NameNode status - http://suizei:50070/
JobTracker status - http://suizei:50030/

On each slave node:

  4. Start the slave daemons
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
  5. See what's going on
NameNode status - http://suizei:50070/
JobTracker status - http://suizei:50030/

Back on the master node:

  6. Run the sample program
$ bin/hadoop jar playground/WeatherJob.jar WeatherJob input output
  7. See what's going on
JobTracker status - http://suizei:50030/
  8. Get the result
$ bin/hadoop dfs -getmerge output/ weatherJobOutput.txt
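
You can also confirm from the command line (on the master node) that all datanodes have joined the cluster; bin/hadoop dfsadmin -report prints the number of live datanodes together with their capacity and usage.

$ bin/hadoop dfsadmin -report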


Stop Hadoop

  • master node
    $ bin/stop-all.sh
  • slave node (or stop the daemons remotely from the master, as sketched below)
    $ bin/hadoop-daemon.sh stop tasktracker
    $ bin/hadoop-daemon.sh stop datanode
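
If you have several slaves, a small loop run from the master saves logging in to each machine. This is only a sketch: it assumes passwordless ssh from the master to each slave and that Hadoop is installed at the same path on every node (~/hadoop-1.2.1 here is an assumed location); replace mikea with your own slave hostnames.

#!/bin/sh
# stop-slaves.sh - stop the slave daemons from the master (sketch)
# Assumes passwordless ssh and the same Hadoop install path on every slave.
for slave in mikea; do
    ssh "$slave" "cd ~/hadoop-1.2.1 && bin/hadoop-daemon.sh stop tasktracker && bin/hadoop-daemon.sh stop datanode"
done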



Configuration variations

Tweak Hadoop configurations and see how they affect performance.

Run Hadoop with smaller block size

See how the HDFS block size affects performance.

  • Stop Hadoop
  • Add the following property to conf/hdfs-site.xml on each node
    This changes the HDFS block size from the default (64 MiB) to 4 MiB (4194304 bytes).
<property>
   <name>dfs.block.size</name>
   <value>4194304</value>
</property>
  • Rebuild HDFS
  • Run Hadoop in multi-node mode again (a way to verify the new block size is sketched below)
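
To confirm that the smaller block size took effect, inspect the blocks of a file after the data has been copied back into HDFS. This is a sketch; the path assumes the input directory created earlier ends up under /user/hadoop (the default HDFS home directory for a user named hadoop).

$ bin/hadoop fsck /user/hadoop/input -files -blocks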


Run Hadoop with more reducers / larger max map tasks

A CPU usually has multiple cores. If a machine has 4 cores, for example, it is a good idea to run 4 tasks concurrently.

  • Stop Hadoop
  • Remove the property from conf/hdfs-site.xml that was added in the previous section.
<property>
   <name>dfs.block.size</name>
   <value>4194304</value>
</property>
  • Add the following properties to conf/mapred-site.xml on each node
    mapred.tasktracker.map.tasks.maximum is the maximum number of map tasks that run simultaneously on a node, so each node can have its own value: an 8-core machine can run 8 tasks at the same time, while a 2-core machine can run only 2. mapred.reduce.tasks sets the default number of reduce tasks for a job.
<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>4</value>
</property>
<property>
   <name>mapred.reduce.tasks</name>
   <value>4</value>
</property>
  • Rebuild HDFS
  • Run Hadoop in multi-node mode again (the resulting conf/mapred-site.xml is sketched below)
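
For reference, conf/mapred-site.xml on a 4-core node, combining the jobtracker address from earlier with the task settings above, might look like this sketch (suizei and the value 4 are the examples used in this guide):

<?xml version="1.0"?>
<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>suizei:54311</value>
   </property>
   <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
   </property>
   <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
   </property>
</configuration>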

