Hadoop installation (Multi Node Cluster)
- Three nodes (min 1 GB of RAM per node)
- Disk space (min 30 GB per node)
- Ubuntu 14.04 or 16.04
- Hadoop-2.6.5
- Java 8
- SSH access
- Install VirtualBox https://www.virtualbox.org/wiki/Downloads
- Prepare three instances (for example Ubuntu 14.04 LTS); the appropriate ISO images can be found at the URLs below:
http://releases.ubuntu.com/14.04/ubuntu-14.04.6-server-amd64.iso
http://releases.ubuntu.com/16.04/ubuntu-16.04.6-server-amd64.iso
- While preparing each instance, add a second "Host-only Adapter" in the Network settings. You may run into the second adapter not being recognized; to resolve this, please read this article. A minimal interface configuration sketch is shown after this list.
- Choose "OpenSSH server" during installation to have SSH access to the instance
- On all instances, set up the hosts file with FQDN entries so that local DNS names resolve on each node (as explained here), and also remove the line 127.0.1.1 <node name>
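A minimal /etc/hosts sketch for a three-node cluster is shown below; the 192.168.56.x addresses are examples (use the host-only addresses of your VMs), and node1/node2/node3 match the host names used throughout this guide.
# /etc/hosts on every node (example addresses)
192.168.56.101 node1
192.168.56.102 node2
192.168.56.103 node3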
- Disable firewall
sudo service ufw stop && sudo ufw disable
- Before installing any applications or software, please make sure your list of packages from all repositories and PPAs is up to date, and if not, update it using this command:
sudo apt-get update && sudo apt-get dist-upgrade -y
- Install Oracle Java JDK 1.8
sudo wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz -P /opt
#sudo wget http://ip-172-30-2-56.eu-west-1.compute.internal:82/jdk-8u131-linux-x64.tar.gz -P /opt
sudo mkdir -p /usr/lib/jvm
sudo tar -xf /opt/jdk-8u131-linux-x64.tar.gz -C /usr/lib/jvm
sudo ln -s /usr/lib/jvm/jdk1.8.0_131 /usr/lib/jvm/default-java
sudo update-alternatives --install /usr/bin/ControlPanel ControlPanel /usr/lib/jvm/default-java/bin/ControlPanel 1
sudo update-alternatives --install /usr/bin/extcheck extcheck /usr/lib/jvm/default-java/bin/extcheck 1
sudo update-alternatives --install /usr/bin/idlj idlj /usr/lib/jvm/default-java/bin/idlj 1
sudo update-alternatives --install /usr/bin/jar jar /usr/lib/jvm/default-java/bin/jar 1
sudo update-alternatives --install /usr/bin/jarsigner jarsigner /usr/lib/jvm/default-java/bin/jarsigner 1
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/default-java/bin/java 1
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/default-java/bin/javac 1
sudo update-alternatives --install /usr/bin/javadoc javadoc /usr/lib/jvm/default-java/bin/javadoc 1
sudo update-alternatives --install /usr/bin/javafxpackager javafxpackager /usr/lib/jvm/default-java/bin/javafxpackager 1
sudo update-alternatives --install /usr/bin/javah javah /usr/lib/jvm/default-java/bin/javah 1
sudo update-alternatives --install /usr/bin/javap javap /usr/lib/jvm/default-java/bin/javap 1
sudo update-alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/default-java/bin/javaws 1
sudo update-alternatives --install /usr/bin/jcmd jcmd /usr/lib/jvm/default-java/bin/jcmd 1
sudo update-alternatives --install /usr/bin/jconsole jconsole /usr/lib/jvm/default-java/bin/jconsole 1
sudo update-alternatives --install /usr/bin/jcontrol jcontrol /usr/lib/jvm/default-java/bin/jcontrol 1
sudo update-alternatives --install /usr/bin/jdb jdb /usr/lib/jvm/default-java/bin/jdb 1
sudo update-alternatives --install /usr/bin/jdeps jdeps /usr/lib/jvm/default-java/bin/jdeps 1
sudo update-alternatives --install /usr/bin/jhat jhat /usr/lib/jvm/default-java/bin/jhat 1
sudo update-alternatives --install /usr/bin/jinfo jinfo /usr/lib/jvm/default-java/bin/jinfo 1
sudo update-alternatives --install /usr/bin/jjs jjs /usr/lib/jvm/default-java/bin/jjs 1
sudo update-alternatives --install /usr/bin/jmap jmap /usr/lib/jvm/default-java/bin/jmap 1
sudo update-alternatives --install /usr/bin/jmc jmc /usr/lib/jvm/default-java/bin/jmc 1
sudo update-alternatives --install /usr/bin/jps jps /usr/lib/jvm/default-java/bin/jps 1
sudo update-alternatives --install /usr/bin/jrunscript jrunscript /usr/lib/jvm/default-java/bin/jrunscript 1
sudo update-alternatives --install /usr/bin/jsadebugd jsadebugd /usr/lib/jvm/default-java/bin/jsadebugd 1
sudo update-alternatives --install /usr/bin/jstack jstack /usr/lib/jvm/default-java/bin/jstack 1
sudo update-alternatives --install /usr/bin/jstat jstat /usr/lib/jvm/default-java/bin/jstat 1
sudo update-alternatives --install /usr/bin/jstatd jstatd /usr/lib/jvm/default-java/bin/jstatd 1
sudo update-alternatives --install /usr/bin/jvisualvm jvisualvm /usr/lib/jvm/default-java/bin/jvisualvm 1
sudo update-alternatives --install /usr/bin/keytool keytool /usr/lib/jvm/default-java/bin/keytool 1
sudo update-alternatives --install /usr/bin/native2ascii native2ascii /usr/lib/jvm/default-java/bin/native2ascii 1
sudo update-alternatives --install /usr/bin/orbd orbd /usr/lib/jvm/default-java/bin/orbd 1
sudo update-alternatives --install /usr/bin/pack200 pack200 /usr/lib/jvm/default-java/bin/pack200 1
sudo update-alternatives --install /usr/bin/policytool policytool /usr/lib/jvm/default-java/bin/policytool 1
sudo update-alternatives --install /usr/bin/rmic rmic /usr/lib/jvm/default-java/bin/rmic 1
sudo update-alternatives --install /usr/bin/rmid rmid /usr/lib/jvm/default-java/bin/rmid 1
sudo update-alternatives --install /usr/bin/rmiregistry rmiregistry /usr/lib/jvm/default-java/bin/rmiregistry 1
sudo update-alternatives --install /usr/bin/schemagen schemagen /usr/lib/jvm/default-java/bin/schemagen 1
sudo update-alternatives --install /usr/bin/serialver serialver /usr/lib/jvm/default-java/bin/serialver 1
sudo update-alternatives --install /usr/bin/servertool servertool /usr/lib/jvm/default-java/bin/servertool 1
sudo update-alternatives --install /usr/bin/tnameserv tnameserv /usr/lib/jvm/default-java/bin/tnameserv 1
sudo update-alternatives --install /usr/bin/unpack200 unpack200 /usr/lib/jvm/default-java/bin/unpack200 1
sudo update-alternatives --install /usr/bin/wsgen wsgen /usr/lib/jvm/default-java/bin/wsgen 1
sudo update-alternatives --install /usr/bin/wsimport wsimport /usr/lib/jvm/default-java/bin/wsimport 1
sudo update-alternatives --install /usr/bin/xjc xjc /usr/lib/jvm/default-java/bin/xjc 1
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131/
- Check Java configuration
java -version
readlink -f /usr/bin/java | sed "s:bin/java::"
Copy the Java path (for example JAVA_HOME="/usr/lib/jvm/jdk1.8.0_131") into /etc/environment:
sudo sh -c 'echo JAVA_HOME="/usr/lib/jvm/jdk1.8.0_131" >> /etc/environment'
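After re-logging in, you can confirm that the variable from /etc/environment is picked up (the expected value matches the path used above):
echo $JAVA_HOME
# expected output: /usr/lib/jvm/jdk1.8.0_131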
- After that, create two more VMs (slaves) from the existing one (master) using VirtualBox clone
- On the first node, generate an SSH key for password-less login between the master and slave nodes
#Generate an ssh key for the **ubuntu** user
ssh-keygen -t rsa -P ""
#Authorize the key to enable password-less ssh
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
chmod 600 /home/ubuntu/.ssh/authorized_keys
#Copy this key to node2 and node3 to enable password-less ssh
ssh-copy-id -i ~/.ssh/id_rsa.pub node2
ssh-copy-id -i ~/.ssh/id_rsa.pub node3
#Make sure you can do a password-less ssh using the following commands.
ssh node2
ssh node3
- Next, visit the Apache Hadoop releases page to find the most recent stable release and follow the link to the binary for that release.
- Download version 2.6.5 and install the Hadoop binaries on the Master and Slave nodes
cd /home/ubuntu
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
tar xvf hadoop-2.6.5.tar.gz
mv hadoop-2.6.5 hadoop
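To make sure the archive unpacked correctly, you can call the hadoop binary by its full path (assuming JAVA_HOME is set as above, it should report version 2.6.5):
/home/ubuntu/hadoop/bin/hadoop version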
- Set up the Hadoop environment on the Master and Slave nodes
sudo su
echo 'HADOOP_HOME=/home/ubuntu/hadoop' >> /etc/environment
echo 'HADOOP_CONF_DIR=/home/ubuntu/hadoop/etc/hadoop' >> /etc/environment
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> /etc/bash.bashrc
exit
- To check the environment variables, log in again and run the commands below
echo $HADOOP_HOME
echo $HADOOP_CONF_DIR
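With the entries above in place, the expected output is:
/home/ubuntu/hadoop
/home/ubuntu/hadoop/etc/hadoop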
- Update Configuration Files
Add/update core-site.xml on the Master and Slave nodes with the following options. All nodes must use the same value for the fs.defaultFS property, and it must point to the master node only.
vi /home/ubuntu/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/dfs/tmp</value>
<description>Temporary Directory</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:54310</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>
- Add/update mapred-site.xml on the Master node with the following options.
vi /home/ubuntu/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>node1:54311</value>
<description>The host and port that the MapReduce job tracker runs</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node1:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.https.address</name>
<value>node1:19890</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The framework for running mapreduce jobs</description>
</property>
</configuration>
- Add/update mapred-site.xml on the Slave nodes with the following options.
vi /home/ubuntu/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>node1:54311</value>
<description>The host and port that the MapReduce job tracker runs</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The framework for running mapreduce jobs</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node1:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.https.address</name>
<value>node1:19890</value>
</property>
</configuration>
- Add/update hdfs-site.xml on the Master and Slave nodes. We will be adding the following three entries to the file.
dfs.replication – Here I am using a replication factor of 2. That means every file stored in HDFS will have one additional replica on another node in the cluster.
dfs.namenode.name.dir – This directory is used by the NameNode to store its metadata files. Here we manually create the directory /dfs/nn on the master node and use that location for this configuration.
dfs.datanode.data.dir – This directory is used by the DataNode to store HDFS data blocks. Here we manually create the directory /dfs/dn on every node and use that location for this configuration.
Create the NameNode directory on the master node, and the DataNode and temp directories on each node
sudo mkdir -p /dfs/nn
sudo mkdir -p /dfs/dn
sudo mkdir -p /dfs/tmp
sudo chown -R ubuntu:ubuntu /dfs/nn
sudo chown -R ubuntu:ubuntu /dfs/dn
sudo chown -R ubuntu:ubuntu /dfs/tmp
vi /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/dfs/nn</value>
<description>Determines where on the local filesystem the DFS name node should store the name table(fsimage)</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/dfs/dn</value>
<description>Determines where on the local filesystem an DFS data node should store its blocks</description>
</property>
</configuration>
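Since core-site.xml and hdfs-site.xml are identical on every node, you can optionally edit them on the master and copy them to the slaves instead of editing each node by hand; a sketch, assuming the same paths on every node and the password-less SSH set up earlier:
scp /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml node2:/home/ubuntu/hadoop/etc/hadoop/
scp /home/ubuntu/hadoop/etc/hadoop/core-site.xml /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml node3:/home/ubuntu/hadoop/etc/hadoop/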
- Add yarn-site.xml on Master node
vi /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node1:8033</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>2764800</value>
</property>
</configuration>
- Add yarn-site.xml on Slave nodes
vi /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node1:8033</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://node1:19888/jobhistory/logs/</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>2764800</value>
</property>
</configuration>
- Add the slaves file on the Master node only. Add the host names or IP addresses of the master and all slave nodes (one per line). If the file has an entry for localhost, you can remove it. This file is just a helper file used by the Hadoop scripts to start the appropriate services on the master and slave nodes.
vi /home/ubuntu/hadoop/etc/hadoop/slaves
node1
node2
node3
- Format the NameNode. Before starting the cluster, we need to format the NameNode. Use the following command on the master node only:
hdfs namenode -format
- Run the following command on the master node to start HDFS. Watch the output to make sure it starts the DataNode on the slave nodes one by one
/home/ubuntu/hadoop/sbin/start-dfs.sh
- Check the running daemons with jps. The output should list NameNode, SecondaryNameNode, and DataNode on the master node, and DataNode on all slave nodes. If you don't see the expected output, review the log files listed in the Troubleshooting section.
su - ubuntu
jps
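For example, on the master node the jps output should look similar to the following (the PIDs shown here are made up and will differ on your machines):
2401 NameNode
2503 DataNode
2676 SecondaryNameNode
2911 Jps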
- If you need to stop the HDFS daemons, you can always do so with:
/home/ubuntu/hadoop/sbin/stop-dfs.sh
- Run the following commands to start the YARN framework and the MapReduce JobHistory server.
/home/ubuntu/hadoop/sbin/start-yarn.sh
/home/ubuntu/hadoop/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
To validate success, run the jps command again on the master and slave nodes. The output should list ResourceManager, NodeManager, and JobHistoryServer on the master node, and NodeManager on all slave nodes. If you don't see the expected output, review the log files listed in the Troubleshooting section.
- If you need to stop the YARN services:
/home/ubuntu/hadoop/sbin/stop-yarn.sh
/home/ubuntu/hadoop/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver
If you wish to monitor Hadoop MapReduce and HDFS, you can use the web UIs of the ResourceManager and NameNode, which are commonly used by Hadoop administrators. Open your browser and visit the following links from any node.
ResourceManager web UI – http://node1:8088
NameNode web UI – http://node1:50070
By default, all Hadoop logs are stored in $HADOOP_HOME/logs. For any installation issue, these logs will help you troubleshoot the cluster.
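As an optional smoke test of the running cluster, you can check that all DataNodes have registered with the NameNode and run one of the bundled MapReduce examples (the jar path assumes the Hadoop 2.6.5 layout used above):
hdfs dfsadmin -report
hadoop jar /home/ubuntu/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 2 10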