Creating Hadoop Cluster on Ubuntu

On this page, we describe how to create a 4-node Hadoop cluster (one master, one secondary NameNode, and two slaves).

1. Preparing the environment on Ubuntu

1.a. Update /etc/hosts

@ on ALL nodes, add this to /etc/hosts:

#master
192.168.10.135 master
#secondary name node
192.168.10.136 secondarymaster
#slave 1
192.168.10.140 slave1
#slave 2
192.168.10.141 slave2

Replace the IPs with the addresses of your own nodes.
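
To confirm the names resolve, a quick check from any node:

ping -c 1 master
ping -c 1 secondarymaster
ping -c 1 slave1
ping -c 1 slave2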

1.b. Create password-less SSH login between nodes

@on MASTER node:

# GENERATE DSA KEY-PAIR
ssh-keygen -t dsa -f ~/.ssh/id_dsa
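# NOTE: newer OpenSSH releases disable DSA keys by default; if DSA fails,
# generate an RSA key instead and use id_rsa.pub in the next step:
# ssh-keygen -t rsa -f ~/.ssh/id_rsa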

# MAKE THE KEYPAIR TRUSTED
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

# COPY THIS KEY TO ALL OTHER NODES:
scp -r ~/.ssh  ubuntu@secondarymaster:~/
scp -r ~/.ssh  ubuntu@slave1:~/
scp -r ~/.ssh  ubuntu@slave2:~/
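
Each of the following should now print the remote hostname without prompting for a password:

ssh secondarymaster hostname
ssh slave1 hostname
ssh slave2 hostname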

1.c. Install Java

@ on ALL nodes

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get -y install oracle-java8-installer
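
Note: the webupd8team PPA has since been discontinued. If the commands above fail, OpenJDK 8 works as well; point JAVA_HOME in steps 1.d and 3.e at it instead (typically /usr/lib/jvm/java-8-openjdk-amd64):

sudo apt-get update && sudo apt-get -y install openjdk-8-jdk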

1.d. Setup environmental variables

@ on ALL nodes

echo '
#HADOOP VARIABLES START
export HADOOP_PREFIX=/home/ubuntu/hadoop
export HADOOP_HOME=/home/ubuntu/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_PREFIX}/lib/native"
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
#HADOOP VARIABLES END
' >> ~/.bashrc
source ~/.bashrc
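
To confirm the variables took effect:

echo $HADOOP_HOME    # should print /home/ubuntu/hadoop
java -version        # should report version 1.8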

2. Download and Install Hadoop 2.7

@ on ALL nodes

#Download
wget http://apache.mirror.gtcomm.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
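#If the mirror above is unavailable, old releases are kept on the Apache
#archive (path assumed unchanged):
#wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz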

#Extract
tar -xzvf ./hadoop-2.7.1.tar.gz

#Rename to target directory of /home/ubuntu/hadoop
mv hadoop-2.7.1 hadoop

#Create directory for HDFS filesystem
mkdir ~/hdfstmp
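
With the variables from step 1.d in place, the hadoop command should now be on the PATH:

hadoop version    # should report Hadoop 2.7.1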

3. Setting up Hadoop for the first run

3.a. Edit core-site.xml

@ on ALL nodes (this and the following config files are in ~/hadoop/etc/hadoop/):

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/ubuntu/hdfstmp</value>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:8020</value>
</property>

</configuration>

3.b. Edit hdfs-site.xml

@ on ALL nodes

<configuration>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>secondarymaster:50090</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/ubuntu/hdfstmp/dfs/data</value>
  <final>true</final>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/ubuntu/hdfstmp/dfs/name</value>
  <final>true</final>
</property>

</configuration>

3.c. Edit mapred-site.xml

@ on ALL nodes
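
Note that Hadoop 2.7 ships only a template for this file, so create it first:

cp ~/hadoop/etc/hadoop/mapred-site.xml.template ~/hadoop/etc/hadoop/mapred-site.xml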

<configuration>
<!-- MapReduce runs on YARN; no MRv1 JobTracker settings are needed -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

3.d. Edit yarn-site.xml

@ on ALL nodes

<configuration>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
 
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

</configuration>

3.e. Edit hadoop-env.sh

@ on ALL nodes change from:

export JAVA_HOME=${JAVA_HOME}

to:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
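
Tip: rather than editing the same files on every node, you can make the edits once on the master and push the configuration out, using the password-less SSH set up in step 1.b:

for node in secondarymaster slave1 slave2; do
  scp ~/hadoop/etc/hadoop/*.xml ~/hadoop/etc/hadoop/hadoop-env.sh ubuntu@$node:~/hadoop/etc/hadoop/
done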

3.f. Edit slaves

@ on MASTER and SECONDARYMASTER, list the worker hostnames in ~/hadoop/etc/hadoop/slaves:

slave1
slave2
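
Before the first run, format HDFS and start the daemons (a minimal sketch, run on the master; this assumes the paths configured above):

# FORMAT THE NAMENODE (first run only -- this erases any existing HDFS data)
hdfs namenode -format

# START HDFS AND YARN
start-dfs.sh
start-yarn.sh

# LIST THE RUNNING JAVA DAEMONS ON EACH NODE
jps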

That's it! You've set up the cluster for Hadoop to run on. On the next page, we will discuss how to test and run code on the cluster.
