Hadoop Installation (Multinode)

(Adapted from https://docs.google.com/document/d/1L2XFx3BJYri5lAu-CB8pBJn6TsiLPrgfKVx7KYHtSsU/edit?usp=share_link and https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/, combined with https://medium.com/@jootorres_11979/how-to-set-up-a-hadoop-3-2-1-multi-node-cluster-on-ubuntu-18-04-2-nodes-567ca44a3b12)

  1. Create the hosts file on both nodes.

Set up the hostname-to-IP mapping by creating a hosts file that lists all nodes, on both the master and the slave:

sudo vi /etc/hosts

Add the following lines to the file:

10.3.135.170  server1  # assume this one is your master ip
10.3.135.169  server2  # assume  this one is your slave ip

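To verify the mapping, you can resolve both names from each machine before continuing; this is just a sanity check using the entries above.

getent hosts server1 server2   # should print the two IP addresses above
ping -c 1 server2              # run on the master; ping server1 from the slave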

Create a user for Hadoop on both nodes and set up passwordless SSH

We recommend creating a normal (non-root) account for running Hadoop. Create the account with the following command:

sudo adduser hadoop

Set password for the hadoop user

sudo passwd hadoop

Once created, make the user a sudoer:

sudo usermod -aG sudo hadoop
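The remaining steps should be run as the hadoop user on each node. One way to switch to it (assuming the account created above):

su - hadoop          # or: sudo -iu hadoop
whoami               # should print: hadoop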

Create an SSH key and enable passwordless login

After creating the account, you also need to set up key-based SSH to the hadoop account itself. To do this, execute the following commands on the master node:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Now, SSH to localhost as the hadoop user. This should not ask for a password, but the first time it will prompt you to add the host's RSA key to the list of known hosts.
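A quick check, run as the hadoop user (a minimal example):

ssh localhost 'hostname'   # should print the node name without asking for a password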

You should be able to SSH to the other nodes as well.

First, copy the SSH key to every node, including the master itself (10.3.135.170). Assume your slave node is 10.3.135.169. For example:

ssh-copy-id -i ~/.ssh/id_rsa.pub   hadoop@server2
ssh-copy-id -i ~/.ssh/id_rsa.pub   hadoop@server1

Then you should be able to SSH to your slave without a password, e.g.:

ssh server2
  2. Requirement: Java

You have to install Java first. Before that, update both VMs with sudo apt update. Then install Java 8 on both nodes:

sudo apt install openjdk-8-jdk
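You can confirm the installation and note where Java lives; the path shown in the comment is the usual one on Ubuntu amd64 and may differ on your system.

java -version               # should report openjdk version "1.8.0_..."
readlink -f $(which java)   # e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java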
 
  3. Install Hadoop

Download the tarball. Assume we use version 3.2.1:
wget  https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

(you can check the latest one in https://downloads.apache.org/hadoop/core/)

tar xzf hadoop-3.2.1.tar.gz
mv hadoop-3.2.1 hadoop

Note: Current version is available at https://hadoop.apache.org/releases.html (3.4.1 as of Oct, 2024)

  4. Set up the search path for Hadoop on the master and slaves

Edit .bashrc

vi ~/.bashrc

Append the following environment variables:

export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# set CLASSPATH after PATH so that the hadoop command used below is found
export CLASSPATH=./
export CLASSPATH=$CLASSPATH:`hadoop classpath`:.:

Apply the changes:

source ~/.bashrc
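To confirm the environment is picked up (in the same shell after sourcing, or in a new login shell):

echo $HADOOP_HOME    # should print /home/hadoop/hadoop
echo $JAVA_HOME
hadoop version       # should report Hadoop 3.2.1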

Set up the Java path on both the master and all slaves.

Edit the Hadoop environment file and add JAVA_HOME:

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
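If you prefer not to call jrunscript inside hadoop-env.sh, you can hard-code the path instead. On Ubuntu with the openjdk-8-jdk package it is typically the directory below, but verify it on your machine.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64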

Next, set up the Hadoop configuration files on the master:

cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml and add the following. This points the default file system to the NameNode on the master.

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://server1:9000</value>
</property>

</configuration>
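fs.default.name is the older alias of fs.defaultFS; either name works. After saving the file you can confirm the value Hadoop sees:

hdfs getconf -confKey fs.defaultFS   # should print hdfs://server1:9000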

Edit hdfs-site.xml. The replication factor is 2 for the two nodes; the other properties set the storage locations for the NameNode and DataNode.

<configuration>
<property>
 <name>dfs.replication</name>
 <value>2</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
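The NameNode format and the DataNode normally create these directories themselves, but it does no harm to create them up front on both nodes, matching the paths above:

mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode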

Edit mapred-site.xml

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>

 <property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
 </property>

 <property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
 </property>

 <property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
 </property>

 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>

 <!-- The properties below are optional. -->
 <property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
 </property>

 <property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
 </property>

 <property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024M</value>
 </property>

 <property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2560M</value>
 </property>

</configuration>

On the master, add the worker nodes to

vi $HADOOP_HOME/etc/hadoop/workers

Add the name of your slave node (see the example below).

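For example, if both server1 and server2 are to run DataNodes (which matches the replication factor of 2 and the two live nodes shown later), the workers file would contain:

server1
server2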

Copy the master configuration to all slaves.

scp $HADOOP_HOME/etc/hadoop/* server2:$HADOOP_HOME/etc/hadoop/

On the master, format the NameNode.

 hdfs namenode -format
  5. Start the HDFS file system on the master.
start-dfs.sh

Run jps to check.


On the slave, also run jps.

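Roughly, with only HDFS started (YARN comes later) and assuming server1 is also listed in the workers file, you should see:

jps   # on the master: NameNode, SecondaryNameNode, DataNode (plus Jps itself)
jps   # on the slave:  DataNode (plus Jps itself)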

Check it in a browser on port 9870.

Since the node is behind a firewall, we will open the firewall port on the master.


sudo ufw allow 9870/tcp

Then create an SSH tunnel so that the port can be reached from the web browser on your local computer (replace <user>@<gateway-host> with your own login on the machine you tunnel through):

ssh -N -L 9870:10.3.135.170:9870 <user>@<gateway-host> -vv

Go to localhost:9870. You should see 2 live nodes.
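If you cannot reach the web UI, the same information is available from the command line on the master:

hdfs dfsadmin -report   # the "Live datanodes" count should be 2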


Next, change yarn-site.xml on the slave node so that it points to the master node.

vi hadoop/etc/hadoop/yarn-site.xml
<configuration>
 <property>
  <name>yarn.resourcemanager.hostname</name>
 <value>server1</value>
</property>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
</property>

</configuration>

You may optionally change other yarn-site.xml settings, for example as shown here:

https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/#sample-configuration-for-2gb-nodes

On the master node, start YARN. Type

start-yarn.sh

Next, check using jps. You should have five processes: NameNode, ResourceManager, SecondaryNameNode, NodeManager, and DataNode. If not, stop YARN using stop-yarn.sh and check the error log.

If you get stuck when running MapReduce due to a lock problem, look at the stack dump of the ResourceManager at http://localhost:8088/stacks (port 8088).

Check core-site.xml on the slave node. It must point to the master's NameNode, i.e., the same value used on the master (the scp of the configuration above already sets this):

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://server1:9000</value>
</property>

</configuration>

If you get stuck, e.g., MapReduce is not progressing, it may be a memory problem. Check your memory in the ResourceManager. See this post:

https://community.cloudera.com/t5/Support-Questions/map-reduce-stuck-at-0/m-p/106777


Note:

Error logs are at /home/hadoop/hadoop/logs/, where xxxxx in the file name below is the daemon name, e.g., datanode. Check the exact file names with the ls command.

 tail  /home/hadoop/hadoop/logs/hadoop-hadoop-xxxxx.log

If you run jps on the slave node, you should see two processes: DataNode and NodeManager.


If that succeeds: since the node is behind a firewall, we will open the firewall port on the master. YARN is at port 8088.

sudo ufw allow 8088/tcp

Then create an SSH tunnel so that the port can be reached from the web browser on your local computer (again, replace <user>@<gateway-host> with your own login on the machine you tunnel through):

ssh -N -L 8088:10.3.135.170:8088 <user>@<gateway-host> -vv

Go to localhost:8088. You should see two active nodes now.


Check from the command line:

yarn node -list


yarn application -list


Testing a cluster

Get the data set. Make a directory and download the data:

mkdir txtdata
cd txtdata
wget http://corpus.canterbury.ac.nz/resources/cantrbry.zip
unzip cantrbry.zip

Check it


Copy the data to HDFS. Create a directory on HDFS and copy the files into it:


hdfs dfs -mkdir /txtdata

hdfs dfs -put txtdata/*  /txtdata

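You can verify the upload from the command line as well:

hdfs dfs -ls /txtdata   # should list the files extracted from cantrbry.zip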

To submit the job,

yarn jar  ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "/txtdata/*" output

Then check YARN in the web browser. You should see 1 application submitted.


To rerun the job, first remove the previous output directory. It is not empty, so use -rm -r rather than -rmdir:

hdfs dfs -rm -r output

Inspect the output. Fetch the result from HDFS and view it with more, as in the example below.

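For example (part-r-00000 is the usual reducer output file name; adjust if your job produced several parts):

hdfs dfs -ls output
hdfs dfs -cat output/part-r-00000 | more    # view directly from HDFS
hdfs dfs -get output ./wordcount_output     # or copy to the local file system
more ./wordcount_output/part-r-00000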