Hadoop Installation (Multinode)
(Adapted from https://docs.google.com/document/d/1L2XFx3BJYri5lAu-CB8pBJn6TsiLPrgfKVx7KYHtSsU/edit?usp=share_link and combined with https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/)
- Create the hosts file on both nodes.
Set up the hostname and IP mapping: create a hosts file containing all nodes on both the master and the slave.
sudo vi /etc/hosts
Add the following lines to the file.
10.3.135.170 server1 # assume this one is your master ip
10.3.135.169 server2 # assume this one is your slave ip
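To confirm that the names resolve, ping each node by name from the other (hostnames as defined above):
ping -c 1 server2   # run on the master
ping -c 1 server1   # run on the slave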

- Create a user for Hadoop on both nodes and set up passwordless ssh.
We recommend running Hadoop under a normal (non-root) account. Create the account with the following command.
sudo adduser hadoop
Set a password for the hadoop user.
sudo passwd hadoop
Once the user is created, add it to the sudo group.
sudo usermod -aG sudo hadoop
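The remaining steps assume you are working as the hadoop user, so switch to that account on each node:
su - hadoop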
Create an ssh key and make ssh passwordless.
After creating the account, you also need to set up key-based ssh to the account itself. On the master node, execute the following commands.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Now ssh to localhost as the hadoop user. This should not ask for a password, but the first time it will prompt to add the host key to the list of known hosts.
You should be able to ssh to the other nodes as well. Copy the ssh key to every node, including the master itself (10.3.135.170). Assuming your slave node is 10.3.135.169, use the following commands. E.g.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@server2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@server1
Then you should be able to ssh to your slave without a password. E.g.
ssh server2
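To verify that passwordless ssh works to every node, run a remote command on each host (hostnames as above); each should print its hostname without asking for a password:
for h in server1 server2; do ssh $h hostname; done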

- Requirement: Java
You have to install Java first. Before that, update the package lists on both VMs:
sudo apt update
Then install Java 8 on both nodes.
sudo apt install openjdk-8-jdk
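As a quick sanity check, confirm which Java was installed (the exact path may differ on your system):
java -version
readlink -f $(which java)   # full path of the installed JDK, e.g. under /usr/lib/jvm/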
- Install Hadoop on both the master and the slave. Download the tarball. Assume we use version 3.2.1.
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
(you can check the available versions at https://downloads.apache.org/hadoop/core/)
tar xzf hadoop-3.2.1.tar.gz
mv hadoop-3.2.1 hadoop
Note: the current version is listed at https://hadoop.apache.org/releases.html (3.4.1 as of Oct 2024).
- Set up the search path for Hadoop on both master and slave.
Edit .bashrc
vi ~/.bashrc
Append the following environment variables. (Note: the CLASSPATH line that runs hadoop classpath comes last, after $HADOOP_HOME/bin has been added to the PATH, so that the hadoop command can be found.)
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export CLASSPATH=./
export CLASSPATH=$CLASSPATH:`hadoop classpath`:.:
Apply the changes.
source ~/.bashrc
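Verify that the environment took effect; the hadoop command should now be on the PATH:
echo $HADOOP_HOME
hadoop version   # should report Hadoop 3.2.1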
Set up the Java path on the master and all slaves: edit the Hadoop environment file to add JAVA_HOME.
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the following line to the file.
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
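You can run the expression on its own first to check which path it resolves to; with openjdk-8 on Ubuntu it is typically something like /usr/lib/jvm/java-8-openjdk-amd64/jre.
jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'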
Next, set up the Hadoop configuration files on the master.
cd $HADOOP_HOME/etc/hadoop
Edit core-site.xml and add the following. This points the default filesystem (the namenode) to the master node.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://server1:9000</value>
</property>
</configuration>
Edit hdfs-site.xml. The replication factor is set to 2 (one copy on each of the two nodes); the other properties set the local storage locations for the namenode and datanode.
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
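Hadoop normally creates these directories itself when the namenode is formatted and the datanode starts, but you can create them up front on both nodes so ownership and permissions are correct (paths as configured above):
mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode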
Edit mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- below is not necessary -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
</configuration>
On the master, add the workers to the workers file.
vi $HADOOP_HOME/etc/hadoop/workers
Add the names of your worker nodes, as shown below.
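For example, assuming the master also runs a DataNode (as the jps output later in this guide suggests), the workers file would contain:
server1
server2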

Copy the master configuration to all slaves.
scp $HADOOP_HOME/etc/hadoop/* server2:$HADOOP_HOME/etc/hadoop/

On the master, format the namenode.
hdfs namenode -format


- Start the HDFS file system on the master.
start-dfs.sh
Run jps on the master to check.

On the slave, also run jps.
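At this point (HDFS started, YARN not yet), the jps output should look roughly like the following; process IDs will differ, and the master shows a DataNode only if it is listed in the workers file.
On the master: NameNode, SecondaryNameNode, DataNode, Jps
On the slave: DataNode, Jps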

Check the NameNode web UI in a browser on port 9870.
Since the cluster is behind a firewall, open the port on the master.
sudo ufw allow 9870/tcp
Then create an ssh tunnel so that the port can be reached from the web browser on your local computer.
ssh -N -L 9870:10.3.135.170:9870 [email protected] -vv
Go to localhost:9870. You should see 2 live nodes.


Next, change yarn-site.xml on the slave node (and on the master as well, since it also runs a NodeManager) so that it points to the master node.
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>server1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
You may also tune other yarn-site.xml settings if needed (optional).
On the master node, start YARN. Type
start-yarn.sh

Next, check using jps. On the master you should now have five daemons: NameNode, ResourceManager, SecondaryNameNode, NodeManager, DataNode. If not, stop YARN with stop-yarn.sh and check the error logs.
If you get stuck when running MapReduce due to some lock problem, see the stack dump at http://localhost:8088/stacks (port 8088 is the ResourceManager web UI).
Also check core-site.xml on the slave node; it must contain the following so that the datanode points to the namenode on the master (this is already the case if you copied the configuration from the master earlier).
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://server1:9000</value>
</property>
</configuration>
If you get stuck, e.g. MapReduce is not progressing, it may be a memory problem. Check the available memory in the ResourceManager UI. See this thread:
https://community.cloudera.com/t5/Support-Questions/map-reduce-stuck-at-0/m-p/106777

Note:
** Error logs are in /home/hadoop/hadoop/logs/, where xxxxx in the command below is the daemon name (e.g. datanode). Use ls on that directory to find the exact file name.
tail /home/hadoop/hadoop/logs/hadoop-hadoop-xxxxx.log
If you run jps on the slave node, you should see two daemons: DataNode and NodeManager.

If successful, open the firewall on the master for the YARN web UI as well; YARN listens on port 8088.
sudo ufw allow 8088/tcp
Then create an ssh tunnel so that the port can be reached from the web browser on your local computer.
ssh -N -L 8088:10.3.135.170:8088 [email protected] -vv
Go to localhost:8088. You should see two active nodes now.

Check from the command line:
yarn node -list

yarn application -list

Testing the cluster
Get a data set: make a directory and download the data.
mkdir txtdata
cd txtdata
wget http://corpus.canterbury.ac.nz/resources/cantrbry.zip
unzip cantrbry.zip
Check the extracted files.
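For example, list them; you should see the Canterbury corpus files such as alice29.txt:
ls -l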

Copy the data to HDFS. Create a directory on HDFS and copy the files into it (run these from your home directory; cd .. first if you are still inside txtdata).
hdfs dfs -mkdir /txtdata

hdfs dfs -put txtdata/* /txtdata
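Confirm the upload:
hdfs dfs -ls /txtdata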

To submit the job,
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "/txtdata/*" output
Then check YARN in the web browser (localhost:8088). You should see one submitted application.



To rerun, first remove the previous output directory (it is not empty, so use -rm -r rather than -rmdir):
hdfs dfs -rm -r output
Inspect the output: obtain the result file from HDFS and view it with more, as shown below.
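A minimal way to do this, assuming the default single-reducer output file name part-r-00000:
hdfs dfs -ls output                       # list the job output files
hdfs dfs -cat output/part-r-00000 | more  # view the word counts directly
hdfs dfs -get output/part-r-00000 .       # or copy the file to the local filesystem
more part-r-00000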
