Hadoop Ecosystem Installation

Installation Guideline

Hadoop Pseudo-Distributed Installation (all-in-one setup)

https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-common/SingleCluster.html

Configuration overview:

https://github.com/dennisholee/notes/blob/master/hadoop_config.svg

The exercise was done on a CentOS environment.

  1. Install JDK
sudo yum search jdk
sudo yum install java-1.8.0-openjdk-devel
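
To confirm the JDK installed correctly, a quick check (the exact version string will vary):

# Verify the JDK and compiler are on the PATH
java -version
javac -version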
  2. Set up SSH public-key authentication
# Generate key-pair using default values
ssh-keygen

# Append public key to authorized keys file
cd .ssh
cat id_rsa.pub >> authorized_keys

Confirm that key-based SSH login succeeds

ssh localhost
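
If the login still prompts for a password, sshd usually requires strict permissions on the key files; a common fix (not Hadoop-specific):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys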
  3. Download Hadoop installation file
# https://hadoop.apache.org/releases.html
curl -O http://apache.website-solution.net/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

# Extract the bundle
tar xvf hadoop-2.9.2.tar.gz
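
Optionally, link the extracted directory to the stable path the later steps assume (the ~/hadoop location and HADOOP_HOME value here mirror the setup summary at the end, and assume the archive was extracted in the home directory):

# Symlink the versioned directory to a stable path and export HADOOP_HOME
ln -s ~/hadoop-2.9.2 ~/hadoop
export HADOOP_HOME=~/hadoop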
  4. Set JAVA_HOME
# Trace JRE HOME ...
which java # etc.

Update the bash profile (vim ~/.bash_profile):

# Add JAVA_HOME
export JAVA_HOME={JRE_PATH}
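
A sketch for resolving the actual JDK path on CentOS (the readlink chain follows the alternatives symlinks; the resulting path differs per machine):

# Resolve the real java binary, then strip the trailing /bin/java to get the JRE home
readlink -f $(which java)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))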
  5. Update ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
# Replace the placeholders with the actual paths for both variables
export JAVA_HOME={JAVA_HOME}
export HADOOP_CONF_DIR={HADOOP_HOME}/etc/hadoop
  6. Update etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
  7. Update etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
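
Once both files are saved, the values can be verified from ${HADOOP_HOME} with hdfs getconf:

./bin/hdfs getconf -confKey fs.defaultFS    # expect hdfs://localhost:9000
./bin/hdfs getconf -confKey dfs.replication # expect 1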
  8. Update memory settings according to the machine spec

a. Download Hadoop's companion files (https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_command-line-installation/content/download-companion-files.html)

wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz

b. Run yarn-utils.py

cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/scripts
python yarn-utils.py -c 2 -m 3 -d 1 false
 Using cores=2 memory=3GB disks=1 hbase=True
 Profile: cores=2 memory=2048MB reserved=0GB usableMem=1GB disks=1
 Num Container=3
 Container Ram=682MB
 Used Ram=1GB
 Unused Ram=0GB
 yarn.scheduler.minimum-allocation-mb=682
 yarn.scheduler.maximum-allocation-mb=2046
 yarn.nodemanager.resource.memory-mb=2046
 mapreduce.map.memory.mb=682
 mapreduce.map.java.opts=-Xmx545m
 mapreduce.reduce.memory.mb=1364
 mapreduce.reduce.java.opts=-Xmx1091m
 yarn.app.mapreduce.am.resource.mb=1364
 yarn.app.mapreduce.am.command-opts=-Xmx1091m
 mapreduce.task.io.sort.mb=272

Note: helper command to convert the output into XML for pasting into the config files:

# File properties.txt contains the lines from "yarn.scheduler.minimum-allocation-mb" to "mapreduce.task.io.sort.mb"
sed -e "s/\([^=\]*\)=\(.*\)/<property>\n  <name>\1<\/name>\n  <value>\2<\/value>\n<\/property>\n/g" properties.txt
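
One way to produce properties.txt directly from the yarn-utils.py output, assuming the property lines are the only ones starting with "yarn." or "mapreduce." (as in the sample above):

# Keep only the property lines, strip leading spaces, then feed the file to the sed command above
python yarn-utils.py -c 2 -m 3 -d 1 false \
  | grep -E '^ *(yarn|mapreduce)\.' \
  | sed 's/^ *//' > properties.txt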

c. Patch $HADOOP_HOME/etc/hadoop/yarn-site.xml with the values obtained from (b)

<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>682</value>
  </property>

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2046</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2046</value>
  </property>

  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>

  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.log.server.url</name>
    <value>http://<LOG_SERVER_HOSTNAME>:19888/jobhistory/logs</value>
  </property>
</configuration>
Start the job history server so that the yarn.log.server.url above resolves:

mr-jobhistory-daemon.sh start historyserver

d. Patch $HADOOP_HOME/etc/hadoop/mapred-site.xml with the values obtained from (b)

<configuration>

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>682</value>
  </property>

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx545m</value>
  </property>

  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1364</value>
  </property>

  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1091m</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1364</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx1091m</value>
  </property>

  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>272</value>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

e. Adjust $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml if containers can't be bootstrapped.

<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <!-- Defaults to 0.1 -->
    <value>0.5</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>
</configuration>
  9. Format the filesystem:
./bin/hdfs namenode -format
  10. Start Hadoop services
# NameNode daemon and DataNode daemon
./sbin/start-dfs.sh

# Yarn
./sbin/start-yarn.sh

# Job history server
./sbin/mr-jobhistory-daemon.sh start historyserver
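
After the daemons come up, a quick sanity check (the web UI ports below are the Hadoop 2.x defaults):

jps   # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, JobHistoryServer
curl -s http://localhost:50070 > /dev/null && echo "NameNode UI up"
curl -s http://localhost:8088  > /dev/null && echo "ResourceManager UI up"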
  11. Test the setup with a word count job
cd $HADOOP_HOME
./bin/hadoop fs -put LICENSE.txt /user/{USER}/LICENSE.txt
# wordcount takes an input path and a not-yet-existing output path
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/{USER}/LICENSE.txt /user/{USER}/wordcount_out
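
To inspect the result (wordcount_out is just the output directory chosen above; any unused HDFS path works):

./bin/hadoop fs -ls /user/{USER}/wordcount_out
./bin/hadoop fs -cat /user/{USER}/wordcount_out/part-r-00000 | head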

Installation Errors

HDFS connection failed

./hadoop fs -ls /
ls: Call From dennis-VirtualBox/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Likely cause: the NameNode is not properly formatted. Reformat and restart HDFS:

cd $HADOOP_HOME
sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh
jps # Check NameNode is running
netstat -tulpn | grep 9000 # Confirm port is opened. Netstat command may differ according to distro

Yarn auxiliary shuffle configuration undefined

Container launch failed for container_1549731309824_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist

Add the following property to $HADOOP_HOME/etc/hadoop/yarn-site.xml

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

Spark Installation

  1. Install Scala https://www.scala-lang.org/download/
curl -O https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.rpm
sudo yum install scala-2.12.8.rpm
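
Verify the Scala install (the version string should match the downloaded release):

scala -version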
  2. Download Spark
# Download site: https://spark.apache.org/downloads.html
curl -O http://apache.01link.hk/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
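
Extract the bundle and, optionally, symlink it to the ~/spark path assumed by the environment variables below (assuming the archive was downloaded to the home directory):

tar xvf spark-2.4.0-bin-hadoop2.7.tgz
ln -s ~/spark-2.4.0-bin-hadoop2.7 ~/spark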
  3. Update environment variables (vi ~/.bash_profile)
export HADOOP_CONF_DIR=/home/{user}/hadoop/etc/hadoop
export SPARK_HOME=/home/{user}/spark
export LD_LIBRARY_PATH=/home/{user}/hadoop/lib/native:$LD_LIBRARY_PATH

# Update python location according to environment setup
export PYSPARK_PYTHON=/usr/bin/python3
  4. Update Spark configuration
cd ${SPARK_HOME}/conf
cp spark-defaults.conf.template spark-defaults.conf

Update the configuration file ${SPARK_HOME}/conf/spark-defaults.conf:

spark.master                       yarn
spark.driver.bindAddress           127.0.0.1
spark.driver.host                  127.0.0.1
spark.ui.port                      4040   
spark.yarn.jars                    hdfs:///user/spark/share/lib/*.jar

Note: Copy the contents of $SPARK_HOME/jars to hdfs:///user/spark/share/lib

# Hack to quickly upload the jars ...
# WARNING: moveFromLocal deletes the files from the local directory after upload
cd $SPARK_HOME
cp -r jars jars.0
cd jars.0
hadoop fs -mkdir -p /user/spark/share/lib/
hadoop fs -moveFromLocal * /user/spark/share/lib/
cd ..
rmdir jars.0
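
With the jars uploaded, a minimal smoke test on YARN (pi.py ships with the Spark 2.4 binary distribution under examples/):

cd $SPARK_HOME
./bin/spark-submit --master yarn --deploy-mode client examples/src/main/python/pi.py 10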

Troubleshooting reference for "Application is added to the scheduler and is not yet activated":

https://stackoverflow.com/questions/49579156/hadoop-yarn-application-is-added-to-the-scheduler-and-is-not-yet-activated-sk


# Clean up stale Spark staging directories and kill any leftover YARN applications
./hadoop fs -rm -r /user/{USER}/.sparkStaging/*
for a in $(./yarn application -list | egrep "^application" | awk '{print $1}'); do ./yarn application -kill $a; done

Installation Errors

Error: java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.

cd ${SPARK_HOME}/conf
cp spark-env.sh.template spark-env.sh

# Update file spark-env.sh
SPARK_LOCAL_IP=127.0.0.1

Exception in thread "main" java.net.ConnectException: Call From {hostname}/{host ip} to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

# Check the services are running ...
> jps # Sample output 
9248 SparkSubmit
8705 ResourceManager
8546 SecondaryNameNode
9018 NodeManager
8283 DataNode
8062 NameNode
9406 Jps

If the Hadoop services are not running, start them as follows:

cd ${HADOOP_HOME}
./sbin/start-dfs.sh
./sbin/start-yarn.sh

Python not found when executing spark-submit

Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory

Define the PYSPARK_PYTHON environment variable

# ~/.bashrc
export PYSPARK_PYTHON={PYTHON_COMMAND}

To find the Python interpreter path: which python

Pig Installation

  1. Download the Pig distribution http://www.apache.org/dyn/closer.cgi/pig
sudo curl -O http://apache.communilink.net/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar xvf pig-0.17.0.tar.gz
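
Assuming PIG_HOME and PATH are set as in the setup summary below, a quick check that Pig starts (-x selects the execution mode):

pig -version   # print the installed Pig version
pig -x local   # open the grunt shell against the local file system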

Setup summary

Bash profile (~/.bash_profile)

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre

export HADOOP_HOME=/home/{user}/hadoop
export HADOOP_CONF_DIR=/home/{user}/hadoop/etc/hadoop
export SPARK_HOME=/home/{user}/spark
export PYSPARK_PYTHON=/usr/bin/python3
export LD_LIBRARY_PATH=/home/{user}/hadoop/lib/native:$LD_LIBRARY_PATH
export PIG_HOME=/home/{user}/pig

PATH=$PATH:$HOME/.local/bin:$HOME/bin
PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PIG_HOME/bin:$PATH


Common Usage

HDFS

  • Start DFS ./sbin/start-dfs.sh

  • List DFS files ./bin/hdfs dfs -ls (more examples below)
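
A few more everyday HDFS commands (the paths are illustrative):

./bin/hdfs dfs -mkdir -p /user/{USER}/data   # create a directory
./bin/hdfs dfs -put localfile.txt data/      # upload a local file
./bin/hdfs dfs -cat data/localfile.txt       # print file contents
./bin/hdfs dfs -rm -r data                   # remove a directory recursively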

YARN

./bin/yarn jar {JAR_PATH} {app} ...

# Example word mean ...
./bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordmean LICENSE.txt license_avg_count
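
To follow the job and read the result afterwards (license_avg_count is the output path from the command above; wordmean also prints the computed mean to the console):

./bin/yarn application -list                  # running/accepted applications
./bin/hadoop fs -cat license_avg_count/part-* # raw job output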

Use cases

Loading data using PIG

Data source: https://www.kaggle.com/spscientist/students-performance-in-exams

# Upload data file to HDFS
hdfs dfs -put StudentsPerformance.csv

# Load data to PIG
student = LOAD 'StudentsPerformance.csv' USING PigStorage(',') as ( gender:chararray, race:chararray, parental_education:chararray, lunch:chararray, prep_course:chararray, math_score:chararray, reading_score:chararray, writing_score:chararray );

# Store data to HDFS
STORE student INTO 'student_output' USING PigStorage(',');

# Show data
DUMP student;
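
The same statements can be saved in a script file and run non-interactively (student.pig is an arbitrary file name):

# Run the Pig Latin statements above as a batch script against the cluster
pig -x mapreduce student.pig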

Sample Map Reduce Tools

Library "hadoop-mapreduce-examples-2.9.2.jar"

./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar 
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Miscellaneous

What's the difference between “hadoop fs” shell commands and “hdfs dfs” shell commands?

https://stackoverflow.com/questions/18142960/whats-the-difference-between-hadoop-fs-shell-commands-and-hdfs-dfs-shell-co

The following three commands appear similar but have minor differences:

hadoop fs {args}
hadoop dfs {args}
hdfs dfs {args}

hadoop fs

fs refers to a generic file system abstraction that can point to any supported file system (local, HDFS, etc.), so it can be used when dealing with different file systems such as the local FS, (S)FTP, S3, and others.

hadoop dfs

dfs is specific to HDFS and works for operations relating to HDFS. It has been deprecated; use hdfs dfs instead.

hdfs dfs

Same as the second: it works for all operations related to HDFS and is the recommended command instead of hadoop dfs.

Below is the list of commands categorized as hdfs commands:

namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups

So even if you use hadoop dfs, it will locate hdfs and delegate the command to hdfs dfs.
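
A quick way to see the difference in practice (file:// forces the local file system, while plain paths resolve against fs.defaultFS, i.e. HDFS in this setup):

hadoop fs -ls file:///tmp   # generic FS shell pointed at the local file system
hdfs dfs -ls /              # always HDFS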
