Hadoop Ecosystem Installation
https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuration overview:
The exercise was done on a CentOS environment.
- Install JDK
sudo yum search jdk
sudo yum install java-1.8.0-openjdk-devel
- Setup SSH public-key authentication
# Generate key-pair using default values
ssh-keygen
# Append public key to authorized keys file
cd .ssh
cat id_rsa.pub >> authorized_keys
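If ssh still prompts for a password, sshd typically rejects an authorized_keys file with loose permissions; tightening them usually fixes it:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys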
Confirm that key-based ssh login succeeds
ssh localhost
- Download Hadoop installation file
# https://hadoop.apache.org/releases.html
curl -O http://apache.website-solution.net/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
# Unzip the bundle
tar xvf hadoop-2.9.2.tar.gz
- Set JAVA_HOME
# Trace the JRE/JDK home from the java binary on the PATH
which java
readlink -f $(which java)   # strip the trailing /bin/java to get the JRE path
Update the bash profile: vim ~/.bash_profile
# Add JAVA_HOME
export JAVA_HOME={JRE_PATH}
- Update ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
# Locate the path for both variables
export JAVA_HOME={JAVA_HOME}
export HADOOP_CONF_DIR={HADOOP_HOME}/etc/hadoop
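As a quick sanity check that the JDK and configuration directory are picked up (assuming the tarball was unpacked to ~/hadoop, matching the HADOOP_HOME used later):
cd ~/hadoop
./bin/hadoop version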
- Update etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Update etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
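To confirm both settings are read from the active configuration directory, hdfs getconf can echo them back (a quick check, not part of the official guide):
./bin/hdfs getconf -confKey fs.defaultFS    # expect hdfs://localhost:9000
./bin/hdfs getconf -confKey dfs.replication # expect 1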
- Update memory settings according to the machine spec
a. Download Hadoop's companion files (https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_command-line-installation/content/download-companion-files.html)
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
b. Run yarn-utils.py
cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/scripts
python yarn-utils.py -c 2 -m 3 -d 1 false
Using cores=2 memory=3GB disks=1 hbase=True
Profile: cores=2 memory=2048MB reserved=0GB usableMem=1GB disks=1
Num Container=3
Container Ram=682MB
Used Ram=1GB
Unused Ram=0GB
yarn.scheduler.minimum-allocation-mb=682
yarn.scheduler.maximum-allocation-mb=2046
yarn.nodemanager.resource.memory-mb=2046
mapreduce.map.memory.mb=682
mapreduce.map.java.opts=-Xmx545m
mapreduce.reduce.memory.mb=1364
mapreduce.reduce.java.opts=-Xmx1091m
yarn.app.mapreduce.am.resource.mb=1364
yarn.app.mapreduce.am.command-opts=-Xmx1091m
mapreduce.task.io.sort.mb=272
Note: Helper tool to convert output to XML for pasting to config files:
# File properties.txt contains "yarn.scheduler.minimum-allocation-mb" to "mapreduce.task.io.sort.mb"
sed -e "s/\([^=\]*\)=\(.*\)/<property>\n <name>\1<\/name>\n <value>\2<\/value>\n<\/property>\n/g" properties.txt
c. Patch $HADOOP_HOME/etc/hadoop/yarn-site.xml with the values obtained from (b)
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>682</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2046</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2046</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>0.0.0.0:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://<LOG_SERVER_HOSTNAME>:19888/jobhistory/logs</value>
</property>
</configuration>
# Start the MapReduce job history server referenced by yarn.log.server.url
mr-jobhistory-daemon.sh start historyserver
d. Patch $HADOOP_HOME/etc/hadoop/mapred-site.xml with the values obtained from (b)
<configuration>
<property>
<name>mapreduce.map.memory.mb</name>
<value>682</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx545m</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1364</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1091m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1364</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1091m</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>272</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
e. Adjust $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml if containers can't be bootstrapped.
<configuration>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<!-- Defaults to 0.1 -->
<value>0.5</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
</configuration>
- Format the filesystem:
./bin/hdfs namenode -format
- Start hadoop services
# NameNode daemon and DataNode daemon
./sbin/start-dfs.sh
# Yarn
./sbin/start-yarn.sh
# Job history server
./sbin/mr-jobhistory-daemon.sh start historyserver
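Once the daemons are up, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and JobHistoryServer; the web UIs are another quick check (Hadoop 2.x default ports):
jps
# NameNode UI:        http://localhost:50070
# ResourceManager UI: http://localhost:8088
# Job history UI:     http://localhost:19888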
- Test setup with word count job
cd $HADOOP_HOME
./bin/hadoop fs -mkdir -p /user/{USER}
./bin/hadoop fs -put LICENSE.txt /user/{USER}/LICENSE.txt
# wordcount needs an input path and an output path; the output directory (any unused path) must not already exist
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/{USER}/LICENSE.txt /user/{USER}/wordcount_output
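To inspect the result (the output directory name above is only an example; a single reducer writes part-r-00000):
./bin/hadoop fs -cat /user/{USER}/wordcount_output/part-r-00000 | head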
./hadoop fs -ls /
ls: Call From dennis-VirtualBox/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Likely cause: the NameNode is not running or was not properly formatted. Reformat it and restart DFS:
cd $HADOOP_HOME
sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh
jps # Check NameNode is running
netstat -tulpn | grep 9000 # Confirm port is opened. Netstat command may differ according to distro
Container launch failed for container_1549731309824_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
Add the following property to $HADOOP_HOME/etc/hadoop/yarn-site.xml and restart YARN:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
- Install Scala https://www.scala-lang.org/download/
curl -O https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.rpm
sudo yum install scala-2.12.8.rpm
- Download Spark
# Download site: https://spark.apache.org/downloads.html
curl -O http://apache.01link.hk/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
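Unpack the bundle and move it into place; the target path below is an assumption chosen to match the SPARK_HOME used in the next step:
tar xvf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 /home/{user}/spark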
- Update environment variables
vi ~/.bash_profile
export HADOOP_CONF_DIR=/home/{user}/hadoop/etc/hadoop
export SPARK_HOME=/home/{user}/spark
export LD_LIBRARY_PATH=/home/{user}/hadoop/lib/native:$LD_LIBRARY_PATH
# Update python location according to environment setup
export PYSPARK_PYTHON=/usr/bin/python3
- Update Spark configuration
cd ${SPARK_HOME}/conf
cp spark-defaults.conf.template spark-defaults.conf
Update configuration ${SPARK_HOME}/conf/spark-defaults.conf
spark.master yarn
spark.driver.bindAddress 127.0.0.1
spark.driver.host 127.0.0.1
spark.ui.port 4040
spark.yarn.jars hdfs:///user/spark/share/lib/*.jar
Note: Copy $SPARK_HOME/jars to hdfs:///user/spark/share/lib
# Hack to quickly upload the jars ...
# WARNING: moveFromLocal will delete the files from the local directory
cd $SPARK_HOME
cp -r jars jars.0
cd jars.0
hadoop fs -mkdir -p /user/spark/share/lib/
hadoop fs -moveFromLocal * /user/spark/share/lib/
cd ..
rmdir jars.0
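A quick end-to-end check that Spark runs on YARN is the bundled SparkPi example (the jar name assumes the stock spark-2.4.0-bin-hadoop2.7 / Scala 2.11 build):
cd $SPARK_HOME
./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.0.jar 10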
# Cleanup after failed submissions: remove stale Spark staging files and kill leftover YARN applications
./hadoop fs -rm -r /user/cloud_user/.sparkStaging/*
for a in $(./yarn application -list | egrep "^application" | cut -d ' ' -f1); do ./yarn application -kill $a; done
Error: java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
cd ${SPARK_HOME}/conf
cp spark-env.sh.template spark-env.sh
# Update file spark-env.sh
SPARK_LOCAL_IP=127.0.0.1
Exception in thread "main" java.net.ConnectException: Call From {hostname}/{host ip} to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
# Check the services are running ...
> jps # Sample output
9248 SparkSubmit
8705 ResourceManager
8546 SecondaryNameNode
9018 NodeManager
8283 DataNode
8062 NameNode
9406 Jps
If the Hadoop services are not running, start them as follows:
cd ${HADOOP_HOME}
./sbin/start-dfs.sh
./sbin/start-yarn.sh
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
Define environment variable PYSPARK_PYTHON
# ~/.bashrc
export PYSPARK_PYTHON={PYTHON_COMMAND}
To find the python path, run: which python
- Download Pig installation http://www.apache.org/dyn/closer.cgi/pig
sudo curl -O http://apache.communilink.net/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar xvf pig-0.17.0.tar.gz
Bash profile (~/.bash_profile)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre
export HADOOP_HOME=/home/{user}/hadoop
export HADOOP_CONF_DIR=/home/{user}/hadoop/etc/hadoop
export SPARK_HOME=/home/{user}/spark
export PYSPARK_PYTHON=/usr/bin/python3
export LD_LIBRARY_PATH=/home/{user}/hadoop/lib/native:$LD_LIBRARY_PATH
export PIG_HOME=/home/{user}/pig
PATH=$PATH:$HOME/.local/bin:$HOME/bin
PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PIG_HOME/bin:$PATH
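Reload the profile and confirm each tool resolves from the updated PATH:
source ~/.bash_profile
hadoop version
spark-submit --version
pig -version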
- Start DFS
./sbin/start-dfs.sh
- List DFS files
./bin/hdfs dfs -ls
- Start YARN
./sbin/start-yarn.sh
Default URL: http://localhost:8088
- Execute Map Reduce job
./bin/yarn jar {JAR_PATH} {app} ...
# Example word mean ...
./bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordmean LICENSE.txt license_avg_count
Data source: https://www.kaggle.com/spscientist/students-performance-in-exams
# Upload data file to HDFS
hdfs dfs -put StudentsPerformance.csv
# Load data to PIG
student = LOAD 'StudentsPerformance.csv' USING PigStorage(',') as ( gender:chararray, race:chararray, parental_education:chararray, lunch:chararray, prep_course:chararray, math_score:chararray, reading_score:chararray, writing_score:chararray );
# Store data to HDFS
STORE student INTO 'student_Output' USING PigStorage(',');
# Show data
DUMP student;
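To verify the STORE step from the HDFS side (PigStorage writes part-* files under the output directory):
hdfs dfs -ls student_Output
hdfs dfs -cat 'student_Output/part-*' | head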
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
The following three commands look the same but have subtle differences:
hadoop fs {args}
hadoop dfs {args}
hdfs dfs {args}
hadoop fs {args}: FS relates to a generic file system that can point to any file system, such as the local FS, HDFS, (S)FTP, S3, and others, so use it when you may be dealing with different file systems.
hadoop dfs {args}: specific to HDFS; it works only for HDFS operations and has been deprecated in favour of hdfs dfs.
hdfs dfs {args}: same scope as the second, i.e. it works for all HDFS operations and is the recommended replacement for hadoop dfs.
Below is the list of commands categorized as hdfs commands:
namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups
So even if you use hadoop dfs, it will locate hdfs and delegate the command to hdfs dfs.
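A quick illustration of the difference in scope (paths are examples):
# Equivalent for HDFS paths
hadoop fs -ls /user
hdfs dfs -ls /user
# hadoop fs also accepts other file system URIs, e.g. the local file system
hadoop fs -ls file:///tmp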