Hadoop

Modules

  • The project includes these modules:

    • Hadoop Common: The common utilities that support the other Hadoop modules. (management commands and shared utilities)

    • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. (the distributed file layer)

    • Hadoop YARN: A framework for job scheduling and cluster resource management. (the distributed computing layer that schedules and runs programs)

    • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. (the parallel data-processing engine)

    • Hadoop Ozone: An object store for Hadoop. (a distributed object store; "Ozone" as in the ozone layer)

  • Hadoop prerequisites:

    • A cluster (multiple machines).

    • Operating system: Linux, for stability (e.g. Ubuntu or CentOS).

    • Java: Hadoop is written in Java, so the JDK/JRE must be installed (a minimal install sketch follows this list).

  • JRE (Java Runtime Environment): the environment in which Java programs run.

  • JDK (Java Development Kit): the toolkit for developing Java programs.
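
A minimal install sketch for the Java requirement, assuming an Ubuntu node and the distribution's OpenJDK 8 packages (package names vary by release):

$ sudo apt-get update
$ sudo apt-get install -y openjdk-8-jdk-headless
$ java -version    # should report something like: openjdk version "1.8.0_265"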

Army: dtc (cluster) on physical machines

Navy: dtk (installed with k8s/k3s) on physical machines

Air Force: dkc (docker compose) on virtual (cloud) machines, i.e. containers


Sysinternals

118.163.47.126

$ nano ~/.bashrc

export PATH=/home/bigred/wk/cnt/dtc/bin:/home/bigred/wk/dtc/bin:$PATH
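
Reload the shell configuration so the new PATH takes effect in the current session:

$ source ~/.bashrc
$ echo $PATH    # the dtc bin directories should now be listed first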

$ nano sysinfo

# work out the default-route interface, its IP address, the network ID, and the default gateway
gw=$(route -n | grep -e "^0.0.0.0 ")
export GWIF=${gw##* }
ips=$(ifconfig $GWIF | grep 'inet ')
export IP=$(echo $ips | cut -d' ' -f2)
export NETID=${IP%.*}
export GW=$(route -n | grep -e '^0.0.0.0' | tr -s \ - | cut -d ' ' -f2)

echo "[`hostname`]"
echo "--------------------------------------------------------"

# total memory as reported by free
m=$(free -mh | grep Mem:)
echo -n "Memory : "
echo $m | cut -d' ' -f2

# CPU model name and logical core count from /proc/cpuinfo
cn=$(cat /proc/cpuinfo | grep 'model name' | head -n 1 | cut -d ':' -f2 | tr -s ' ')
echo -n "CPU : $cn (core: "
cn=$(cat /proc/cpuinfo | grep 'model name' | wc -l)
echo "$cn)"

echo "IP Address : $IP"
echo "Default Gateway : $GW"
echo ""

java -version &> /tmp/java 
cat /tmp/java | head -n 1
echo ""

echo "/etc/hosts"
cat /etc/hosts | grep -E "^[0-9]{3}"
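
sysinfo is meant to be fetched over HTTP and piped into bash by the dt script below, but it can also be run directly for a quick local test. A sketch, assuming the file was saved in the directory that busybox httpd serves (~/wk/cdt/bin/cluster in the dt script):

$ bash ~/wk/cdt/bin/cluster/sysinfo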

$ nano dt

#!/bin/bash
echo -e "CDT 20.08\n"

[ "$#" != 1 ] && echo 'dt [sysinfo | sysprep | build | list | restart]' && exit 1
c="sysinfo sysprep build list restart"
[ ! "${c}" =~ "$1" ](/xuan103/class-2020-07/wiki/-!-"${c}"-=~-"$1"-) && echo "Oops, wrong command" && exit 1

i=$(cat /etc/hosts | grep -E "mas|wka" | tr '\t' ' ' | cut -d' ' -f1)
c=$1
#echo $i

# start a small busybox web server on port 8888 (once) to serve the cluster scripts to the nodes
ps aux | grep -v grep | grep -o 'busybox httpd' &>/dev/null
[ "$?" != "0" ] && busybox httpd -p 8888 -h ~/wk/cdt/bin/cluster

case $c in
sysinfo)
    for x in $i
    do
      nc -w 2 -z $x 22
      [ "$?" != "0" ] && continue
      ssh $x 'wget -qO - http://192.168.66.253:8888/sysinfo | bash'
      echo ""
    done
    ;;
sysprep)
    for x in $i
    do
      nc -w 2 -z $x 22 &>/dev/null
      [ "$?" != "0" ] && continue
      ssh $x 'wget -qO - http://192.168.66.253:8888/sysprep | bash'
      echo ""
    done
    ;;
build)
    for x in $i
    do
      nc -w 2 -z $x 22
      [ "$?" != "0" ] && continue
      ssh $x 'wget -qO - http://192.168.66.253:8888/hdp330 | bash'
      ssh $x 'wget -qO - http://192.168.66.253:8888/spk300 | bash'
      ssh $x 'wget -qO - http://192.168.66.253:8888/dt.bash | sudo tee /opt/bin/dt.bash &>/dev/null'
      [ "$?" == "0" ] && echo "dt.bash copied"
      ssh $x 'wget -qO - http://192.168.66.253:8888/environment | sudo tee /home/kuan/.ssh/environment &>/dev/null'
      [ "$?" == "0" ] && echo "environment copied"

      for y in core-site.xml hadoop-env.sh hdfs-site.xml mapred-site.xml yarn-site.xml
      do
        u="wget -qO - http://192.168.66.253:8888/$y | sudo tee /opt/hadoop-3.3.0/etc/hadoop/$y &>/dev/null"
        ssh $x $u
        [ "$?" == "0" ] && echo "$y copied"
      done
      echo ""
    done
    cp ~/wk/cdt/bin/cluster/*.xml ~/wk/cdt/mnt/hdp330
    cp ~/wk/cdt/bin/cluster/*.sh ~/wk/cdt/mnt/hdp330
    ;;
list)
    ;;
restart)
    read -p "Are you sure ? (YES/NO) " ans
    [ $ans != "YES" ] && exit 1
    for x in $i
    do
      nc -w 2 -z $x 22 &>/dev/null
      [ "$?" != "0" ] && continue
      ssh $x 'sudo reboot' 
    done
    ;;
esac
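
A usage sketch, assuming passwordless SSH from gw to every mas/wka host in /etc/hosts, that dt was saved under ~/wk/cnt/dtc/bin (already on PATH via ~/.bashrc above), and that 192.168.66.253 is an address of gw that the nodes can reach:

$ chmod +x ~/wk/cnt/dtc/bin/dt
$ dt
CDT 20.08

dt [sysinfo | sysprep | build | list | restart]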

bigred@gw:~/wk/cnt/dtc/bin$ dt sysinfo

CDT 20.08

[mas01]

Memory : 3.3G
CPU : Intel(R) Atom(TM) x5-Z8350 CPU @ 1.44GHz (core: 4)
IP Address : 192.168.40.10
Default Gateway : 192.168.40.254

openjdk version "1.8.0_265"

/etc/hosts
127.0.0.1 localhost
192.168.40.254 gw
192.168.40.10 mas01
192.168.40.20 wka01
192.168.40.21 wka02
192.168.40.22 wka03
192.168.40.23 wka04
192.168.40.30 ds01

[wka01]

Memory : 3.3G
CPU : Intel(R) Atom(TM) x5-Z8350 CPU @ 1.44GHz (core: 4)
IP Address : 192.168.40.20
Default Gateway : 192.168.40.254

openjdk version "1.8.0_265"

/etc/hosts
127.0.0.1 localhost
192.168.40.254 gw
192.168.40.10 mas01
192.168.40.20 wka01
192.168.40.21 wka02
192.168.40.22 wka03
192.168.40.23 wka04
192.168.40.30 ds01

[wka02]

Memory : 3.3G
CPU : Intel(R) Atom(TM) x5-Z8350 CPU @ 1.44GHz (core: 4)
IP Address : 192.168.40.21
Default Gateway : 192.168.40.254

openjdk version "1.8.0_265"

/etc/hosts
127.0.0.1 localhost
192.168.40.254 gw
192.168.40.10 mas01
192.168.40.20 wka01
192.168.40.21 wka02
192.168.40.22 wka03
192.168.40.23 wka04
192.168.40.30 ds01

[wka03]

Memory : 3.3G
CPU : Intel(R) Atom(TM) x5-Z8350 CPU @ 1.44GHz (core: 4)
IP Address : 192.168.40.22
Default Gateway : 192.168.40.254

openjdk version "1.8.0_265"

/etc/hosts
127.0.0.1 localhost
192.168.40.254 gw
192.168.40.10 mas01
192.168.40.20 wka01
192.168.40.21 wka02
192.168.40.22 wka03
192.168.40.23 wka04
192.168.40.30 ds01

[wka04]

Memory : 3.3G
CPU : Intel(R) Atom(TM) x5-Z8350 CPU @ 1.44GHz (core: 4)
IP Address : 192.168.40.23
Default Gateway : 192.168.40.254

openjdk version "1.8.0_265"

/etc/hosts
127.0.0.1 localhost
192.168.40.254 gw
192.168.40.10 mas01
192.168.40.20 wka01
192.168.40.21 wka02
192.168.40.22 wka03
192.168.40.23 wka04
192.168.40.30 ds01

$ nano formathdfs.sh

#!/bin/bash
read -p "Are you sure ? (YES/NO) " ans
[ "$ans" != "YES" ] && echo "abort format HDFS" && exit 1

ssh master rm -r nn/* &>/dev/null
ssh master rm -r sn/* &>/dev/null

for n in wka01 wka02 wka03 wka04 wka05
do
   nc -w 1 -z $n 22 &>/dev/null
   [ "$?" == "0" ] && ssh $n rm -r dn/* &>/dev/null
   echo "$n clean"
done

ssh master 'hdfs namenode -format -clusterID cute' &>/dev/null
[ "$?" != "0" ] && echo "formathdfs failure" && exit 1
echo "formathdfs ok"

$ nano starthdfs.sh

#!/bin/bash
ssh master hadoop-daemon.sh start namenode &>/dev/null
sleep 10; nc -w 5 -z master 8020 &>/dev/null
[ "$?" != 0 ] && echo "pls formathdfs first" && exit 1
echo "master: Name Node Started"

ssh master hadoop-daemon.sh start secondarynamenode &>/dev/null
[ "$?" == "0" ] && echo "master: Secondary Name Node started"

for n in wka01 wka02 wka03 wka04 wka05
do
   nc -w 5 -z $n 22 &>/dev/null
   if [ "$?" == "0" ]; then
      ssh $n hadoop-daemon.sh start datanode &>/dev/null
      [ "$?" == "0" ] && echo "$n: Data Node started"
   fi
done
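
Once the daemons are up, a sketch of how to verify them (jps lists the running Java daemons on each node; dfsadmin -report shows the DataNodes the NameNode has registered):

$ ssh master jps
$ for n in wka01 wka02 wka03 wka04; do ssh $n jps; done
$ ssh master 'hdfs dfsadmin -report | grep "Live datanodes"'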

  • Advantages of HDFS:

    • Handles very large files.

    • Nodes can be decommissioned with a command, so upgrades are much less troublesome.

    • Runs on clusters of inexpensive commodity machines.

    • Disk capacity can be scaled out without practical limit.

    • Provides distributed reads.

      • When a client writes data (roughly 16 MB or more), the NameNode hands out three DataNode addresses to write to (the number of replicas is configured in hdfs-site.xml), and the data is split into 128 MB blocks (see the sketch after this list).

      • When a client reads, the blocks are fetched back from those DataNodes.
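
The replication factor and block size mentioned above come from hdfs-site.xml (dfs.replication and dfs.blocksize). A sketch of how to observe them on a running cluster; /tmp/somefile is just a placeholder path:

$ hdfs dfs -put somefile /tmp/somefile
$ hdfs fsck /tmp/somefile -files -blocks -locations
$ hdfs dfs -setrep 2 /tmp/somefile    # change the replica count for one file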

http://hadoop.apache.org/