{ 2.1 } Installing Hadoop

Create Hadoop Cluster on Virtual Machine

Download CentOS Image

Basic configuration for all the Hadoop nodes (1 Master, 4 Slaves)

  1. Open the VM as a copy

  2. Change the virtual machine name (you can use Master, Slave1, …)

  3. Change the password

    >passwd
  4. Log on as root (password: tomtom)

  5. Create a sudoer user, for example as shown below

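    A minimal sketch of this step on CentOS, assuming a hypothetical user named hadoopuser (enable the wheel group in /etc/sudoers with visudo if it is not already enabled):

    >useradd hadoopuser    # "hadoopuser" is only an example name
    >passwd hadoopuser
    >usermod -aG wheel hadoopuser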
  1. Log on with the sudoer user

  2. Change the hostname in a terminal (as root)

    >hostname masterbicing
  3. Verify the hostname has been changed using the "hostname" command

  4. Find the machine IP (the four octets after "inet addr" in the Ethernet interface output), for example as shown below

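    For example, on CentOS the address can be read from the Ethernet interface output (the interface name is an assumption and may differ on your VM):

    >ifconfig eth0 | grep "inet addr"    # eth0 is an example interface name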
  5. Add all the hosts to the /etc/hosts file (we will use /etc/hosts instead of a DNS server)

    >192.168.1.128 masterbicing
    >192.168.1.129 slave1
  6. Use the ping command to verify that each machine can reach all the other machines by hostname, for example as shown below

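    For example, from the master (the -c 3 option sends only 3 packets):

    >ping -c 3 slave1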
  7. Edit the /etc/sysconfig/network file and set HOSTNAME to the new hostname (see the example below)

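    After the change, the file should look roughly like this (shown for the master; use each node's own hostname):

    >NETWORKING=yes
    >HOSTNAME=masterbicing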
  8. Open a web browser and download Hadoop from the Apache Hadoop website. File: hadoop-1.2.1-bin.tar.gz

  9. Copy the file to the /opt directory and extract it

    >cd /opt
    >tar xvfz hadoop-1.2.1-bin.tar.gz
  10. Add Hadoop to the path

    >echo 'export PATH=$PATH:/opt/hadoop-1.2.1/bin' > /etc/profile.d/hadoop.sh
  11. Download the Java Development Kit 6u31 from Oracle (registration is required). File: jdk-6u31-linux-x64-rpm.bin

  12. Move the binary file to the /opt folder and make it executable

    >cd /opt
    >chmod u+x jdk-6u31-linux-x64-rpm.bin
  13. Execute the binary to install java

    >./jdk-6u31-linux-x64-rpm.bin
  14. Add the binaries to the path

    >echo 'export PATH=$PATH:/usr/java/default/bin/' > /etc/profile.d/java.sh
  15. Close the terminal and open a new one (again as root). Verify which Java version is installed (it should be 1.6.0_31)

    >java -version

Install and Launch Hadoop

  1. Go to the Hadoop configuration folder. Add the lines below at the beginning of the /opt/hadoop-1.2.1/conf/hadoop-env.sh file

    >JAVA_HOME=/usr/java/default
    >export HADOOP_HEAPSIZE=12
  2. Copy the 3 files from the resources folder of this GitHub project to /opt/hadoop-1.2.1/conf/ (a sketch of their typical content is shown after the file list)

    >mapred-site.xml
    >hdfs-site.xml
    >core-site.xml
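    The authoritative content is in the resources folder; as a rough sketch, for this layout the Hadoop 1.x files typically contain properties like the following (the HDFS port 8020 is an assumption, the /srv/data paths match the folders created below):

    >core-site.xml:   <property><name>fs.default.name</name><value>hdfs://masterbicing:8020</value></property>  <!-- port 8020 assumed -->
    >hdfs-site.xml:   <property><name>dfs.name.dir</name><value>/srv/data/dfs/nn</value></property>
    >                 <property><name>dfs.data.dir</name><value>/srv/data/dfs/dn</value></property>
    >mapred-site.xml: <property><name>mapred.job.tracker</name><value>masterbicing:8021</value></property>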
  3. Make sure the mapred-site.xml file contains the correct master hostname (in my case, masterbicing)

    <value>masterbicing:8021</value>
  4. Create the HDFS folders.

    >mkdir -pv /srv/data/dfs/nn /srv/data/dfs/dn
    >mkdir -pv /srv/data/dfs/sn
  5. ONLY FOR MASTER MACHINE. Format HDFS and validate it.

    >hadoop namenode -format
    >ls -ltrh /srv/data/dfs/nn/
  6. Patch the service starter scripts.

    >sed -i 's/hadoop-daemons/hadoop-daemon/g' /opt/hadoop-1.2.1/bin/start-dfs.sh
  7. ONLY FOR MASTER MACHINE. Edit the /opt/hadoop-1.2.1/bin/start-dfs.sh file and comment out (#) the line "start datanode"

    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
  8. ONLY FOR SLAVE MACHINES. Edit the /opt/hadoop-1.2.1/bin/start-dfs.sh file and comment out (#) the lines that contain "start namenode" and "start secondarynamenode"

    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
  9. Make sure sshd is running

    >service sshd start
  10. The firewall may block connections, so configure iptables. If the nodes are on a secure local area network (the usual case), allow all connections by running these commands on all nodes:

    >iptables -F
    >service iptables save
  11. When all the previous steps have been executed on all the servers (master and slaves), launch the DFS daemons.

    >cd /opt/hadoop-1.2.1/bin
    >./start-dfs.sh
  12. List all running Java processes on each node. The master node should show "NameNode" and "SecondaryNameNode"; the slave nodes should show "DataNode" (example output below).

    >jps -m
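    Example output on the master (the PIDs shown here are illustrative and will differ):

    >2481 NameNode
    >2613 SecondaryNameNode
    >2745 Jps -m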
  13. Open a browser and navigate to the HDFS web UI. All 4 slave nodes should appear as "Live Nodes"

    >http://masterbicing:50070
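    Alternatively, the number of live DataNodes can also be checked from the command line on the master:

    >hadoop dfsadmin -report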

MapReduce Installation

  1. Create the MapReduce temporary folder

    >mkdir -pv /srv/data/mapred/local
  2. Patch the service starter scripts.

    >sed -i 's/hadoop-daemons/hadoop-daemon/g' /opt/hadoop-1.2.1/bin/start-mapred.sh
  3. ONLY FOR MASTER MACHINE. Edit the /opt/hadoop-1.2.1/bin/start-mapred.sh file and comment out (#) the line "start tasktracker"

    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
  4. ONLY FOR SLAVE MACHINES. Edit the /opt/hadoop-1.2.1/bin/start-mapred.sh file and comment out (#) the line that contains "start jobtracker"

    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
  5. When all the previous steps have been executed on all the servers (master and slaves), launch the MapReduce daemons.

    >cd /opt/hadoop-1.2.1/bin
    >./start-mapred.sh
  6. List all running Java processes on each node. The master node should show "JobTracker"; the slave nodes should show "TaskTracker".

    >jps -m

Test MapReduce

  1. On one of the servers, open a terminal as root and create the user's home directory in HDFS.

    >hadoop fs -mkdir /user/root
    >hadoop fs -chown root:root /user/root
    >hadoop fs -mkdir /tmp/input
    >hadoop fs -put /etc/passwd /tmp/input
  2. Go to the folder containing the sample jar and run the MapReduce job that does a "grep" for the "bash" token on the uploaded file.

    >cd /opt/hadoop-1.2.1
    >hadoop jar hadoop-examples-1.2.1.jar grep /tmp/input /tmp/output bash
  3. See the result

    >hadoop fs -cat /tmp/output/part-00000
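    The output should be a single line with the number of matches of the "bash" token, roughly like the following (the count shown is only an example and depends on how many accounts in /etc/passwd use bash):

    >4	bash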
  4. Find the job in the JobTracker web UI (http://masterbicing:50030)
