Learning Hadoop - mikec964/chelmbigstock GitHub Wiki

We're studying Hadoop

I attended a terrific workshop by Arun Buduri called Full-day Hadoop MapReduce Hands-On; these notes grew out of it. He provided notes on setting up your laptop and operating Hadoop.

These instructions are an expansion of the class and notes from Arun's workshop, plus many hours of troubleshooting.

  1. Set up your laptop
  2. Install Hadoop
  3. Start and configure Hadoop
  4. Run some Hadoop projects
  5. Build a Hadoop cluster (with friends)
  6. Configuration files
  7. Python and Hadoop
  8. Other Resources

Set up your laptop

Overview:

  1. Windows users, set up for dual boot Ubuntu. Mac users, relax for a few days while the Windows folks get through this step.
  2. Create a 'hadoop' user
  3. Install Java
  4. Set up SSH so it doesn't require a password to log in to localhost

Dual Boot Ubuntu

You can make your laptop dual boot, or you can create an Ubuntu bootable flash drive.


Create a Hadoop user

Create a user, hadoop, and log in as that user. You can run Hadoop without this step, but having a unified username across everyone in the class will help when we put the machines together into an actual cluster.

If I could re-create this tutorial, I'd name the user 'hduser' so all the paths in the instructions would be easier to read. But, as a group we are too far down the 'hadoop' path to turn back now.

Install Java

We're all running Java 1.7.

For a Mac:

Type java -version in the terminal; you should see something like:

java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

Type jps in the terminal; you should see something like 5350 Jps. (The number doesn't matter.)

Sample jps output

Setup SSH

Set up SSH so it can log in to localhost without a password prompt. (Read Arun's notes for this.)

Test it by typing ssh localhost. It should give you a prompt without asking for a password. It might ask you to add the new address to your known hosts list; go ahead and do that.

You can check whether the SSH daemon is running by looking for it in the process list.

Sample SSH output
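For example (a generic check, not from Arun's notes; exact process names vary by OS):

ps aux | grep -i sshd           # the SSH daemon should show up in the process list
ssh localhost 'echo ssh works'  # should print "ssh works" with no password prompt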

Troubleshooting

SSH problems? Here are some things to try:
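(The workshop's original checklist isn't reproduced here. A few generic checks that usually cover it, assuming a standard OpenSSH setup:)

# Is the SSH server running at all? (Mac: System Preferences > Sharing > Remote Login)
ps aux | grep -i sshd

# Do you have a key pair? If not, generate one with an empty passphrase.
ls ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Is your public key authorized, and are the permissions strict enough for sshd?
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys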

Install Hadoop

  • Download Hadoop 1.2.1

You can download Hadoop from here. For this class, you want Hadoop 1.2.1.

Untar instructions, if you're new to tar.
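If you just want the commands, a typical untar looks like this (assuming you downloaded the tarball into the directory where you want Hadoop to live, e.g. ~/hadoop):

cd ~/hadoop                      # or wherever you keep your Hadoop installs
tar -xzvf hadoop-1.2.1.tar.gz    # creates a hadoop-1.2.1/ directory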

  • Edit hadoop-env.sh to define JAVA_HOME. Where is your Java home? For Mac OS
    • In Hadoop 1.2, this is in hadoop-1.2.1/conf
    • In Hadoop 2.6, this is in hadoop-2.6.0/etc/hadoop
    • Change the line export JAVA_HOME=${JAVA_HOME} to something like the following (note that the quotes around the java_home command are backticks, not single quotes, so the shell substitutes the command's output):
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
  • Also in Hadoop 2.6, add: export HADOOP_PREFIX=/Users/hadoop/hadoop/hadoop-2.6.0 (or whatever directory you installed your Hadoop into)

  • On the Apache Single Node Setup page, follow the instructions for the Pseudo-Distributed Operation.

In Hadoop 2.6, you can test your installation by running the hadoop script with no arguments, as the Apache Single Node Setup page suggests; it should print the script's usage text:
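$ cd hadoop-2.6.0
$ bin/hadoop            # prints the usage text if the install and JAVA_HOME are OK
$ bin/hadoop version    # another quick check; prints the Hadoop version and build info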

Start and configure Hadoop

Overview:

  1. Format the HDFS drive
  2. Start up Hadoop

Format the HDFS drive:

  • Hadoop 1.2: $ bin/hadoop namenode -format
  • Hadoop 2.6: $ bin/hdfs namenode -format

The head of the output should look something like this. Note that the host line shows no error. If it reports an UnknownHostException, you may need to edit your /etc/hosts file.

15/01/31 11:05:05 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Robin.local/10.31.1.87

The tail of the output should look something like this. Note the next-to-last line, "has been successfully formatted." That's a win.

Sample namenode format output

15/01/24 11:38:57 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
15/01/24 11:38:57 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Robin.local/10.31.1.134
************************************************************/

Start up Hadoop

Hadoop 1.2

Use: $ bin/start-all.sh

Sample start-all output

See if all of the daemons are running:

$ jps
13383 Jps
13929 NameNode
13583 SecondaryNameNode
13783 TaskTracker
13838 DataNode
13221 JobTracker

Go to the web consoles for namenode (http://localhost:50070) and jobtracker (http://localhost:50030).

Hadoop 2.6

Use: $ sbin/start-dfs.sh  # This starts only the namenode, secondarynamenode, and datanode

Sample start-dfs.sh output

See if all the daemons are running:

$ jps
13383 Jps
13929 NameNode
13583 SecondaryNameNode
13838 DataNode

Go to the web console for the namenode (http://localhost:50070).

Congratulations, you've got Hadoop running!

(Okay, if you're running Hadoop 2.6, then you have HDFS but not MapReduce/YARN working, but it's a start.)
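If you do want to experiment with YARN under 2.6, the distribution also ships start/stop scripts for it. This tutorial hasn't set up mapred-site.xml or yarn-site.xml yet, so treat this as an aside rather than a required step:

$ sbin/start-yarn.sh    # starts the ResourceManager and NodeManager daemons
$ sbin/stop-yarn.sh
# ResourceManager web console: http://localhost:8088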

You can stop Hadoop with:

  • Hadoop 1.2: $ bin/stop-all.sh
  • Hadoop 2.6: $ sbin/stop-dfs.sh

Troubleshooting

Okay, maybe you don't have Hadoop running.

If namenode isn't running, it might be because the Hadoop distributed file system hasn't been formatted. Stop the daemons ($ bin/stop-all.sh), then format the HDFS drive ($ bin/hadoop namenode -format). After that, you should be able to start Hadoop.

This video explains what the SecondaryNameNode is and isn't.

If a particular daemon isn't running, you can troubleshoot by reading the log file for that daemon. Log files are in the hadoop-1.2.1/logs directory.

  • DataNode error, Incompatible namespaceIDs: see the Stack Overflow discussion and the Sample DataNode namespaceID error output. The short answer is to delete the DataNode's data directory (it might not be the same directory on your machine). On my Mac, I removed everything in /tmp/hadoop-hadoop/dfs/data.

  • UnknownHostException: Try rebooting your machine. I'm not sure what files need to be cleared, but a reboot has fixed this for me a few times. Logging out and back in didn't do the trick.

STARTUP_MSG: host = java.net.UnknownHostException: Robin: Robin: nodename nor servname provided, or not known
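A longer-lasting fix is usually the /etc/hosts edit mentioned above. As a sketch, an entry like the following maps the hostname to the loopback address (Robin is the hostname from the error above; substitute your own, from the hostname command):

127.0.0.1    localhost Robin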

Run some Hadoop projects

Overview:

  1. Copy data from your local drive into the Hadoop cluster
  2. Run a Java program on the cluster
  3. Get the output from the cluster

Some useful commands (Hadoop 1.2)

`$ bin/hadoop namenode -format`
`$ bin/start-all.sh`
`$ bin/stop-all.sh`

`$ bin/hadoop dfs -ls [path]`
`$ bin/hadoop dfs -mkdir input`
`$ bin/hadoop dfs -put [src] [dst]`
`$ bin/hadoop dfs -get [file]`
`$ bin/hadoop dfs -rm [file]`
`$ bin/hadoop dfs -rmr [directory]`

Some useful commands (Hadoop 2.6)

`$ bin/hdfs namenode -format`
`$ sbin/start-dfs.sh`
`$ sbin/stop-dfs.sh`

`$ bin/hdfs dfs -ls [path]`
`$ bin/hdfs dfs -mkdir input`
`$ bin/hdfs dfs -put [src] [dst]`
`$ bin/hdfs dfs -get [file]`
`$ bin/hdfs dfs -rm [file]`
`$ bin/hdfs dfs -rm -r [directory]` (the old `-rmr` form still works in 2.6 but is deprecated)

1. Copy data into the Hadoop cluster

For Hadoop 2.6, you might first follow the test job instructions on the docs page. If you follow them, here are some Hadoop 2.6 first-run troubleshooting notes.

First, create a weather directory parallel to hadoop-1.2.1 for the weather project. Move the weather_data file and the files inside 'playground' into the new weather directory. This step just makes it clearer which files belong to the weather test project. I've changed the instructions in the rest of the examples to use the new paths.

This is the original setup:

hadoop-1.2.1/
    weather_data
    playground/
        classes/
        src/

This is the new setup:

hadoop-1.2.1/
weather/
    weather_data
    classes/
    src/

Next, change the working directory to the hadoop-1.2.1 folder. Now create a new folder in the cluster to hold the file you will upload:

  • Hadoop 1.2: $ bin/hadoop dfs -mkdir input
  • Hadoop 2.6: $ bin/hdfs dfs -mkdir /user, then $ bin/hdfs dfs -mkdir /user/hadoop, then $ bin/hdfs dfs -mkdir /user/hadoop/input (or see the one-step alternative below)
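If you'd rather not create each level by hand, the Hadoop 2.6 shell's -mkdir also accepts a -p flag that creates the parent directories in one step:

$ bin/hdfs dfs -mkdir -p /user/hadoop/input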

You can browse the namenode to see if the input directory has been created. You'll drill down through /user/hadoop/input (hadoop is your username).

Copy the data into the cluster:

  • Hadoop 1.2: bin/hadoop dfs -put ../weather/weather_data input
  • Hadoop 2.6: bin/hdfs dfs -put ../weather/weather_data /user/hadoop/input

This will take a while. Browse the namenode console again to make sure the file is there. It's about 228MB.
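You can also check from the command line instead of the web console, using the dfs -ls commands listed above:

  • Hadoop 1.2: $ bin/hadoop dfs -ls input
  • Hadoop 2.6: $ bin/hdfs dfs -ls /user/hadoop/input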

2. Run a program on the cluster

First, go into the weather project directory and compile the Java source code into class files, then package them into a jar file.

Hadoop 1.2:

cd ../weather
javac -classpath ../hadoop-1.2.1/hadoop-core-1.2.1.jar src/WeatherJob.java -d classes/
jar -cvf WeatherJob.jar -C classes/ .

Hadoop 2.6: not fixed yet. The commands below are still the 1.2 versions and won't work as-is (see the sketch after this block).

cd ../weather
javac -classpath ../hadoop-1.2.1/hadoop-core-1.2.1.jar src/WeatherJob.java -d classes/
jar -cvf WeatherJob.jar -C classes/ .
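Until that's fixed, here is a sketch of what the 2.6 compile would probably look like: hadoop-core-1.2.1.jar doesn't exist in the 2.x distribution, but bin/hadoop classpath prints the jars the compiler needs. This is untested, so treat it as a starting point rather than a recipe.

cd ../weather
# use the classpath reported by the 2.6 install itself
javac -classpath "$(../hadoop-2.6.0/bin/hadoop classpath)" src/WeatherJob.java -d classes/
jar -cvf WeatherJob.jar -C classes/ .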

Run the job:

cd ../hadoop-1.2.1
bin/hadoop jar ../weather/WeatherJob.jar WeatherJob input output

Sample MapReduce output

If you run the job again, you'll get an 'Output directory output already exists' error. You have to delete the output directory before it will work again:

bin/hadoop dfs -rmr output 
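(The Hadoop 2.6 equivalent is $ bin/hdfs dfs -rm -r output.)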

3. Get the output from the cluster

Copy all the output files to your local drive: bin/hadoop dfs -getmerge output/ ../weather/weatherJobOutput.txt

(If you've run this before, you may need to delete your weatherJobOutput.txt file before you run the above command. rm ../weather/weatherJobOutput.txt)

Now if you look in your weather directory you will see the weatherJobOutput.txt file.

Robin:hadoop-1.2.1 hadoop$ cd ../weather
Robin:weather hadoop$ ls -l
total 470672
-rw-r--r--  1 hadoop  staff       3232 Jan 24 15:38 WeatherJob.jar
drwxr-xr-x  5 hadoop  staff        170 Jan 10 12:22 classes
drwxrwxr-x@ 3 hadoop  staff        102 Oct 18 17:45 src
-rwxrwxrwx  1 hadoop  staff    1823073 Jan 24 15:46 weatherJobOutput.txt
-rw-rw-r--@ 1 hadoop  staff  239149574 Sep 25 18:17 weather_data

And if you look at the file (first 10 lines, using head), you'll see the results:

Robin:weather hadoop$ head weatherJobOutput.txt 
COUNTRY1 - Avg	49
COUNTRY1 - Min	0
COUNTRY1 - Max	99
COUNTRY10 - Avg	49
COUNTRY10 - Min	0
COUNTRY10 - Max	99
COUNTRY10ZIP1 - Avg	49
COUNTRY10ZIP1 - Min	0
COUNTRY10ZIP1 - Max	99
COUNTRY10ZIP10 - Avg	48

Build a Hadoop cluster (with friends)

All of the laptops must have the same username. Ubuntu running from a USB drive has the hostname 'ubuntu'. See http://askubuntu.com/questions/87665/how-do-i-change-the-hostname-without-a-restart for a script to change the hostname. This change is optional and will not persist on a USB installation; if you use it, run it from your hadoop account.

We're going to try Michael Noll's tutorial.

Build a network

We have wireless routers available to network the cluster. See the instructions on how to connect to and administer the routers [here](https://secure.sourcehosting.net/ChelmsfordSkillsSh/trac/attachment/wiki/DataAnalytics/Clustering%20Router%20Setup.rtf).

Step one is to add your cluster machines to your /etc/hosts file.

  • ifconfig to get your IP address.
  • hostname to get your host name.
  • sudo vi /etc/hosts. Here are the Feb 7 hosts.
  • ping -c 5 <hostname> or ping <ip address> to see if your computer can reach the others.

A Mac may be configured so that it can ping other machines but won't respond to pings itself. Turn off stealth mode: System Preferences > Security & Privacy > Firewall > Firewall Options > Stealth mode = Off.

Step two is to make sure the master can ssh into the slave machines without having to enter a password. Michael Noll's tutorial uses the ssh-copy-id command, but Mac OS (and maybe other systems) doesn't ship it. Here's an alternative that will add your public key to a remote machine.

cat ~/.ssh/id_rsa.pub | ssh user@machine "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
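If passwordless login still fails after that, sshd on the remote machine may be rejecting the key because the file permissions are too loose. Tightening them is a common fix (the exact requirement depends on the remote sshd configuration):

ssh user@machine "chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys"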

Configuration files

Script to clear out dfs and dfstemp directories

cleardfs.sh

#!/bin/bash
# JGB - script to clear out the dfs and dfstemp directories
# This script assumes both directories are in your home directory.

cd ~/dfs || exit 1   # bail out if the directory is missing, so rm -rf * can't run somewhere else
echo
echo "pwd = "
pwd
echo
echo "ls ="
ls
echo
echo "REMOVING ALL SUBDIRECTORIES AND FILES FROM: dfs!"
rm -rf *
echo
echo "ls"
ls

cd ~/dfstemp || exit 1   # same safety check as above
echo
echo "pwd = "
pwd
echo
echo "ls ="
ls
echo
echo "REMOVING ALL SUBDIRECTORIES AND FILES FROM: dfstemp!"
rm -rf *
echo
echo "ls ="
ls
echo
echo "done. . . .  you're welcome;-)"

Python and Hadoop

We're looking at several frameworks for this: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
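Several of those frameworks build on Hadoop Streaming, which lets any executable act as the mapper and reducer. A minimal sketch against the 1.2.1 install above (mapper.py and reducer.py are hypothetical executable scripts with shebang lines, not part of this project; the streaming jar path is the usual 1.x location, but check your install):

bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input -output streaming_output \
    -mapper mapper.py -reducer reducer.py \
    -file ../weather/mapper.py -file ../weather/reducer.py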

Other resources

We have run the [Hortonworks Sandbox](http://hortonworks.com/products/hortonworks-sandbox/) tutorials. Note: to install and run these tutorials you need a laptop with at least 8GB of RAM; otherwise they will run far too slowly.

Another tutorial: http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html
