Learning Hadoop
I attended a terrific workshop by Arun Buduri called Full-day Hadoop MapReduce Hands-On; these notes come from there. He provided notes on setting up your laptop and operating Hadoop.
These instructions are an expansion of the class and of Arun's notes, plus many hours of troubleshooting.
- Set up your laptop
- Install Hadoop
- Start and configure Hadoop
- Run some Hadoop projects
- Build a Hadoop cluster (with friends)
- Configuration files
- Python and Hadoop
- Other Resources
Overview:
- Windows users, set up for dual boot Ubuntu. Mac users, relax for a few days while the Windows folks get through this step.
- Create a 'hadoop' user
- Install Java
- Set up SSH so it doesn't require a password to log in to localhost
You can make your laptop dual boot, or you can create an Ubuntu bootable flash drive.
Basically:
- Get the .iso image of Ubuntu here. A USB 3 port and a fast USB 3 drive are recommended; otherwise the bootable USB setup will be too slow. Also be sure to set a large persistent storage area when creating a bootable USB drive.
- Persistent storage can be extended beyond the 4GB FAT32 limit. Here is a link that shows how to initialize your USB stick with maximum persistent storage. If you already have only 4GB of persistence, you can preserve your current work and extend your storage to use the entire USB stick with this process.
- How to make a bootable USB from an ISO, on Mac OS X
- How to boot Ubuntu from a flash drive
Create a user, hadoop, and log in as that user. You can run Hadoop without this step, but having a unified username across everyone in the class will help when we put the machines together into an actual cluster.
If I could re-create this tutorial, I'd name the user 'hduser' so all the paths in the instructions would be easier to read. But, as a group we are too far down the 'hadoop' path to turn back now.
We're all running Java 1.7.
- If you have OpenJDK installed already, remove it (see http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html#install)
- Oracle Java 7 can be downloaded at http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
- Remember the location you installed Java (save the path in a text file if needed)
Type `java -version` in the terminal; you should see something like:
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
Type `jps` in the terminal; you should see something like `5350 Jps`. (The number doesn't matter.)
Set up SSH so it can log in to localhost without a password prompt. (Read his notes for this.)
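If you don't have his notes handy, the usual passwordless-localhost setup looks something like this (a sketch, assuming you don't already have a key at ~/.ssh/id_rsa):
# Generate a key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize that key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys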
Test it by typing `ssh localhost`. It should give you a prompt without asking for a password. It might ask you to add a new address to your known hosts list; go ahead and do that.
You can also see whether the SSH server is running by checking the process list.
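For example (process names vary a bit by OS; the bracket in the grep pattern just keeps grep itself out of the results):
ps aux | grep [s]shd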
SSH problems? Here are some things to try:
- If installing ssh failed, try these commands to clean up the package index:
sudo apt-get clean && sudo apt-get update
sudo apt-get upgrade
- Trouble with SSH? Try this on cyberciti.biz
- SSH connection refused? Try these troubleshooting steps. If you just did the steps in the instructions, reboot your machine before you try to validate ssh is working.
- SSH server not working? http://www.linuxquestions.org/questions/linux-newbie-8/how-to-install-ssh-server-on-ubuntu-789321-print/
- "The authenticity of host localhost can't be established"? That's the normal prompt for a first connection; answer yes to add localhost to your known hosts.
- Mac user getting "Unable to load realm info from SCDynamicStore"? Try this on StackOverflow.com
- Download Hadoop 1.2.1
You can download Hadoop from here. For this class, you want Hadoop 1.2.1.
Untar instructions, if you're new to tar.
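For example, assuming the tarball is sitting in the directory where you want Hadoop to live (the file name depends on the version you downloaded):
tar -xzf hadoop-1.2.1.tar.gz
# or, for Hadoop 2.6:
tar -xzf hadoop-2.6.0.tar.gz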
- Edit hadoop-env.sh to define JAVA_HOME. Where is your Java home? On Mac OS, `/usr/libexec/java_home -v 1.7` will print it.
- In Hadoop 1.2, this is in hadoop-1.2.1/conf
- In Hadoop 2.6, this is in hadoop-2.6.0/etc/hadoop
- Change the line 'export JAVA_HOME=${JAVA_HOME}' to something like (note that where you see the quotes, they should be backticks):
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
- Also in Hadoop 2.6, add:
export HADOOP_PREFIX=/Users/hadoop/hadoop/hadoop-2.6.0
(or whatever directory you installed Hadoop in)
On the Apache Single Node Setup page, follow the instructions for the Pseudo-Distributed Operation.
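If you want a concrete target for those instructions, the pseudo-distributed setup boils down to two small config files. A sketch of the Hadoop 2.6 layout, run from the hadoop-2.6.0 directory (the values are the ones the Apache page uses; in Hadoop 1.2 the files live in conf/ and the property is fs.default.name instead of fs.defaultFS):
# Point the default file system at a local HDFS namenode
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
# Single machine, so keep only one copy of each block
cat > etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF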
In Hadoop 2.6, you can test your installation with:
- `$ bin/hadoop` prints usage information.
- `$ bin/hadoop version` prints the version. If it fails, check your HADOOP_PREFIX setting.
- `$ bin/hadoop checknative -a` tells you whether Hadoop is using the native library. I'm not sure how to fix a missing native library yet; see http://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-error-on-centos
Overview:
- Hadoop 1.2:
$ bin/hadoop namenode -format
- Hadoop 2.6:
$ bin/hdfs namenode -format
The head of the output should look something like this. Note that the host line shows no error. If you get an UnknownHostException here, you may need to edit your /etc/hosts file (see the troubleshooting notes further down).
15/01/31 11:05:05 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Robin.local/10.31.1.87
The tail of the output should look something like this. Note the next to the last line, "has been successfully formatted." That's a win.
15/01/24 11:38:57 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
15/01/24 11:38:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Robin.local/10.31.1.134
************************************************************/
Hadoop 1.2: use $ bin/start-all.sh
Sample start-all output
See if all of the daemons are running:
$ jps
13383 Jps
13929 NameNode
13583 SecondaryNameNode
13783 TaskTracker
13838 DataNode
13221 JobTracker
Go to the web consoles for namenode (http://localhost:50070) and jobtracker (http://localhost:50030).
Hadoop 2.6: use $ sbin/start-dfs.sh
# This starts only namenode, secondarynamenode, and datanode
Sample start-dfs.sh output
See if all the daemons are running:
$ jps
13383 Jps
13929 NameNode
13583 SecondaryNameNode
13838 DataNode
Go to the web console for the namenode (http://localhost:50070).
(Okay, if you're running Hadoop 2.6, then you have HDFS but not MapReduce/YARN working, but it's a start.)
You can stop Hadoop with:
- Hadoop 1.2:
$ bin/stop-all.sh
- Hadoop 2.6:
$ sbin/stop-dfs.sh
Okay, maybe you don't have Hadoop running.
If the namenode isn't running, it might be because the Hadoop distributed file system hasn't been formatted. Stop the daemons (`$ bin/stop-all.sh`), then format the file system (`$ bin/hadoop namenode -format`). After that, you should be able to start Hadoop.
This video explains what the SecondaryNameNode is and isn't.
If a particular daemon isn't running, you can troubleshoot by reading the log file for that daemon. Log files are in the hadoop-1.2.1/logs directory.
- DataNode error, Incompatible namespaceIDs: Stack Overflow discussion. Sample DataNode namespaceID error output. The short answer (the path might differ on your machine) is to delete the data directory on the DataNode. On my Mac, I removed everything in /tmp/hadoop-hadoop/dfs/data.
- UnknownHostException: Try rebooting your machine. I'm not sure what files need to be cleared, but a reboot has fixed this for me a few times; logging out and back in didn't do the trick. (A sketch of an /etc/hosts fix follows this list.) The error looks like:
STARTUP_MSG: host = java.net.UnknownHostException: Robin: Robin: nodename nor servname provided, or not known
- Initialization failed for Block pool: See this StackOverflow post: http://stackoverflow.com/questions/16020334/hadoop-datanode-process-killed
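For the UnknownHostException above, another fix people use is to map the machine's hostname to the loopback address in /etc/hosts (a sketch; 'Robin' is just the hostname from the sample error, so substitute whatever `hostname` prints on your machine):
# Append a loopback entry for your hostname
echo "127.0.0.1   Robin Robin.local" | sudo tee -a /etc/hosts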
Overview:
- Copy data from your local drive into the Hadoop cluster
- Run a Java program on the cluster
- Get the output from the cluster
Hadoop 1.2:
`$ bin/hadoop namenode -format`
`$ bin/start-all.sh`
`$ bin/stop-all.sh`
`$ bin/hadoop dfs -ls [path]`
`$ bin/hadoop dfs -mkdir input`
`$ bin/hadoop dfs -put [src] [dst]`
`$ bin/hadoop dfs -get [file]`
`$ bin/hadoop dfs -rm [file]`
`$ bin/hadoop dfs -rmr [directory]`
Hadoop 2.6:
`$ bin/hdfs namenode -format`
`$ sbin/start-dfs.sh`
`$ sbin/stop-dfs.sh`
`$ bin/hdfs dfs -ls [path]`
`$ bin/hdfs dfs -mkdir input`
`$ bin/hdfs dfs -put [src] [dst]`
`$ bin/hdfs dfs -get [file]`
`$ bin/hdfs dfs -rm [file]`
`$ bin/hdfs dfs -rm -r [directory]` (`-rmr` is deprecated in Hadoop 2)
For Hadoop 2.6, you might first follow the test job instructions on the docs page. If you follow them, here are some Hadoop 2.6 1st run troubleshooting notes.
First, create a weather directory parallel to hadoop-1.2.1 to hold the weather project. Move the weather_data file and the files inside 'playground' into the new weather directory (a command sketch follows the listings below). This step just makes it a little clearer which files belong to the weather test project. I've changed the instructions in the rest of the examples to use the new paths.
This is the original setup:
hadoop-1.2.1/
weather_data
playground/
    classes/
    src/
This is the new setup:
hadoop-1.2.1/
weather/
    weather_data
    classes/
    src/
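Roughly, the reshuffle looks like this, run from the directory that holds hadoop-1.2.1 (a sketch; adjust the paths to wherever weather_data and playground actually live on your machine):
mkdir weather
mv weather_data weather/
mv playground/classes playground/src weather/
rmdir playground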
Next, change the working directory to the hadoop-1.2.1 folder. Now create a new folder in the cluster to hold the file you will upload:
- Hadoop 1.2:
$ bin/hadoop dfs -mkdir input
- Hadoop 2.6:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hadoop
$ bin/hdfs dfs -mkdir /user/hadoop/input
You can browse the namenode to see if the input directory has been created. You'll drill down through /user/hadoop/input (hadoop is your username).
Copy the data into the cluster:
- Hadoop 1.2:
bin/hadoop dfs -put ../weather/weather_data input
- Hadoop 2.6:
bin/hdfs dfs -put ../weather/weather_data /user/hadoop/input
This will take a while. Browse using namenode again to make sure the file is there. It's about 228MB.
First, go into the weather project directory and compile the Java source code into the classes and then a jar file.
Hadoop 1.2:
cd ../weather
javac -classpath ../hadoop-1.2.1/hadoop-core-1.2.1.jar src/WeatherJob.java -d classes/
jar -cvf WeatherJob.jar -C classes/ .
Hadoop 2.6: not fixed yet. (The Hadoop 1.2 commands above won't work as-is, because the 2.6 distribution doesn't include hadoop-core-1.2.1.jar.)
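One approach that might work for 2.6 (untested here, so treat it as a sketch): the 2.6 distribution splits the old hadoop-core jar into several jars, but `bin/hadoop classpath` prints them all, so you can hand that list to javac.
cd ../weather
# Use the classpath the hadoop command reports instead of hadoop-core-1.2.1.jar
javac -classpath "$(../hadoop-2.6.0/bin/hadoop classpath)" src/WeatherJob.java -d classes/
jar -cvf WeatherJob.jar -C classes/ .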
Run the job:
cd ../hadoop-1.2.1
bin/hadoop jar ../weather/WeatherJob.jar WeatherJob input output
If you run the job again, you'll get an 'Output directory output already exists' error. You have to delete the output directory before it will work again:
bin/hadoop dfs -rmr output
Copy all the output files to your local drive:
bin/hadoop dfs -getmerge output/ ../weather/weatherJobOutput.txt
(If you've run this before, you may need to delete your old output file first: `rm ../weather/weatherJobOutput.txt`)
Now if you look in your weather directory you will see the weatherJobOutput.txt file.
Robin:hadoop-1.2.1 hadoop$ cd ../weather
Robin:weather hadoop$ ls -l
total 470672
-rw-r--r-- 1 hadoop staff 3232 Jan 24 15:38 WeatherJob.jar
drwxr-xr-x 5 hadoop staff 170 Jan 10 12:22 classes
drwxrwxr-x@ 3 hadoop staff 102 Oct 18 17:45 src
-rwxrwxrwx 1 hadoop staff 1823073 Jan 24 15:46 weatherJobOutput.txt
-rw-rw-r--@ 1 hadoop staff 239149574 Sep 25 18:17 weather_data
And if you look at the file (first 10 lines, using head), you'll see the results:
Robin:weather hadoop$ head weatherJobOutput.txt
COUNTRY1 - Avg 49
COUNTRY1 - Min 0
COUNTRY1 - Max 99
COUNTRY10 - Avg 49
COUNTRY10 - Min 0
COUNTRY10 - Max 99
COUNTRY10ZIP1 - Avg 49
COUNTRY10ZIP1 - Min 0
COUNTRY10ZIP1 - Max 99
COUNTRY10ZIP10 - Avg 48
All of the laptops must have the same username. Ubuntu running from a USB stick has the hostname 'ubuntu'. See http://askubuntu.com/questions/87665/how-do-i-change-the-hostname-without-a-restart for a script to change the hostname. This change is optional, and it will not persist on a USB installation. If used, the script must be run FROM YOUR hadoop account.
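For reference, a minimal non-persistent version of that change looks something like this ('node1' is a made-up name; the askubuntu script above is more thorough):
# Temporary hostname change; it's lost on reboot, which is fine on a live USB
sudo hostname node1
# Keep /etc/hosts in sync so the new name still resolves locally
sudo sed -i 's/ubuntu/node1/g' /etc/hosts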
We're going to try Michael Noll's tutorial.
We have wireless routers available to network the cluster. See the instructions on how to connect to and administer the routers [here](https://secure.sourcehosting.net/ChelmsfordSkillsSh/trac/attachment/wiki/DataAnalytics/Clustering%20Router%20Setup.rtf).
Step one is to add your cluster machines to your /etc/hosts file.
- `ifconfig` to get your IP address.
- `hostname` to get your host name.
- `sudo vi /etc/hosts`. Here are the Feb 7 hosts.
- `ping -c 5 <hostname>` or `ping <ip address>` to see if your computer can reach the others.
On the Mac, the firewall may be configured so that it can ping other machines but won't respond to pings. Turn off stealth mode: System Prefs > Security & Privacy > Firewall > Firewall Options > Stealth Mode = Off.
Step two is to make sure the master can ssh into the slave machines without having to enter a password. Michael Noll's tutorial uses the ssh-copy-id command, but Mac OS (and maybe others) don't have that. Here's an alternative that will add your public key to a remote machine.
cat ~/.ssh/id_rsa.pub | ssh user@machine "mkdir ~/.ssh; cat >> ~/.ssh/authorized_keys"
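Once passwordless SSH works between the machines, Michael Noll's tutorial has the master declare the cluster layout in Hadoop 1.2's conf files. A sketch, assuming hypothetical hostnames 'master' and 'slave1' from your /etc/hosts:
# Run from the hadoop-1.2.1 directory on the master
echo "master" > conf/masters               # host that runs the SecondaryNameNode
printf "master\nslave1\n" > conf/slaves    # hosts that run DataNode/TaskTracker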
#!/bin/bash
# JGB - script to clear out the dfs and dfstemp directories
# This script assumes both directories are in your home directory.
cd ~/dfs || exit 1   # bail out if the directory is missing, so rm -rf can't run in the wrong place
echo
echo "pwd = "
pwd
echo
echo "ls ="
ls
echo
echo "REMOVING ALL SUBDIRECTORIES AND FILES FROM: dfs!"
rm -rf *
echo
echo "ls"
ls
cd ~/dfstemp || exit 1   # same guard as above
echo
echo "pwd = "
pwd
echo
echo "ls ="
ls
echo
echo "REMOVING ALL SUBDIRECTORIES AND FILES FROM: dfstemp!"
rm -rf *
echo
echo "ls ="
ls
echo
echo "done. . . . you're welcome;-)"
We're looking at several frameworks for this: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
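Most of those frameworks build on Hadoop Streaming, which runs any executable as the mapper and reducer. A bare-bones sketch with hypothetical mapper.py and reducer.py scripts (Hadoop 1.2 paths; the scripts read lines from stdin and write tab-separated key/value pairs to stdout):
# Run from the hadoop-1.2.1 directory
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input input \
  -output streaming_output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file ../weather/mapper.py \
  -file ../weather/reducer.py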
We have run the [Hortonworks Sandbox](http://hortonworks.com/products/hortonworks-sandbox/) tutorials. Note: in order to install and run these tutorials you need a laptop with at least 8GB of RAM; otherwise they will run far too slowly.
Another tutorial: http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html