How to Setup, Build, and Use Giraffa 0.0.1 (legacy) - GiraffaFS/giraffa GitHub Wiki
- Download and install Maven:
wget http://apache.mesi.com.ar/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
tar -zxvf apache-maven-3.0.4-bin.tar.gz
sudo mv apache-maven-3.0.4 /usr/local
sudo ln -s /usr/local/apache-maven-3.0.4/ /usr/local/maven
- Configure ~/.bashrc; make sure this file contains the following section:
export M2_HOME=/usr/local/maven
export M2=$M2_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m"
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$PATH
- Check that Maven is correctly set up:
mvn -version
Apache Maven 3.0.4 (r1232337; 2012-01-17 00:44:56-0800)
- For further instructions refer to "Installation Instructions" on http://maven.apache.org/download.cgi
- Using Git, clone our repository:
git clone https://code.google.com/a/apache-extras.org/p/giraffa/
- Check out trunk:
git checkout trunk
Giraffa uses Maven as its build tool. The main pom.xml file is located in the giraffa directory. Here's a list of build options:
- Build Giraffa and run all the tests:
mvn clean install
Note: by default, all test output is redirected to files under target/surefire-reports. If you want tests to output to the console, edit the pom.xml file and set redirectTestOutputToFile=false, or set it during your Maven command execution (e.g. `mvn clean install -Dmaven.test.redirectTestOutputToFile=false`; the exact user property name depends on your Surefire version).
- Build Giraffa without tests:
mvn clean install -DskipTests
- Build Giraffa Project site:
mvn clean site
When the build is complete, you can access the site at ${basedir}/target/site/index.html
- Build Giraffa Site with Clover report:
mvn -Pclover site
When the build is complete, you can access the site at ${basedir}/target/site/index.html.
Note: You will need to place your clover.license file in ${user.home}/.m2/clover.license.
WARNING! The Clover plugin instruments source files and should not be used for production!
In demo mode, Giraffa starts an embedded Hadoop MiniCluster, Hive, and the Web UI. You will be able to perform all supported operations through the Giraffa Web Interface:
mvn -Pwebdemo
- Navigate to http://localhost:40010.
- Type "stop" in maven console to stop the demo server.
- Copy hadoop-0.22.0 directory from unarchived download of Hadoop 0.22.0 to giraffa-standalone/target/giraffa-standalone/ directory and rename it to just "hadoop".
- Copy hbase-0.94.1 directory from unarchived download of HBase 0.94.1 to giraffa-standalone/target/giraffa-standalone/ directory and rename it to just "hbase".
- (The rest of the instructions assume the current directory is now giraffa-standalone/target/giraffa-standalone/)
- Remove hadoop-core-*.jar from hbase/lib and copy the hadoop/hadoop-*.jar files into hbase/lib.
- Copy giraffa/lib/giraffa-standalone-VERSION-SNAPSHOT.jar to hbase/lib.
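The copy/rename steps above can be sketched as a small shell function. This is an illustrative sketch, not a script shipped with Giraffa; the function name and arguments are assumptions, and the Hadoop/HBase directory names come from the steps above:

```shell
# Sketch of the standalone layout steps above.
# $1 = directory containing the unarchived Hadoop and HBase downloads
# $2 = the giraffa-standalone/target/giraffa-standalone directory
# $3 = the Giraffa snapshot version (e.g. 0.0.1)
layout_standalone() {
  downloads=$1; target=$2; version=$3
  cp -r "$downloads/hadoop-0.22.0" "$target/hadoop"
  cp -r "$downloads/hbase-0.94.1"  "$target/hbase"
  # Replace HBase's bundled Hadoop jars with this Hadoop build's jars.
  rm -f "$target"/hbase/lib/hadoop-core-*.jar
  cp "$target"/hadoop/hadoop-*.jar "$target/hbase/lib/"
  cp "$target/giraffa/lib/giraffa-standalone-$version-SNAPSHOT.jar" "$target/hbase/lib/"
}
```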
- In hbase/conf, create an empty hdfs-site.xml and core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
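Both empty files can be created in one go with a short shell loop (a convenience sketch; it writes the same XML skeleton shown above):

```shell
# Write empty Hadoop-style config files into hbase/conf,
# matching the XML skeleton shown above.
mkdir -p hbase/conf   # no-op if the directory already exists
for f in hbase/conf/hdfs-site.xml hbase/conf/core-site.xml; do
  cat > "$f" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
EOF
done
```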
- In hadoop/conf, modify hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- In hbase/conf, modify hbase-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.giraffa.web.GiraffaWebObserver</value>
  </property>
</configuration>
- Make sure environment variables HADOOP_HOME, HADOOP_COMMON_HOME, and HBASE_HOME are not set.
- Run the giraffa/bin/giraffa namenode -format command first, so the NameNode and DataNode start up properly. If this is a re-attempt, delete all your /tmp/hadoop and /tmp/hbase directories and files.
- Run the giraffa/bin/start-giraffa.sh command.
- Run the giraffa/bin/giraffa format command to format Giraffa.
- Use any giraffa/bin/giraffa fs -[op] command to create and access files in Giraffa, the same way the hadoop fs -[op] command is used to access HDFS data.
- (Optional) Run TestBlockManagement from Eclipse, which executes TestBlockManagement.main(). This will write and read file(s).
- Use giraffa/bin/stop-giraffa.sh to stop the Giraffa cluster.
- NOTES: This will set up a multi-node Giraffa cluster by configuring the HDFS servers (NameNode and DataNodes), HBase servers (Master and RegionServers), and Giraffa Clients. You must know the hostnames of the nodes hosting these components, although they do not necessarily have to be unique. For example, in the Standalone Cluster, every component is hosted on the same node and therefore has the same hostname. However, there are restrictions: every component must be on the same LAN, there may be only one NameNode on the cluster, and there may be only one DataNode, RegionServer, and Master on a single node. In the following instructions, replace NAMENODE with the hostname of the node hosting the NameNode.
- PREREQUISITES: Follow steps 1 through 9 in "How to run Giraffa Standalone Cluster (Single-Node)" for every HDFS server, HBase server, and Giraffa Client. Ensure that giraffa is installed at the same location on each server. The rest of the instructions assume the current directory on each node is giraffa-standalone/target/giraffa-standalone.
- CONFIGURATION: The instructions below specify configuration files and map property names to values. These should be added or changed inside the <configuration></configuration> block of the files using the format: <property><name>NAME</name><value>VALUE</value></property>
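Applying the NAME => VALUE mappings by hand is error-prone, so a small helper can do it. set_conf below is a hypothetical convenience, not part of Giraffa: it inserts a <property> block just before the closing </configuration> tag using GNU sed:

```shell
# Hypothetical helper (not part of Giraffa): append NAME/VALUE as a
# <property> block to a Hadoop-style config FILE, just before its
# closing </configuration> tag. Uses GNU sed's in-place editing.
set_conf() {
  file=$1; name=$2; value=$3
  sed -i "s|</configuration>|<property><name>$name</name><value>$value</value></property></configuration>|" "$file"
}

# Example usage (NAMENODE is the NameNode hostname placeholder):
#   set_conf hadoop/conf/core-site.xml fs.defaultFS hdfs://NAMENODE:9000
```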
- HDFS Configuration: On every NameNode and DataNode:
- hadoop/conf/core-site.xml: fs.defaultFS => hdfs://NAMENODE:9000
- hadoop/conf/hdfs-site.xml: fs.defaultFS => hdfs://NAMENODE:9000
- HDFS Configuration: On NameNode only:
- hadoop/conf/masters: (should contain just one line that says: NAMENODE)
- hadoop/conf/slaves: (list each DataNode hostname, one per line)
- HBase Configuration: On every Master and RegionServer:
- hbase/conf/hbase-site.xml:
- hbase.rootdir => hdfs://NAMENODE:9000/hbase
- hbase.cluster.distributed => true
- hbase.zookeeper.quorum => NAMENODE
- HBase Configuration: On Master only:
- hbase/conf/regionservers: (list each RegionServer hostname, one per line)
- Giraffa Configuration: On every Giraffa Client:
- giraffa/conf/core-site.xml:
- hbase.rootdir => hdfs://NAMENODE:9000/hbase
- hbase.coprocessor.master.classes => org.apache.giraffa.web.GiraffaWebObserver
- hbase.cluster.distributed => true
- hbase.zookeeper.quorum => NAMENODE
- STARTING:
- Start HDFS. Complete the following on the NameNode:
- Run giraffa/bin/giraffa namenode -format. If this is a re-attempt, delete the /tmp/hadoop and /tmp/hbase files first.
- Run hadoop/bin/start-dfs.sh
- Start HBase: Complete the following on the Master:
- Run hbase/bin/start-hbase.sh
- Format Giraffa: Complete the following on the NameNode:
- Run giraffa/bin/giraffa format
- Verify: To check that start-up has completed successfully, run jps on each HDFS and HBase server. The NameNode should have processes NameNode and SecondaryNameNode. The Master should have process HMaster. Each DataNode should have process DataNode. Each RegionServer should have process HRegionServer. The SecondaryNameNode process is not necessary for Giraffa and may be killed manually.
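The verification step above can be scripted. This sketch (check_daemons is a hypothetical helper, not shipped with Giraffa) reads jps output from stdin and reports any expected daemon name that is missing:

```shell
# Hypothetical helper: check jps output (read from stdin) for a list of
# expected daemon names. Prints any missing daemon and returns non-zero
# if at least one is absent.
check_daemons() {
  out=$(cat)
  missing=0
  for d in "$@"; do
    if ! printf '%s\n' "$out" | grep -qw "$d"; then
      echo "missing: $d"
      missing=1
    fi
  done
  return $missing
}

# Example, run on the NameNode host:
#   jps | check_daemons NameNode SecondaryNameNode
# Example, run on a RegionServer host:
#   jps | check_daemons DataNode HRegionServer
```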
- RUNNING: Complete the following on a Giraffa Client:
- Use any giraffa/bin/giraffa fs -[op] command to create and access files in Giraffa, the same way the hadoop fs -[op] command is used to access HDFS data.
- STOPPING:
- Run hbase/bin/stop-hbase.sh on the Master.
- Run hadoop/bin/stop-dfs.sh on the NameNode.
- WEB UI: Type HOSTNAME:PORT into the browser of any machine on the LAN to access the web UI of the following components (if this does not work, replace the hostname with the IP address, or alternatively, add the hostname/IP address pairs to your hosts file):
- NameNode: Port 50070
- DataNode: Port 50075
- Master: Port 60010
- RegionServer: Port 60030
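As a convenience, the port list above can be turned into a quick URL lister (a trivial sketch; the function name is an assumption, and the hostnames are whatever your cluster uses):

```shell
# Print the web UI URLs for the standard ports listed above, given one
# host per component: $1=NameNode $2=DataNode $3=Master $4=RegionServer.
giraffa_ui_urls() {
  echo "NameNode:     http://$1:50070"
  echo "DataNode:     http://$2:50075"
  echo "Master:       http://$3:60010"
  echo "RegionServer: http://$4:60030"
}
```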
YARN setup in Giraffa is identical to YARN setup in HDFS, with the exception that configuration files and executables are in a different location. In giraffa/conf, notice the following files:
yarn-env.sh
yarn-site.xml
mapred-env.sh
mapred-site.xml
Edit these files as you normally would. They have been pre-configured to run Tera jobs from the mapreduce examples jar. A couple of notes:
mapreduce.terasort.simplepartitioner is set to true. This is a configuration specific to the examples jar that ensures the distributed cache is not used. You should make sure that your jobs do not use the distributed cache as it requires currently unsupported features from Giraffa.
yarn.application.classpath is set to the default value, with the addition of $GIRAFFA_CLASSPATH. This ensures that Yarn jobs run with a class path that includes Giraffa. Do not remove $GIRAFFA_CLASSPATH from here.
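For reference, the two pre-configured settings described above would appear in the config files roughly as follows. This is a sketch, not a verbatim copy of the shipped files, and the classpath value is deliberately abbreviated with "...":

```
<!-- mapred-site.xml: keep the distributed cache out of the Tera jobs -->
<property>
  <name>mapreduce.terasort.simplepartitioner</name>
  <value>true</value>
</property>

<!-- yarn-site.xml: the default classpath plus Giraffa.
     Do not remove $GIRAFFA_CLASSPATH from the value. -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,...,$GIRAFFA_CLASSPATH</value>
</property>
```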
Also notice the following files:
capacity-scheduler.xml
configuration.xsl
container-executor.cfg
These files are identical to the ones normally found in the hadoop configuration directory. If it turns out you need any additional configuration files, drop them in this directory.
In giraffa/bin, notice the following files:
yarn-giraffa
yarn-giraffa-daemon.sh
These are the Giraffa equivalents of the yarn and yarn-daemon.sh scripts you normally use to start jobs. For example, to start the resource manager and node manager, run:
yarn-giraffa-daemon.sh start resourcemanager
yarn-giraffa-daemon.sh start nodemanager
Then, to run a teragen job from the examples jar, generating 10,000,000 rows in the directory "input":
yarn-giraffa jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar teragen 10000000 input
- Running the command giraffa/bin/start-giraffa.sh will create a single-node Giraffa cluster. It starts the NameNode and DataNode, and then HBase, which starts a RegionServer, Master, and ZooKeeper.
- Run hadoop/bin/hadoop-daemon.sh start namenode or hadoop/bin/hadoop-daemon.sh start datanode to manually start a NameNode or DataNode on a server. Likewise, run hbase/bin/hbase-daemon.sh start master or hbase/bin/hbase-daemon.sh start regionserver to manually start an HBase Master or RegionServer.