How to Set Up, Build, and Use Giraffa 0.0.1 (legacy)

How to set up the Giraffa build environment

  1. Download and install Maven:
wget http://apache.mesi.com.ar/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
tar -zxvf apache-maven-3.0.4-bin.tar.gz
sudo mv apache-maven-3.0.4 /usr/local
sudo ln -s /usr/local/apache-maven-3.0.4/ /usr/local/maven
  2. Configure ~/.bashrc; make sure that you have the following section in this file:
export M2_HOME=/usr/local/maven
export M2=$M2_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m"
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$PATH
  3. Check that Maven is set up correctly:
mvn -version
Apache Maven 3.0.4 (r1232337; 2012-01-17 00:44:56-0800)
  4. For further instructions, refer to "Installation Instructions" on http://maven.apache.org/download.cgi
  5. Using Git, clone our repository: git clone https://code.google.com/a/apache-extras.org/p/giraffa/
  6. Check out trunk: git checkout trunk
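
Putting the clone and build steps together, a first build might look like the following. This is a minimal sketch: it assumes the clone lands in a directory named giraffa and skips the tests for speed (the build options are described in the next section):

git clone https://code.google.com/a/apache-extras.org/p/giraffa/
cd giraffa
git checkout trunk
mvn clean install -DskipTests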

Giraffa build options

Giraffa uses Maven as its build tool. The main pom.xml file is located in the giraffa directory. Here is a list of the build options:

  • Build Giraffa and run all the tests:

    mvn clean install

    Note: by default, all test output is redirected to files under target/surefire-reports. If you want tests to print to the console instead, edit the pom.xml file and set redirectTestOutputToFile=false, or override it when invoking Maven (see the example after this list).

  • Build Giraffa without tests

    mvn clean install -DskipTests

  • Build Giraffa Project site:

    mvn clean site

    When the build is complete, you can access the site at ${basedir}/target/site/index.html.

  • Build Giraffa Site with Clover report:
    mvn -Pclover site

    When the build is complete, you can access the site at ${basedir}/target/site/index.html.
    Note: You will need to place your clover.license file at ${user.home}/.m2/clover.license
    WARNING! The Clover plugin instruments source files and should not be used for production!
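
For example, to run the full build with test output printed to the console, the Surefire setting can usually be overridden on the command line. The property name below is Surefire's standard user property for redirectTestOutputToFile; this is an assumption about how the project's pom.xml binds it, so adjust if the build ignores it:

    mvn clean install -Dmaven.test.redirectTestOutputToFile=false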

How to run Giraffa demo (Embedded MiniCluster)

In demo mode, Giraffa will start an embedded Hadoop MiniCluster, HBase, and the Web UI. You will be able to perform all supported operations through the Giraffa Web Interface:

mvn -Pwebdemo

How to run Giraffa Standalone (Single-Node Cluster)

  1. Copy the hadoop-0.22.0 directory from the unarchived Hadoop 0.22.0 download into the giraffa-standalone/target/giraffa-standalone/ directory and rename it to just "hadoop".
  2. Copy the hbase-0.94.1 directory from the unarchived HBase 0.94.1 download into the giraffa-standalone/target/giraffa-standalone/ directory and rename it to just "hbase".
  3. (The rest of the instructions assume the current directory is now giraffa-standalone/target/giraffa-standalone/)
  4. Remove hadoop-core-*.jar from hbase/lib and copy hadoop/hadoop-*.jar files into hbase/lib.
  5. Copy giraffa/lib/giraffa-standalone-VERSION-SNAPSHOT.jar to hbase/lib.
  6. In hbase/conf, create an empty hdfs-site.xml and core-site.xml:

     <?xml version="1.0"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     <configuration>
     </configuration>

  7. In hadoop/conf, modify hdfs-site.xml:

     <?xml version="1.0"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     <configuration>
       <property>
         <name>fs.defaultFS</name>
         <value>hdfs://localhost:9000</value>
       </property>
     </configuration>

  8. In hbase/conf, modify hbase-site.xml:

     <?xml version="1.0"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     <configuration>
       <property>
         <name>hbase.rootdir</name>
         <value>hdfs://localhost:9000/hbase</value>
       </property>
       <property>
         <name>hbase.coprocessor.master.classes</name>
         <value>org.apache.giraffa.web.GiraffaWebObserver</value>
       </property>
     </configuration>
  9. Make sure environment variables HADOOP_HOME, HADOOP_COMMON_HOME, and HBASE_HOME are not set.
  10. Run the giraffa/bin/giraffa namenode -format command first so the NameNode and DataNode start up properly. If this is a re-attempt, delete all of your /tmp/hadoop and /tmp/hbase directories and files first.
  11. Run the giraffa/bin/start-giraffa.sh command.
  12. Run the giraffa/bin/giraffa format command to format Giraffa.
  13. Run any giraffa/bin/giraffa fs -[op] command to create and access files in Giraffa, the same way the hadoop fs -[op] command is used to access HDFS data (a sample session follows this list).
  14. (Optional) Run TestBlockManagement from Eclipse, which executes TestBlockManagement.main(). This will write and read file(s).
  15. Use giraffa/bin/stop-giraffa.sh to stop the Giraffa cluster.
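
Putting steps 10 through 15 together, a first standalone session might look like the following sketch. The fs operations and the /user/demo path are just illustrations of the hadoop-fs-style commands, and the rm -rf line only applies to re-attempts:

# clean up any earlier attempt
rm -rf /tmp/hadoop* /tmp/hbase*

# format HDFS, start the single-node cluster, then format Giraffa
giraffa/bin/giraffa namenode -format
giraffa/bin/start-giraffa.sh
giraffa/bin/giraffa format

# exercise the file system the same way you would with hadoop fs
giraffa/bin/giraffa fs -mkdir /user/demo
giraffa/bin/giraffa fs -put README.txt /user/demo
giraffa/bin/giraffa fs -ls /user/demo

# shut the cluster down
giraffa/bin/stop-giraffa.sh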

How to run Giraffa on a Cluster

  1. NOTES: This will set up a multinode Giraffa cluster by configuring the HDFS servers (NameNode and DataNodes), HBase servers (Master and RegionServers), and Giraffa Clients. You must know the hostnames of the nodes hosting these components, although they do not necessarily have to be unique. For example, in the Standalone Cluster, every component is hosted on the same node and therefore has the same hostname. However, there are restrictions: every component must be on the same LAN, there may be only one NameNode on the cluster, and there may be only one DataNode, RegionServer, and Master on a single node. In the following instructions, replace NAMENODE with the hostname of the node hosting the NameNode.
  2. PREREQUISITES: Follow steps 1 through 9 in "How to run Giraffa Standalone Cluster (Single-Node)" for every HDFS server, HBase server, and Giraffa Client. Ensure that giraffa is installed at the same location on each server. The rest of the instructions assume the current directory on each node is giraffa-standalone/target/giraffa-standalone.
  3. CONFIGURATION: The instructions below specify configuration files and map property names to values. These should be added or changed inside the <configuration></configuration> block of the files using the format: <property><name>NAME</name><value>VALUE</value></property> (a worked hbase-site.xml example follows this list).
    1. HDFS Configuration: On every NameNode and DataNode:
      • hadoop/conf/core-site.xml: fs.defaultFS => hdfs://NAMENODE:9000
      • hadoop/conf/hdfs-site.xml: fs.defaultFS => hdfs://NAMENODE:9000
    2. HDFS Configuration: On NameNode only:
      • hadoop/conf/masters: (should contain just one line that says: NAMENODE)
      • hadoop/conf/slaves: (list each DataNode hostname, one per line)
    3. HBase Configuration: On every Master and RegionServer:
      • hbase/conf/hbase-site.xml:
        • hbase.rootdir => hdfs://NAMENODE:9000/hbase
        • hbase.cluster.distributed => true
        • hbase.zookeeper.quorum => NAMENODE
    4. HBase Configuration: On Master only:
      • hbase/conf/regionservers: (list each RegionServer hostname, one per line)
    5. Giraffa Configuration: On every Giraffa Client:
      • giraffa/conf/core-site.xml:
        • hbase.rootdir => hdfs://NAMENODE:9000/hbase
        • hbase.coprocessor.master.classes => org.apache.giraffa.web.GiraffaWebObserver
        • hbase.cluster.distributed => true
        • hbase.zookeeper.quorum => NAMENODE
  4. STARTING:
    1. Start HDFS. Complete the following on the NameNode:
      • Run giraffa/bin/giraffa namenode -format. If this is a re-attempt, delete /tmp/hadoop and /tmp/hbase files first.
      • Run hadoop/bin/start-dfs.sh
    2. Start HBase: Complete the following on the Master:
      • Run hbase/bin/start-hbase.sh
    3. Format Giraffa: Complete the following on the NameNode:
      • Run giraffa/bin/giraffa format
    4. Verify: To check that start-up has completed successfully, run jps on each HDFS and HBase server. The NameNode should have process NameNode and SecondaryNameNode. The Master should have process HMaster. Each DataNode should have process DataNode. Each RegionServer should have process HRegionServer. The SecondaryNameNode process is not necessary for Giraffa and may be killed manually.
  5. RUNNING: Complete the following on a Giraffa Client:
    1. Do any giraffa/bin/giraffa fs -[op] command to create and access files in Giraffa, the same way as the hadoop fs -[op] command is used to access HDFS data.
  6. STOPPING:
    1. Run hbase/bin/stop-hbase.sh on the Master
    2. Run hadoop/bin/stop-dfs.sh on the NameNode
  7. WEB UI: Type HOSTNAME:PORT into the browser of any machine on the LAN to access the web UI of the following components (if this does not work, replace the hostname with the IP address, or alternatively, add the hostname/ip address pairs to your hosts file):
    • NameNode: Port 50070
    • DataNode: Port 50075
    • Master: Port 60010
    • RegionServer: Port 60030
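
As a worked example of the configuration format from step 3, here is what hbase/conf/hbase-site.xml on a Master or RegionServer might look like after applying the values above. The hostname nn-host is hypothetical; substitute the actual NAMENODE hostname of your cluster:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://nn-host:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>nn-host</value>
  </property>
</configuration>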

YARN Setup

NOTE: YARN is not currently supported on any release; it is only supported on trunk.

YARN setup in Giraffa is identical to YARN setup in HDFS, with the exception that configuration files and executables are in a different location. In giraffa/conf, notice the following files:

yarn-env.sh
yarn-site.xml
mapred-env.sh
mapred-site.xml

Edit these files as you normally would. They have been pre-configured to run Tera jobs from the mapreduce examples jar. A couple of notes:

mapreduce.terasort.simplepartitioner is set to true. This is a configuration specific to the examples jar that ensures the distributed cache is not used. Make sure that your own jobs do not use the distributed cache either, as it requires features that Giraffa does not currently support.

yarn.application.classpath is set to the default value, with the addition of $GIRAFFA_CLASSPATH. This ensures that YARN jobs run with a classpath that includes Giraffa. Do not remove $GIRAFFA_CLASSPATH from here.
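
For reference, the yarn-site.xml entry is shaped roughly as follows. The exact list of default classpath entries varies by Hadoop version, so treat the value below as an illustration of where $GIRAFFA_CLASSPATH sits, not as the shipped configuration:

<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,
    $GIRAFFA_CLASSPATH
  </value>
</property>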

Also notice the following files:

capacity-scheduler.xml
configuration.xsl
container-executor.cfg

These files are identical to the ones normally found in the hadoop configuration directory. If it turns out you need any additional configuration files, drop them in this directory.

In giraffa/bin, notice the following files:

yarn-giraffa
yarn-giraffa-daemon.sh

These are the Giraffa equivalents of the yarn and yarn-daemon.sh scripts you normally use to run jobs and manage daemons. For example, to start the ResourceManager and NodeManager, run:

yarn-giraffa-daemon.sh start resourcemanager
yarn-giraffa-daemon.sh start nodemanager

Then, to run a teragen job from the examples jar, generating 10,000,000 rows in the directory "input":

yarn-giraffa jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar teragen 10000000 input
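
To continue the Tera pipeline, the same examples jar also provides terasort and teravalidate. The commands below assume yarn-giraffa-daemon.sh mirrors the stop semantics of the standard yarn-daemon.sh script:

# sort the generated rows and validate the sorted output
yarn-giraffa jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar terasort input output
yarn-giraffa jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar teravalidate output validate

# stop the daemons when finished
yarn-giraffa-daemon.sh stop nodemanager
yarn-giraffa-daemon.sh stop resourcemanager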

Misc. Notes

  • Running the command giraffa/bin/start-giraffa.sh will create a one-node Giraffa cluster. It starts the NameNode and DataNode, and then HBase, which starts a RegionServer, Master, and ZooKeeper.
  • Run hadoop/bin/hadoop-daemon.sh start namenode or hadoop/bin/hadoop-daemon.sh start datanode to manually start a NameNode or DataNode on a server. Likewise, run hbase/bin/hbase-daemon.sh start master or hbase/bin/hbase-daemon.sh start regionserver to manually start an HBase Master or RegionServer.