Getting Started - ravipesala/astro GitHub Wiki

## Building Astro

Astro is built using Apache Maven.

I. Clone the Huawei-Spark/Spark-SQL-on-HBase (Astro) repository

$ git clone https://github.com/HuaweiBigData/astro

II. Go to the root of the source tree

$ cd astro

III. Build the project

Build without testing:

$ mvn -DskipTests clean install 

Or, build with testing, which runs the test suites against an HBase minicluster:

$ mvn clean install

## Deployment

All Spark slave machines must be configured as HBase clients (and, implicitly, as ZooKeeper clients). It is preferable that the Spark and HBase clusters be co-located on the same set of physical or virtual machines, but this is not a strict requirement. Coprocessor- and custom-filter-related HBase configurations, along with the jars containing the corresponding Spark SQL logic, will be deployed to the HBase cluster. Specifically, the following four lines need to be added to hbase-site.xml:

<property>
    <name>hbase.coprocessor.user.region.classes</name>
    <value>org.apache.spark.sql.hbase.CheckDirEndPointImpl</value>
</property>

In the hbase-env.sh script, the Spark jar and the spark-hbase jar of this product need to be added to HBASE_CLASSPATH (see the sketch below). HBase 0.98 is the supported HBase version.
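
A minimal sketch of the HBASE_CLASSPATH addition in hbase-env.sh; the jar paths are placeholders and depend on where the Spark assembly jar and the Astro build output are located on each HBase node:

export HBASE_CLASSPATH=$HBASE_CLASSPATH:/path/to/spark-assembly.jar:/path/to/astro/target/spark-sql-on-hbase-1.0.0.jar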

## Configuration

HBase-related configuration is done through the Spark configuration, using the conventional “spark.sql.hbase” prefix. Currently, there are four supported configuration flags:

  • spark.sql.hbase.partition.expiration specifies the expiration time (in seconds) of the cached HBase table region information. The default is 600, i.e., 10 minutes.
  • spark.sql.hbase.scanner.fetchsize specifies the HBase scanner fetch size and defaults to 1000.
  • spark.sql.hbase.coprocessor is a Boolean flag to switch the use of coprocessors on or off.
  • spark.sql.hbase.customfilter is a Boolean flag to switch the use of custom filters on or off.

Explanations of the configuration flags are also included in their respective related sections.

Please refer to the Configuration guide for Spark-related configurations.
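
For example, a minimal sketch of setting these flags when launching spark-shell, in the same way as any other Spark configuration (the values shown are merely illustrative):

$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.hbase.coprocessor=true \
  --conf spark.sql.hbase.customfilter=true \
  --conf spark.sql.hbase.scanner.fetchsize=1000 \
  --conf spark.sql.hbase.partition.expiration=600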

## Quick Start

The easiest way to start using Astro is through the shell:

Interactive Scala Shell

In this shell, the Spark context as well as the HBaseSQLContext are already created. All commands issued in this shell must be SQL, apart from HELP and EXIT.

>./bin/hbase-sql
Welcome to hbaseql CLI
astro>show tables;
OK
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|Employee |      false|
+---------+-----------+

Time taken : 1.231 seconds
astro>help;
Usage: HELP Statement
      Statement:
         CREATE | DROP | ALTER | LOAD | SELECT | INSERT | DESCRIBE | SHOW
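
Any other SQL statement is issued the same way; for example, a couple of illustrative statements against the Employee table shown in the listing above (assuming such a table exists):

astro>describe Employee;
astro>select * from Employee;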

Python Shell

First, add the spark-hbase jar to the SPARK_CLASSPATH in the $SPARK_HOME/conf directory, as follows:

SPARK_CLASSPATH=$SPARK_CLASSPATH:/spark-hbase-root-dir/target/spark-sql-on-hbase-1.0.0.jar

Then go to the Astro installation directory and issue

./bin/pyspark-hbase

A successful startup message looks as follows:

You are using Spark SQL on HBase!!! HBaseSQLContext available as hsqlContext.
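
Once the shell is up, SQL can be issued through the pre-created hsqlContext. A minimal sketch, assuming the Employee table from the Scala shell example above exists:

df = hsqlContext.sql("SELECT * FROM Employee")   # hsqlContext is pre-created by the shell
df.show()                                        # print the query result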

To run a Python script, the PYTHONPATH environment variable should be set to the "python" directory of the Spark-HBase installation. For example:

export PYTHONPATH=/root-of-Spark-HBase/python

Note that the shell commands are not included in the zip file of the Spark release; in version 1.0.0 they are intended for developers' use only. Instead, users can use "$SPARK_HOME/bin/spark-shell --packages Huawei-Spark/Spark-SQL-on-HBase:1.0.0" for the SQL shell or "$SPARK_HOME/bin/pyspark --packages Huawei-Spark/Spark-SQL-on-HBase:1.0.0" for the Python shell.

Running Tests

Testing first requires building Spark HBase. Once Spark HBase is built, the test suites can be run with Maven as shown below.

Run all test suites from Maven:

mvn -Phbase,hadoop-2.4 test

Run a single test suite from Maven, for example:

mvn -Phbase,hadoop-2.4 test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite

IDE Setup

We use IntelliJ IDEA for Spark HBase development. You can get the community edition for free and install the JetBrains Scala plugin from Preferences > Plugins.

To import the current Spark HBase project for IntelliJ:

  1. Download IntelliJ and install the Scala plug-in for IntelliJ. You may also need to install the Maven plug-in for IntelliJ.
  2. Go to "File -> Import Project", locate the Spark HBase source directory, and select "Maven Project".
  3. In the Import Wizard, select "Import Maven projects automatically" and leave other settings at their default.
  4. Make sure the required profiles are enabled: select the corresponding Hadoop version, "maven3", and also "hbase" in order to get the dependencies.
  5. Leave other settings at their default and you should be able to start your development.
  6. When you run the Scala tests, you may sometimes get an out-of-memory exception. You can increase the VM memory with the following settings, for example:
-XX:MaxPermSize=512m -Xmx3072m

You can also make those settings the default by adding them under "Defaults -> ScalaTest".
