
Astro Documentation

## Overview

HBase's access mechanisms are primitive and available only through client-side APIs, Map/Reduce interfaces, and interactive shells. SQL access to HBase data is available either through Map/Reduce-based mechanisms such as Apache Hive and Impala, or through “native” SQL technologies like Apache Phoenix. While the former is usually cheaper to implement and use, its latency and efficiency often cannot compare favorably with the latter, so it is often suitable only for offline analysis. The latter category, in contrast, usually performs better and qualifies more as an online engine; such engines are often built on top of purpose-built execution engines.

Currently Spark supports queries against HBase data through HBase’s Map/Reduce interface (i.e., TableInputFormat). SparkSQL supports the use of Hive data, which theoretically should be able to support HBase data access, out of the box, through HBase’s Map/Reduce interface, and therefore falls into the first category of “SQL on HBase” technologies.
We believe that, as a unified big data processing engine, Spark is in a good position to provide better HBase support.

## SQL Support

Queries and data types will be the same as what SparkSQL supports. The differences will be in DDL and DML.

### DDL

Note that all DDL statements only affect the logical SQL table and not the physical tables.

#### CREATE TABLE

A CREATE TABLE statement is of the form:

CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col7, col1, col3)) 
MAPPED BY (hbase_tablename, COLS=[col2=cf1.cq11, col4=cf1.cq12, col5=cf2.cq21, col6=cf2.cq22])

A SQL table on HBase is basically a logical table mapped to a HBase table. This mapping can be many-to-one to support “schema-on-read” SQL access to HBase data.

* “hbase_tablename” denotes the HBase table that the SQL table is mapped to.
* The “primary key” constraint denotes which columns compose the HBase row key, and in what order.
* “col2=cf1.cq11” denotes the mapping of the second column to the column qualifier “cq11” of column family “cf1” in the HBase table.

Note: the table and the column families specified have to exist in HBase for the CREATE TABLE statement to succeed. In addition, the columns in the primary key cannot be mapped to a column family/column qualifier combo. Other normal SQL sanity checks, such as uniqueness of logical column names, will be applied as well.

Row Key Composition:
HBase row keys are composed from the primary-key columns encoded in big-endian byte order, for processing efficiency. Keys, or key components, of the STRING type are terminated with a NULL byte. A concrete example is sketched below.
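
To make the mapping concrete, here is a purely hypothetical example issued through an HBaseSQLContext (introduced in the Data Frame section below). The table, column, column family, and qualifier names are invented for this sketch, and the exact type keywords follow whatever SparkSQL types the build accepts:

// Hypothetical logical table "teacher" mapped onto an existing HBase table
// "teacher_hbase" that already has column families "cf1" and "cf2".
val hbaseContext = new HBaseSQLContext(sc)
hbaseContext.sql(
  """CREATE TABLE teacher (grade INTEGER, class INTEGER, subject STRING, name STRING, age INTEGER,
    |PRIMARY KEY (grade, class, subject))
    |MAPPED BY (teacher_hbase, COLS=[name=cf1.q_name, age=cf2.q_age])""".stripMargin)
// The row key of each HBase row is the big-endian bytes of grade, then of class,
// then the NULL-terminated bytes of subject, concatenated in primary-key order.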

#### DROP TABLE

A DROP TABLE statement is of the form:

DROP TABLE table_name

This will not delete the HBase table the SQL table maps to; it only deletes the SQL table and its schema.

#### ALTER TABLE

ALTER TABLE table_name DROP column

Drops an existing column from the SQL table.

ALTER TABLE table_name ADD col1 TYPE1 MAPPED BY (col1 = cf.cq)

Adds a new column mapped to the existing column family “cf” and column qualifier “cq”.

ALTER TABLE does not support addition or deletion of components of the composite row key.
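
As a hedged illustration, reusing the hypothetical “teacher” table and the hbaseContext from the CREATE TABLE sketch above (the new column and its mapping are invented):

// Add a column mapped to an existing column family/qualifier, then drop it again.
hbaseContext.sql("ALTER TABLE teacher ADD room STRING MAPPED BY (room = cf2.q_room)")
hbaseContext.sql("ALTER TABLE teacher DROP room")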

### DML

#### INSERT

The syntax remains the same as SchemaRDD’s. One constraint is that all columns in the HBase row key must be present for an insertion to succeed. Normal SQL sanity checks for INSERT, such as uniqueness of logical columns, will be applied. There are two types of inserts. The first has the following syntax:

INSERT INTO TABLE table_name  VALUES (col1_value, col2_value, …)

While the second has

INSERT INTO TABLE table1_name SELECT … FROM table2_name
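
For illustration, both forms might look like the following against the hypothetical “teacher” table from the sketches above; note that the row-key columns grade, class, and subject are all supplied, as required:

// Direct VALUES insert: every row-key column must be present.
hbaseContext.sql("INSERT INTO TABLE teacher VALUES (5, 2, 'math', 'Alice', 35)")

// Insert-select from another SQL table with a compatible schema
// ("teacher_staging" is invented for this sketch).
hbaseContext.sql("INSERT INTO TABLE teacher SELECT * FROM teacher_staging")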

#### Bulk Loading

LOAD DATA [PARALLEL] INPATH filePath [OVERWRITE] INTO TABLE tableName [FIELDS TERMINATED BY char]

Similar to the Hive command, this invokes a bulk load into an existing HBase table. The PARALLEL option merges the “incremental load” phase into the HFile generation phase; conceivably this performs better, particularly for non-presplit tables.
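
A hedged example against the hypothetical “teacher” table; the HDFS path, delimiter, and quoting style are assumptions made for this sketch:

// Bulk-load a comma-delimited file into the existing table; PARALLEL merges the
// incremental-load phase into HFile generation.
hbaseContext.sql(
  "LOAD DATA PARALLEL INPATH 'hdfs:///tmp/teacher.csv' INTO TABLE teacher FIELDS TERMINATED BY ','")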

### Miscellaneous Commands

* SHOW TABLE lists the tables inside Astro.
* DESCRIBE tablename gives a description of the table (see the sketch below).
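
For instance, with the hypothetical “teacher” table from the sketches above:

// Print Astro's description of the logical table and its HBase mapping.
hbaseContext.sql("DESCRIBE teacher").collect.foreach(println)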

## Data Frame

Data Frame functionality is supported in Astro. An example is as follows:

val hbaseContext = new HBaseSQLContext(sc)
hbaseContext.read.format("org.apache.spark.sql.hbase.HBaseSource").options(
    Map("namespace" -> "", "tableName" -> "people", "hbaseTableName" -> "people_table",
      "colsSeq" -> "name,age,id,address",
      "keyCols" -> "id,integer",
      "nonKeyCols" ->
        "name,string,cf1,cq_name;age,integer,cf1,cq_age;address,string,cf2,cq_address")).load

hbaseContext.sql("Select name, id, address from people").collect.foreach(println)

There are a few potentially subtle differences worth noting. In particular, the “registerAsTable” method will not “register” an HBase-based table but a usual SparkSQL table. On the other hand, the “insertInto” and “saveAsTable” methods will insert data into an existing SQL table on top of a HBase table.
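
As a sketch of the write path just described, the following appends an invented DataFrame to the “people” table defined above; the column order follows the colsSeq of that mapping:

// Build a small DataFrame matching the logical schema of "people"
// (name, age, id, address) and append it through insertInto.
import hbaseContext.implicits._

val newPeople = Seq(
  ("Alice", 30, 100, "1 Main St"),
  ("Bob", 41, 101, "2 Oak Ave")
).toDF("name", "age", "id", "address")

newPeople.write.insertInto("people")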

## Activate Coprocessor and Custom Filter in HBase

First, add the path of the spark-hbase jar to hbase-env.sh in the $HBASE_HOME/conf directory, as follows:

HBASE_CLASSPATH=$HBASE_CLASSPATH:/spark-hbase-root-dir/target/spark-sql-on-hbase-1.0.0.jar

Then, register the coprocessor service 'CheckDirEndPoint' in hbase-site.xml in the same directory, as follows:

<property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.spark.sql.hbase.CheckDirEndPointImpl</value>
</property>

(Warning: Don't register another coprocessor service 'SparkSqlRegionObserver' here !)

## Metadata Persistence

Table metadata is stored in a HBase table named “SPARK_SQL_HBASE_TABLE”, with a single column family named “CF”. Each SQL table uses a single row in that HBase table, and each column stores the name and type encoding of one column of the SQL table.
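
As a rough, hedged illustration only (it assumes an HBase 1.x client on the classpath, and no layout details beyond those described above), the catalog table can be scanned directly with the plain HBase client API:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

// Scan the catalog table: one row per SQL table, one cell per SQL column
// (holding its name/type encoding), all under the "CF" column family.
val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val catalog = conn.getTable(TableName.valueOf("SPARK_SQL_HBASE_TABLE"))
val scanner = catalog.getScanner(new Scan())
scanner.asScala.foreach { result =>
  println("SQL table: " + Bytes.toString(result.getRow))
  result.rawCells().foreach { cell =>
    println("  column qualifier: " + Bytes.toString(CellUtil.cloneQualifier(cell)))
  }
}
scanner.close()
catalog.close()
conn.close()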

## FAQs

  1. **Q:** Is there any plan to support multiple scan threads per query per region?
     **A:** Multiple scan threads could boost scan parallelism. A similar effect could be achieved through smaller regions. A possible downside of parallel scans is that they could result in random I/O. Nevertheless, the feature could be added readily as required.
  2. **Q:** Are there other possible optimization techniques for organized data sets like those on HBase that are not realized in the current release?
     **A:** Yes. There could actually be a number of such opportunities. We will, again, adopt them in a phased approach, and would like to hear demands and feedback from users.