
Hive integration

This page describes the installation steps to integrate Hive as a data warehouse service for Aleph2, assuming use of Ambari.

Aleph2 (ikanow-aleph2) 2.6.1606108+ must be installed.
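To check the installed version, assuming an RPM-based install (this is a generic RPM query, not an Aleph2-specific command):

# shows the installed ikanow-aleph2 package version
rpm -q ikanow-aleph2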

  1. Create a directory /usr/hdp/current/lib/

  2. Download v2.2.0 of the elasticsearch-hadoop install ZIP from the Elastic downloads site

  3. Unzip the ZIP and copy elasticsearch-hadoop-2.2.0.jar into /usr/hdp/current/lib/
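    As a shell sketch of steps 1-3 (the download URL is an assumption - adjust to wherever you obtained the ZIP; the dist/ path inside the unzipped folder may also differ):

    # sketch of steps 1-3; the download URL and dist/ path are assumptions
    mkdir -p /usr/hdp/current/lib/
    wget https://download.elastic.co/hadoop/elasticsearch-hadoop-2.2.0.zip
    unzip elasticsearch-hadoop-2.2.0.zip
    cp elasticsearch-hadoop-2.2.0/dist/elasticsearch-hadoop-2.2.0.jar /usr/hdp/current/lib/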

  4. In Ambari, under the Hive service, Configs tab, Advanced sub-tab, find the "hive-env template" setting (under "Advanced hive-env"), and add the following line before export METASTORE_PORT={{hive_metastore_port}}:

    export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH:/usr/hdp/current/lib/elasticsearch-hadoop-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar"
    
  5. Restart the Hive services as prompted

  6. Install Hive clients on all Aleph2 nodes that run the ikanow-aleph2 daemon

On each of the Aleph2 nodes that run the ikanow-aleph2 daemon (eg the API nodes), create a symlink in /opt/aleph2-home/yarn-config pointing to /usr/hdp/current/hive-client/conf/hive-site.xml, eg

cd /opt/aleph2-home/yarn-config; ln -sf /usr/hdp/current/hive-client/conf/hive-site.xml
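(Note that ln -sf with a single argument creates the link in the current directory under the target's basename, so this produces /opt/aleph2-home/yarn-config/hive-site.xml.) To sanity-check the link on each node:

# the symlink should resolve to the Hive client config
ls -l /opt/aleph2-home/yarn-config/hive-site.xml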

Bucket-level integration

This is described here (basic) and here (advanced).

ZoomData/Data Warehouse integration

This section describes how to integrate the ZoomData BI tool into Aleph2.

ZoomData/Hive

Create a home folder in HDFS for the user anonymous:

runuser hdfs -s /bin/sh -c "hadoop fs -mkdir /user/anonymous"
runuser hdfs -s /bin/sh -c "hadoop fs -chown anonymous /user/anonymous"
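To confirm the directory and its ownership, a generic HDFS listing run the same way:

# /user/anonymous should now appear, owned by anonymous
runuser hdfs -s /bin/sh -c "hadoop fs -ls /user"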

In the UI, create a "SPARKSQL" source ("HIVE_ON_TEZ" also works, though it is disabled by default and needs to be enabled from the admin console - it is equivalent in functionality to the "SPARKSQL" source)

In the source configuration, the JDBC URL should look like: jdbc:hive2://<HIVE2-SERVER-NAME>:10000

where HIVE2-SERVER-NAME can be found in Ambari: under the Hive service, click on the "HiveServer2" link.
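For example, with a hypothetical HiveServer2 host of hive1.example.internal, the URL would be:

jdbc:hive2://hive1.example.internal:10000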

User name and password should be left blank.

The available collections will be listed in the "usual" Aleph2 signature format, eg the bucket /bucket/test/batch_flume_test generates a unique table signature.

NOTE: Currently it is not possible to handle nested fields using the UI. There is a workaround, which is to select "Custom SQL" and type something like

SELECT address.city,address.country,email.emailid FROM <BUCKET SIGNATURE>

ZoomData are considering fixing this (also an issue for SparkSQL) but there is no ETA.

NOTE: I initially had some issues with the Hive "TIMESTAMP" field - I think ZoomData cached an empty field, and I had to log out and back in and rename the field before it picked the timestamp up correctly.

ZoomData/Elasticsearch

Where possible, the Elasticsearch connector should be used, as it is much faster. However, there are currently some significant limitations.

Configuration is similar to above except the "ELASTICSEARCH" source is used.

Note the following bugs/limitations:

  • Objects with nested objects seem to cause problems with 2.0.19 - this has been raised with ZoomData support.
  • It is currently not possible to select aliases, which means that timesliced buckets and "ping pong" buckets (analytic_thread.jobs.output.preserve_existing_data: false) cannot be well supported (you can handle timesliced buckets at a single point in time, and you can create 2 sources for the 2 ping/pong buffers and then switch between them manually - see the sketch below).
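For the manual ping/pong workaround above, one way to check which buffer an alias currently points at is to query Elasticsearch's cat API directly (this is standard Elasticsearch, not an Aleph2 feature; substitute a real Elasticsearch node name for <ES-NODE>):

# lists each alias and the concrete index it currently points to
curl 'http://<ES-NODE>:9200/_cat/aliases?v'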

ZoomData/SparkSQL

This integration is not currently supported by Aleph2.