# Hive integration
This page describes the install steps to integrate Hive as a data warehouse service for Aleph2, assuming use of Ambari.

Aleph2 (`ikanow-aleph2`) 2.6.1606108+ must be installed.
- Create a directory `/usr/hdp/current/lib/`
- Download v2.2.0 of the elasticsearch-hadoop install ZIP from here
- Unzip the ZIP and copy `elasticsearch-hadoop-2.2.0.jar` into `/usr/hdp/current/lib/` (these steps are scripted in the sketch below)
- In Ambari, under the Hive service, config tab, advanced sub-tab, find the "hive-env template" setting (under "Advanced hive-env"), and add the following line before `export METASTORE_PORT={{hive_metastore_port}}`:
  `export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH:/usr/hdp/current/lib/elasticsearch-hadoop-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hive/lib/commons-httpclient-3.0.1.jar"`
- Restart the Hive services as prompted
- Install Hive clients on all Aleph2 nodes that run the `ikanow-aleph2` daemon

On each of the Aleph2 nodes that run the `ikanow-aleph2` daemon (eg the API nodes), create a symlink in `/opt/aleph2-home/yarn-config` pointing to `/usr/hdp/current/hive-client/conf/hive-site.xml`, eg:

`cd /opt/aleph2-home/yarn-config; ln -sf /usr/hdp/current/hive-client/conf/hive-site.xml`
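For convenience, the directory/JAR steps above can be scripted roughly as follows. This is a minimal sketch only: the download URL and the layout of the unpacked ZIP are assumptions, so adjust them to match the download link above and your environment.

```bash
# Run on the Hive node(s); the URL and ZIP layout below are assumptions
mkdir -p /usr/hdp/current/lib/

cd /tmp
wget https://download.elastic.co/hadoop/elasticsearch-hadoop-2.2.0.zip
unzip elasticsearch-hadoop-2.2.0.zip

# The connector JAR typically sits under dist/ in the unpacked ZIP
cp elasticsearch-hadoop-2.2.0/dist/elasticsearch-hadoop-2.2.0.jar /usr/hdp/current/lib/
```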
## Bucket-level integration
This is described here (basic) and here (advanced).
## ZoomData/Data Warehouse integration
This section describes how to integrate the ZoomData BI tool into Aleph2.
### ZoomData/Hive
Create a home folder in HDFS for the user `anonymous`:

```bash
runuser hdfs -s /bin/sh -c "hadoop fs -mkdir /user/anonymous"
runuser hdfs -s /bin/sh -c "hadoop fs -chown anonymous /user/anonymous"
```
In the UI, create a "SPARKSQL" source ("HIVE_ON_TEZ" also works, though it is disabled by default and needs to be enabled from the admin console; it is equivalent in functionality to the "SPARKSQL" source).
In the source configuration, the JDBC URL should look like `jdbc:hive2://<HIVE2-SERVER-NAME>:10000`, where `HIVE2-SERVER-NAME` can be found in Ambari, under the Hive service, by clicking on the "HiveServer2" link.
User name and password should be left blank.
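Before creating the source, it can be worth checking that HiveServer2 accepts anonymous connections on that URL. One way (a sketch; the server name is a placeholder, as above) is via beeline from any node with the Hive client installed:

```bash
# Connect with no credentials and list the visible tables;
# replace <HIVE2-SERVER-NAME> with the host shown in Ambari
beeline -u "jdbc:hive2://<HIVE2-SERVER-NAME>:10000" -e "SHOW TABLES;"
```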
The available collections will be listed in the "usual" Aleph2 signature format, eg `/bucket/test/batch_flume_test` generates the unique signature ``.
NOTE: Currently it is not possible to handle nested fields using the UI. There is a workaround, which is to select "Custom SQL" and type something like `SELECT address.city, address.country, email.emailid FROM <BUCKET SIGNATURE>`.
ZoomData are considering fixing this (it is also an issue for SparkSQL), but there is no ETA.
NOTE: I initially had some issues with the Hive "TIMESTAMP" field - I think it cached an empty field, and I had to log out and back in, and rename the field, before it was picked up correctly.
### ZoomData/Elasticsearch
Where possible, the Elasticsearch connector should be used, as it is much faster. However, it currently also has significant limitations.
Configuration is similar to above except the "ELASTICSEARCH" source is used.
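Similarly, it can help to confirm that the Elasticsearch REST endpoint the source will point at is reachable before configuring it in the UI (a sketch; the host is a placeholder, and 9200 is the default REST port):

```bash
# List the indices the ELASTICSEARCH source will be able to see
curl -s "http://<ES-HOST>:9200/_cat/indices?v"
```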
Note the following bugs/limitations:
- Objects with nested objects seem to cause problems with 2.0.19 - their support sa
- It is currently not possible to select aliases, which means that timesliced buckets and "ping pong" buckets (`analytic_thread.jobs.output.preserve_existing_data: false`) cannot be well supported (you can handle timesliced buckets one at a time, and create 2 sources for the 2 ping/pong buffers and then switch between them manually).
### ZoomData/SparkSQL
This integration is not currently supported by Aleph2.