Execution Engine for Apache Hadoop - stanislawbartkowski/CP4D GitHub Wiki
Official documentation: https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/svc-welcome/hadoopaddon.html
Another installation description step by step: https://developer.ibm.com/recipes/tutorials/i-can-integrate-ibm-cloud-pak-for-data-with-hadoop-for-hdp-cdh/
Execution Engine for Apache Hadoop is one of the services available on the IBM Cloud Pak for Data (CP4D) platform. The service integrates CP4D with Hadoop; two Hadoop distributions are supported: CDH (Cloudera) and HDP (formerly Hortonworks, now part of Cloudera).
Installation of the service requires installation and configuration on both sides, Hadoop and CP4D. The steps necessary to activate the service are described here: https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/install/install-hadoop.html.
The rest of this article describes the practical steps to activate the service.
In CP4D 3.x, Hive access through Execution Engine for Hadoop supports only the Cloudera (CDH) distribution; HDP (Hortonworks Data Platform) is not supported.
In this article, the following hostnames are used.
Role | Hostname |
---|---|
Edge node for Hadoop part | exile1.fyre.ibm.com |
HDFS Name node | bushily1.fyre.ibm.com |
CP4D node | https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443 |
Livy URL | http://sawtooth1.fyre.ibm.com:8999 |
Hadoop version: HDP
The installation follows the standard pattern of installing a CP4D service. Assembly name: hadoop-addon
bin/cpd-linux -s repo.yaml -a hadoop-addon --verbose --target-registry-password $(oc whoami -t) --target-registry-username kubeadmin -c managed-nfs-storage --insecure-skip-tls-verify -n zen
Verify the installation:
bin/cpd-linux status --assembly hadoop-addon --namespace zen
Packages are obtained through the IBM Passport Advantage (PA) pages. An example: WSPre_Hadoop_Exec_Eng_3011.tar. Unpack and install:
yum install dsxhi-dsx-3.0.1-476.noarch.rpm
Directories after installation.
Directory | Description |
---|---|
/opt/ibm/dsxhi | Binaries and configuration |
/var/log/dsxhi | Log files |
Create the dsxhi service user and make it the owner of the Hadoop Engine files.
useradd dsxhi
chown -R dsxhi:dsxhi /opt/ibm/dsxhi
Collect Hadoop endpoints:
HDP
Endpoint | Example |
---|---|
Hive URL | jdbc:hive2://bushily2.fyre.ibm.com:10000 |
Ambari URL | http://exile1.fyre.ibm.com:8080 |
Ambari credentials | admin/admin |
Cloudera
Endpoint | Example |
---|---|
Cloudera Manager URL | http://aliquant1.fyre.ibm.com:7180 |
Cloudera Manager credentials | admin/admin |
cd /opt/ibm/dsxhi/conf
cp dsxhi_install.conf.template.HDP dsxhi_install.conf
or (Cloudera)
cp dsxhi_install.conf.template.CDH dsxhi_install.conf
Customize the dsxhi_install.conf property file according to the cluster configuration. Below is an example.
As a minimum, it is enough to enable the webhdfs service.
vi dsxhi_install.conf
...
# Mandatory - Specify the username and group of the user (dsxhi service user)
# running the dsxhi service.
dsxhi_serviceuser=dsxhi
dsxhi_serviceuser_group=dsxhi
....
# Mandatory - Specify the Ambari URL and admin username for the HDP cluster.
# User will be prompted for password during installation. If the url is not
# specified, some pre-checks for the install will not be performed.
cluster_manager_url=http://exile1.fyre.ibm.com:8080
cluster_admin=admin
...........
# Optional - Specify the hadoop services that dsxhi service should expose.
#exposed_hadoop_services=webhdfs,webhcat,livyspark2,jeg
exposed_hadoop_services=webhdfs,livyspark2,jeg
..........
# Optional - Specify the list of URL for the dsx local clusters that will
# register this dsxhi service. The URL should include the port number if
# necessary.
#known_dsx_list=https://dsxlcluster1.ibm.com,https://dsxlcluster2.ibm.com:31843
...........
# Optional - Provide client side Hive JDBC url:
# (e.g., hive_jdbc_client_url=jdbc:hive2://remotehost:port)
hive_jdbc_client_url=jdbc:hive2://bushily2.fyre.ibm.com:10000
Add proxy-user properties for the dsxhi service user to core-site.xml.
HDP: Ambari Console->HDFS->Configs->Advanced->Custom core-site
Cloudera: Cloudera Manager -> HDFS -> Configuration -> HDFS (Service Wide) -> Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
Property | Value |
---|---|
hadoop.proxyuser.dsxhi.groups | * |
hadoop.proxyuser.dsxhi.hosts | * |
hadoop.proxyuser.dsxhi.users | * |
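For reference, the same three settings expressed directly as core-site.xml entries (standard Hadoop proxy-user properties; in Ambari or Cloudera Manager they are normally entered through the panels above rather than by editing the file by hand):

```xml
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.users</name>
  <value>*</value>
</property>
```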
Restart all impacted HDP/Cloudera services.
Run the Hadoop Engine installer:
cd /opt/ibm/dsxhi/bin
./install.py
If WebHDFS is listening on a secure port, add the WebHDFS certificate to the Hadoop Engine truststore. A helper script takes care of it:
cd /opt/ibm/dsxhi/bin
util/add_cert.sh bushily1.fyre.ibm.com:50470
Register the CP4D cluster with the Hadoop Engine service:
cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py -a https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443
Verify CP4D and DSXHI endpoints.
./manage_known_dsx.py -l
DSX Local Cluster URL DSXHI Service URL
https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443 https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen
After registering CP4D, a corresponding gateway topology file (endpoint specification) is created.
ll ../gateway/conf/topologies/zen-cpd-zen.xml
-rw-r--r-- 1 dsxhi dsxhi 2924 11-01 12:44 ../gateway/conf/topologies/zen-cpd-zen.xml
The Hadoop Engine service is managed with the following scripts:
cd /opt/ibm/dsxhi/bin
./start.py
./status.py
./stop.py
Register the Hadoop cluster in CP4D:
CP4D Console->Platform configuration->Systems integration->New integration
Property | Value |
---|---|
Name | (any name) MyCluster |
Service user ID | dsxhi |
Service URL | (output from ./manage_known_dsx.py -l command) https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen |
In the All runtimes panel, push "Jupyter with Python". It can take several minutes to complete. Make sure that the Status is reported as "Push succeeded".
Add a connection to the Watson Studio project.
Project->Assets->Add to project (top button)->Connection->HDFS via Execution Engine for Hadoop
WebHDFS URL: the Service URL provided while registering the Hadoop cluster, with /webhdfs/v1 appended.
Example: https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen/webhdfs/v1
SSL certificate: copy and paste the certificate presented by the Hadoop Engine gateway. It can be obtained with:
openssl s_client -connect exile1.fyre.ibm.com:8443
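Alternatively, a few lines of Python can fetch the PEM certificate (a minimal sketch using only the standard library; the hostname and port are the example values used in this article):

```python
# Sketch: retrieve the PEM certificate presented by the Hadoop Engine gateway
# (same host and port as the openssl command above).
import ssl

pem = ssl.get_server_certificate(("exile1.fyre.ibm.com", 8443))
print(pem)  # paste this PEM block into the SSL certificate field
```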
On the Hadoop side, create a test file and copy it to HDFS.
echo "Hello" >hello.txt
echo "I am" >>hello.txt
echo "Stanisław" >>hello.txt
hdfs dfs -copyFromLocal hello.txt /tmp
hdfs dfs -ls /tmp
Found 1 items
-rw-r--r-- 3 hdfs admin 6 2020-11-02 22:24 /tmp/hello.txt
Cloud Pak for Data -> Connection -> Platform connections -> New connection -> HDFS via Execution Engine for Hadoop
As the WebHDFS URL, copy and paste the WebHDFS endpoint from the Hadoop platform integration panel.
Example: https://internal-nginx-svc:12443/ibm-dsxhi-cdpcluster/webhdfs/v1
Project -> Add to project -> Connection -> From platform -> HDFSWebHDFSConnection (created as above) -> Create
Project->Add to project (top button)->Connected data->Select source
After opening the notebook, Watson Studio can inject Python code that provides the access data for the remote HDFS file.
Data -> Files -> HelloDataSet -> Insert to code -> Credentials
Using the injected data, we can build a URL and access HDFS through the Knox WebHDFS REST API. https://cwiki.apache.org/confluence/display/KNOX/Examples+WebHDFS
This cell is generated automatically.
# @hidden_cell
# The following code contains the credentials for a connection in your Project.
# You might want to remove those credentials before you share your notebook.
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()
HelloDataSet_credentials = wslib.get_connected_data("HelloDataSet")
Insert Python code
h = HelloDataSet_credentials
url = h['url'] + h['datapath']
headers = {'authorization': 'Bearer ' + h['access_token']}
params = {'op' : 'GETFILESTATUS'}
import requests
r = requests.get(url,headers=headers,params=params)
print(r.content)
Output: b'{"FileStatus":{"accessTime":1610404702071,"blockSize":134217728,"childrenNum":0,"fileId":21279,"group":"admin","length":22,"modificationTime":1610404702276,"owner":"hdfs","pathSuffix":"","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"}}'
# read HDFS file
params= {'op': 'OPEN'}
r = requests.get(url,headers=headers,params=params)
print(r.text)
Output: 'Hello\nI am\nStanisław\n'
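The same pattern works for other WebHDFS operations. Below is a sketch that lists the /tmp directory used earlier, reusing the injected credentials object h and the bearer-token headers built above (it assumes h['url'] is the base WebHDFS endpoint, as in the generated cell):

```python
# Sketch: list the /tmp directory through the same Knox WebHDFS endpoint,
# reusing h (injected credentials) and headers from the cells above.
import requests

dir_url = h['url'] + '/tmp'
r = requests.get(dir_url, headers=headers, params={'op': 'LISTSTATUS'})
print(r.json())
```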
A Watson Studio Jupyter notebook can run a distributed Spark application on the remote Hadoop cluster using HDFS/Hadoop resources. The integration is done through JEG (Jupyter Enterprise Gateway), one of the Execution Engine for Hadoop services.
Make sure that the JEG endpoint is defined in the CP4D Hadoop integration panel.
Make sure that the JEG service is started and running.
cd /opt/ibm/dsxhi/bin
./status.py
--Status of DSXHI REST server
dsxhi rest is running with PID 11621.
--Status of JEG
JEG is running with PID 11847.
--Status of gateway server
Gateway is running with PID 11909.
The admin user cannot start a remote JEG session; this restriction is hardcoded in the configuration file.
vi /opt/ibm/dsxhi/jeg/bin/jeg.sh
function appStart {
export PYTHON_PATH=${VENV_DIR}/lib/python3.6
export PATH=${VENV_DIR}/bin:$PATH
# export EG_UNAUTHORIZED_USERS="root,admin,${dsxhi_serviceuser}"
export EG_UNAUTHORIZED_USERS="root,${dsxhi_serviceuser}"
Either remove admin from the excluded list (not recommended) or log on to CP4D as a different user (recommended).
Make sure that the SPARK_HOME variable is properly set in the kernel specification. It can be missing if Spark was installed in HDP after the installation of Execution Engine for Hadoop.
vi /opt/ibm/dsxhi/jeg/kernelspecs/kernels/spark_python_yarn_cluster/kernel.json
{
"language":"python3",
"display_name":"Spark - Python (YARN Cluster Mode)",
"metadata":{
"process_proxy":{
"class_name":"enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
}
},
"env":{
"SPARK_HOME":"/usr/hdp/current/spark2-client",
"PYSPARK_PYTHON":"/usr/bin/python",
"PYTHONPATH":"/opt/ibm/dsxhi/lib/virtualenvs/jeg_base/lib/python3.6/site-packages:/python",
"SPARK_OPTS":"--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.sql.catalogImplementation=hive --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=.local ${KERNEL_EXTRA_SPARK_OPTS}",
"LAUNCH_OPTS":"",
"HADOOP_CONF_DIR":"",
"EG_IMPERSONATION_ENABLED":"False",
"EG_KERNEL_LAUNCH_TIMEOUT":"120"
},
"argv":[
"/opt/ibm/dsxhi/jeg/kernelspecs/spark_python_yarn_cluster/bin/run.sh",
"--RemoteProcessProxy.kernel-id",
"{kernel_id}",
"--RemoteProcessProxy.response-address",
"{response_address}",
"--RemoteProcessProxy.port-range",
"{port_range}",
"--RemoteProcessProxy.spark-context-initialization-mode",
"lazy"
]
}
Cloudera: SPARK_HOME looks like:
"SPARK_HOME": "/opt/cloudera/parcels/CDH-7.1.5-1.cdh7.1.5.p0.7431829/lib/spark"
Project->Environment-> New environment definition
Project->Add to project->Notebook->Select runtime (JEG environment)
A quick check that the kernel is running on the Hadoop cluster and can read HDFS:
!hostname
rr = spark.sparkContext.textFile("/user/admin/hello.txt")
print(rr.take(3))
SparkSQL example.
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
.builder \
.appName("Python Spark SQL") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
spark.sql("SHOW DATABASES").show()
Define the dsxhi user as a superuser in Cloudera Livy.
Cloudera Manager -> Livy -> Configuration -> Admin Users/livy.superusers
Follow the Livy example below. Make sure that the Livy session is created in Cloudera; it can take several minutes to become ready.
%%spark -s $session_name
rr = spark.sparkContext.textFile("/tmp/hello.txt")
print(rr.take(3))
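Under the hood, sparkmagic talks to the Livy REST API. The sketch below illustrates the same flow directly against the example Livy URL from the hostname table; it assumes a non-Kerberized endpoint reachable from the notebook, while a real CP4D setup goes through the registered Hadoop integration instead:

```python
# Sketch only: create a Livy PySpark session and run one statement via the REST API.
# sawtooth1.fyre.ibm.com:8999 is the example Livy URL from the hostname table.
import time
import requests

livy = "http://sawtooth1.fyre.ibm.com:8999"

# create the session (state moves from "starting" to "idle"; this can take minutes)
session = requests.post(livy + "/sessions", json={"kind": "pyspark"}).json()
sid = session["id"]
while requests.get(f"{livy}/sessions/{sid}").json()["state"] != "idle":
    time.sleep(10)

# submit the same read as in the %%spark cell above
code = 'print(spark.sparkContext.textFile("/tmp/hello.txt").take(3))'
requests.post(f"{livy}/sessions/{sid}/statements", json={"code": code})
```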
Register the dsxhi user in Kerberos and obtain the appropriate keytab file. Modify the Hadoop Engine configuration file.
vi /opt/ibm/dsxhi/conf/dsxhi_install.conf
# If the HDP cluster is kerberized, it is mandatory to specify the complete
# path to the keytab for the dsxhi service user and the spnego keytab.
# If the HDP cluster is not kerberized, this field should be left blank.
dsxhi_serviceuser_keytab=/etc/security/keytabs/dsxhi.keytab
dsxhi_spnego_keytab=/etc/security/keytabs/spnego.service.keytab
In Cloudera (CDP), it is necessary to activate the SPNEGO feature.
https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/security-kerberos-authentication/topics/cm-security-external-authentication-spnego.html
The next step is to identify a SPNEGO keytab on the edge node. In a CDP cluster, all keytabs are stored in the /var/run/cloudera-scm-agent/process directory.
klist -kt /var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab
Keytab name: FILE:/var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/25/2021 01:08:57 HTTP/[email protected]
1 07/25/2021 01:08:57 hdfs/[email protected]
This keytab contains the SPNEGO (HTTP) principal. Copy it to the /home/dsxhi directory and make it accessible to the dsxhi user.
cp /var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab /home/dsxhi/
chown dsxhi: /home/dsxhi/hdfs.keytab
vi /opt/ibm/dsxhi/conf/dsxhi_install.conf
# If the CDH cluster is kerberized, it is mandatory to specify the complete
# path to the keytab for the dsxhi service user and the spnego keytab.
# If the CDH cluster is not kerberized, these properties should be left blank.
dsxhi_serviceuser_keytab=/home/dsxhi/dsxhi.keytab
dsxhi_spnego_keytab=/home/dsxhi/hdfs.keytab
If HDP was Kerberized after the Hadoop Engine installation, it is necessary to reinstall the engine.
cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py -a https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443
./uninstall.py
./install.py
Also, the Hadoop integration in CP4D should be recreated.
The Hadoop proxy user dsxhi impersonates the CP4D user to launch YARN Spark jobs. The following steps should be performed for CP4D users:
- register the user in Kerberos/AD
- create an HDFS /user/{CP4D user} directory and make the CP4D user its owner
- if Ranger is installed, create a policy authorizing CP4D users to submit YARN jobs
CP4D users can be mapped to any other HDP/Cloudera user.
https://knox.apache.org/books/knox-0-12-0/user-guide.html#Default+Identity+Assertion+Provider
Modify the Hadoop Engine gateway configuration file:
vi /opt/ibm/dsxhi/gateway/conf/topologies/zen-cpd-zen.xml
..........
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
<param>
<name>principal.mapping</name>
<value>admin=guest;</value>
</param>
</provider>
.........
Restart the gateway:
cd /opt/ibm/dsxhi/bin
./stop.py
./start.py
The CP4D admin user will reach the Hadoop cluster as the guest user. All prerequisites regarding Hadoop users (Kerberos registration, HDFS home directory, and YARN authorization) should be applied to the guest user as well.