Execution Engine for Apache Hadoop

Introduction

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/svc-welcome/hadoopaddon.html
Another step-by-step installation description: https://developer.ibm.com/recipes/tutorials/i-can-integrate-ibm-cloud-pak-for-data-with-hadoop-for-hdp-cdh/

Execution Engine for Apache Hadoop is one of the services available on the IBM Cloud Pak for Data (CP4D) platform. The service integrates CP4D with Hadoop. Two Hadoop distributions are supported: CDH (Cloudera) and HDP (formerly Hortonworks, now part of Cloudera).

Installing the service requires installation and configuration on both sides, Hadoop and CP4D. The steps necessary to activate the service are described here: https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/install/install-hadoop.html.

The rest of this article describes the practical steps to activate the service.

Prerequisites

In CP4D 3.x, Hive access through Execution Engine for Hadoop supports only the Cloudera (CDH) distribution. HDP (Hortonworks Data Platform) is not supported.

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/install/installing-hadoop-cluster.html

In this article, the following hostnames are used.


Role                        Hostname
Edge node for Hadoop part   exile1.fyre.ibm.com
HDFS Name node              bushily1.fyre.ibm.com
CP4D node                   https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443
Livy URL                    http://sawtooth1.fyre.ibm.com:8999

Hadoop version: HDP

Install Execution Engine for Apache Hadoop

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/cpd/svc/wsj/install-hadoop.html

The installation follows the standard pattern for installing a CP4D service. The assembly name is hadoop-addon.

bin/cpd-linux -s repo.yaml -a hadoop-addon --verbose --target-registry-password $(oc whoami -t) --target-registry-username kubeadmin -c managed-nfs-storage --insecure-skip-tls-verify -n zen


Verify

bin/cpd-linux status --assembly hadoop-addon --namespace zen

Install HEE component on HDP cluster

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/install/installing-hadoop-cluster.html

Obtain and install rpm packages

Packages are obtained through the IBM Passport Advantage (PA) pages. An example: WSPre_Hadoop_Exec_Eng_3011.tar. Unpack and install:

yum install dsxhi-dsx-3.0.1-476.noarch.rpm

Directories after installation.

Directory         Description
/opt/ibm/dsxhi    Binaries and configuration
/var/log/dsxhi    Log files

Service user

Create the dsxhi service user and make it the owner of the Hadoop Engine files.

useradd dsxhi
chown -R dsxhi:dsxhi /opt/ibm/dsxhi

Configuration

Collect Hadoop endpoints:

HDP

Endpoint             Example
Hive URL             jdbc:hive2://bushily2.fyre.ibm.com:10000
Ambari URL           http://exile1.fyre.ibm.com:8080
Ambari credentials   admin/admin

Cloudera

Endpoint                       Example
Cloudera Manager URL           http://aliquant1.fyre.ibm.com:7180
Cloudera Manager credentials   admin/admin

cd /opt/ibm/dsxhi/conf
cp dsxhi_install.conf.template.HDP dsxhi_install.conf

or (Cloudera)

cp dsxhi_install.conf.template.CDH dsxhi_install.conf

Customize the dsxhi_install.conf property file according to the cluster configuration. Below is an example.

At a minimum, it is enough to expose the webhdfs service.

vi dsxhi_install.conf

...
# Mandatory - Specify the username and group of the user (dsxhi service user)
# running the dsxhi service.
dsxhi_serviceuser=dsxhi
dsxhi_serviceuser_group=dsxhi
....
# Mandatory - Specify the Ambari URL and admin username for the HDP cluster.
# User will be prompted for password during installation. If the url is not
# specified, some pre-checks for the install will not be performed.
cluster_manager_url=http://exile1.fyre.ibm.com:8080
cluster_admin=admin
...........
# Optional - Specify the hadoop services that dsxhi service should expose.
#exposed_hadoop_services=webhdfs,webhcat,livyspark2,jeg
exposed_hadoop_services=webhdfs,livyspark2,jeg
..........
# Optional - Specify the list of URL for the dsx local clusters that will
# register this dsxhi service. The URL should include the port number if
# necessary.
#known_dsx_list=https://dsxlcluster1.ibm.com,https://dsxlcluster2.ibm.com:31843
...........
# Optional - Provide client side Hive JDBC url:
# (e.g., hive_jdbc_client_url=jdbc:hive2://remotehost:port)
hive_jdbc_client_url=jdbc:hive2://bushily2.fyre.ibm.com:10000

Configure HDFS

HDP: Ambari Console->HDFS->Configs->Advanced->Custom core-site

Cloudera: Cloudera Manager -> HDFS -> Configuration -> HDFS (Service Wide) -> Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml

Property                        Value
hadoop.proxyuser.dsxhi.groups   *
hadoop.proxyuser.dsxhi.hosts    *
hadoop.proxyuser.dsxhi.users    *

Restart all impacted HDP/Cloudera services.

Install

cd /opt/ibm/dsxhi/bin
./install.py

WebHDFS on SSL

If WebHDFS is listening on a secure (SSL) port, add the WebHDFS certificate to the Hadoop Engine truststore. This is done automatically by a helper script.

cd /opt/ibm/dsxhi/bin
util/add_cert.sh bushily1.fyre.ibm.com:50470

Register CP4D URL

cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py -a https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443

Verify CP4D and DSXHI endpoints.

./manage_known_dsx.py -l

DSX Local Cluster URL			                                            DSXHI Service URL
https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443			https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen

After registering CP4D, a corresponding endpoint specification is created.

ll ../gateway/conf/topologies/zen-cpd-zen.xml

-rw-r--r-- 1 dsxhi dsxhi 2924 11-01 12:44 ../gateway/conf/topologies/zen-cpd-zen.xml

Start/Stop the service

cd /opt/ibm/dsxhi/bin
./start.py
./status.py
./stop.py

Register Hadoop Engine in CP4D

CP4D Console->Platform configuration->Systems integration->New integration

Property          Value
Name              MyCluster (any name)
Service user ID   dsxhi
Service URL       https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen (output of the ./manage_known_dsx.py -l command)

In the All runtimes panel, push "Jupyter with Python". It can take several minutes to complete. Make sure that the Status is reported as "Push succeeded".

Verify

Verify HDFS

Create connection

Add a connection to Watson Studio Project.
Project->Assets->Add to project (top button)->Connection->HDFS via Execution Engine for Hadoop

WebHDFS URL: the Service URL provided while registering the Hadoop cluster, with /webhdfs/v1 appended.
Example: https://exile1.fyre.ibm.com:8443/gateway/zen-cpd-zen/webhdfs/v1

SSL certificate: copy and paste the certificate presented by the Hadoop Engine gateway. It can be obtained with openssl:

openssl s_client -connect exile1.fyre.ibm.com:8443
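
If preferred, the same PEM certificate can be fetched with a short Python snippet instead of openssl. This is a minimal sketch using only the standard ssl module; the hostname and port are the example values used in this article.

import ssl

# Fetch the PEM certificate presented by the Hadoop Engine (Knox) gateway so it
# can be pasted into the SSL certificate field of the connection.
gateway_host = "exile1.fyre.ibm.com"   # example edge node from this article
gateway_port = 8443

pem_cert = ssl.get_server_certificate((gateway_host, gateway_port))
print(pem_cert)   # -----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----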

Prepare test data in HDFS

On the Hadoop/HDFS node:

echo "Hello" >hello.txt
echo "I am" >>hello.txt
echo "Stanisław" >>hello.txt
hdfs dfs -copyFromLocal hello.txt /tmp
hdfs dfs -ls /tmp

Found 1 items
-rw-r--r--   3 hdfs admin          6 2020-11-02 22:24 /tmp/hello.txt

Add new platform connection

Cloud Pak for Data -> Connection -> Platform connections -> New connection -> HDFS via Execution Engine for Hadoop

As the WebHDFS URL, copy and paste the WebHDFS endpoint from the Hadoop platform integration panel.
Example: https://internal-nginx-svc:12443/ibm-dsxhi-cdpcluster/webhdfs/v1

Add platform connection to CP4D project

Project -> Add to project -> Connection -> From platform -> HDFSWebHDFSConnection (created as above) -> Create

Add dataset to CP4D project

Project->Add to project (top button)->Connected data->Select source

Jupyter notebook accessing HDFS data

After opening the notebook, Watson Studio can inject Python code that gives access to the remote HDFS file.

Data -> Files -> HelloDataSet -> Insert to code -> Credentials

Using the injected data, we can build a URL and access HDFS through the Knox WebHDFS REST API. https://cwiki.apache.org/confluence/display/KNOX/Examples+WebHDFS

This cell is generated automatically.

# @hidden_cell
# The following code contains the credentials for a connection in your Project.
# You might want to remove those credentials before you share your notebook.

from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()
HelloDataSet_credentials = wslib.get_connected_data("HelloDataSet")

Insert Python code

h = HelloDataSet_credentials
url = h['url'] + h['datapath']
headers = {'authorization': 'Bearer ' +  h['access_token']}
params = {'op' : 'GETFILESTATUS'}

import requests
r = requests.get(url,headers=headers,params=params)

print(r.content)
Output: b'{"FileStatus":{"accessTime":1610404702071,"blockSize":134217728,"childrenNum":0,"fileId":21279,"group":"admin","length":22,"modificationTime":1610404702276,"owner":"hdfs","pathSuffix":"","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"}}'
# read HDFS file
params= {'op': 'OPEN'}
r = requests.get(url,headers=headers,params=params)
print(r.text)

Output: 'Hello\nI am\nStanisław\n'
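
The same url and headers can be reused for other standard WebHDFS operations exposed through the Knox gateway, for example listing the directory that contains the data set. A minimal sketch (the response layout follows the standard WebHDFS LISTSTATUS format):

# list the directory containing the data set
import os
import requests

dir_url = h['url'] + os.path.dirname(h['datapath'])   # e.g. .../webhdfs/v1/tmp
r = requests.get(dir_url, headers=headers, params={'op': 'LISTSTATUS'})
for entry in r.json()['FileStatuses']['FileStatus']:
    print(entry['pathSuffix'], entry['type'], entry['length'])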

Jupyter notebook integrated with remote Hadoop cluster

A Watson Studio Jupyter notebook can run a distributed Spark application on the remote Hadoop cluster, using the cluster's HDFS and Yarn resources. The integration is done by JEG (Jupyter Enterprise Gateway), one of the Execution Engine for Hadoop services.

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/local/hadoopmodels.html

Endpoints

Make sure that the JEG endpoint is defined in the CP4D Hadoop integration panel.

Prerequisites

Make sure that JEG service is started and running.

cd /opt/ibm/dsxhi/bin
./status.py

--Status of DSXHI REST server
dsxhi rest is running with PID 11621.
--Status of JEG
JEG is running with PID 11847.
--Status of gateway server
Gateway is running with PID 11909.

The admin user cannot start a remote JEG session; the restriction is hardcoded in the JEG configuration file.

vi /opt/ibm/dsxhi/jeg/bin/jeg.sh

function appStart {
  export PYTHON_PATH=${VENV_DIR}/lib/python3.6
  export PATH=${VENV_DIR}/bin:$PATH
#  export EG_UNAUTHORIZED_USERS="root,admin,${dsxhi_serviceuser}"
  export EG_UNAUTHORIZED_USERS="root,${dsxhi_serviceuser}"

Either remove admin from the excluded list (not recommended) or log on to CP4D as a different user (recommended).

Make sure that the SPARK_HOME variable is properly set in the kernel specification. It can be missing if Spark was installed on the cluster after the installation of Execution Engine for Hadoop.

vi /opt/ibm/dsxhi/jeg/kernelspecs/kernels/spark_python_yarn_cluster/kernel.json


{
   "language":"python3",
   "display_name":"Spark - Python (YARN Cluster Mode)",
   "metadata":{
      "process_proxy":{
         "class_name":"enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
      }
   },
   "env":{
      "SPARK_HOME":"/usr/hdp/current/spark2-client",
      "PYSPARK_PYTHON":"/usr/bin/python",
      "PYTHONPATH":"/opt/ibm/dsxhi/lib/virtualenvs/jeg_base/lib/python3.6/site-packages:/python",
      "SPARK_OPTS":"--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.sql.catalogImplementation=hive --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=.local ${KERNEL_EXTRA_SPARK_OPTS}",
      "LAUNCH_OPTS":"",
      "HADOOP_CONF_DIR":"",
      "EG_IMPERSONATION_ENABLED":"False",
      "EG_KERNEL_LAUNCH_TIMEOUT":"120"
   },
   "argv":[
      "/opt/ibm/dsxhi/jeg/kernelspecs/spark_python_yarn_cluster/bin/run.sh",
      "--RemoteProcessProxy.kernel-id",
      "{kernel_id}",
      "--RemoteProcessProxy.response-address",
      "{response_address}",
      "--RemoteProcessProxy.port-range",
      "{port_range}",
      "--RemoteProcessProxy.spark-context-initialization-mode",
      "lazy"
   ]
}

Cloudera: SPARK_HOME looks like:

"SPARK_HOME": "/opt/cloudera/parcels/CDH-7.1.5-1.cdh7.1.5.p0.7431829/lib/spark

Create JEG Python 3 environment

Project->Environment->New environment definition

Create a test notebook

Project->Add to project->Notebook->Select runtime (JEG environment)

!hostname
rr = spark.sparkContext.textFile("/user/admin/hello.txt")
print(rr.take(3))

SparkSQL example.

from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row

warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()

Livy2

Define the dsxhi user as a superuser in Cloudera Livy.

Cloudera Manager -> Livy -> Configuration -> Admin Users/livy.superusers

Follow the Livy example:

https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_latest/wsj/local/hadoopmodels.html

Make sure that the Livy session is created in Cloudera; it can take several minutes to become ready.

%%spark -s $session_name 
rr = spark.sparkContext.textFile("/tmp/hello.txt")
print(rr.take(3))
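
If Hive access is configured on the cluster, the same Livy session can also run Spark SQL. A minimal sketch reusing the session created above:

%%spark -s $session_name
# verify Hive access through the Livy session
spark.sql("SHOW DATABASES").show()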

Kerberized HDP

Register the dsxhi user in Kerberos and obtain the appropriate keytab file. Then modify the Hadoop Engine configuration file.

vi /opt/ibm/dsxhi/conf/dsxhi_install.conf

# If the HDP cluster is kerberized, it is mandatory to specify the complete
# path to the keytab for the dsxhi service user and the spnego keytab.
# If the HDP cluster is not kerberized, this field should be left blank.
dsxhi_serviceuser_keytab=/etc/security/keytabs/dsxhi.keytab
dsxhi_spnego_keytab=/etc/security/keytabs/spnego.service.keytab

In Cloudera (CDP) it is necessary to activate the SPNEGO feature.
https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/security-kerberos-authentication/topics/cm-security-external-authentication-spnego.html
The next step is to identify a SPNEGO keytab on the edge node. In a CDP cluster, all keytabs are stored in the /var/run/cloudera-scm-agent/process directory.

klist -kt /var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab

Keytab name: FILE:/var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   1 07/25/2021 01:08:57 HTTP/[email protected]
   1 07/25/2021 01:08:57 hdfs/[email protected]

This keytab contains the SPNEGO (HTTP) principal. Copy it to the /home/dsxhi directory and make it readable by the dsxhi user.

cp /var/run/cloudera-scm-agent/process/1546338628-hdfs-DATANODE/hdfs.keytab /home/dsxhi/
chown dsxhi: /home/dsxhi/hdfs.keytab

vi /opt/ibm/dsxhi/conf/dsxhi_install.conf

# If the CDH cluster is kerberized, it is mandatory to specify the complete
# path to the keytab for the dsxhi service user and the spnego keytab.
# If the CDH cluster is not kerberized, these properties should be left blank.
dsxhi_serviceuser_keytab=/home/dsxhi/dsxhi.keytab
dsxhi_spnego_keytab=/home/dsxhi/hdfs.keytab

If the cluster was Kerberized after the Hadoop Engine installation, it is necessary to reinstall the Hadoop Engine.

cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py -a https://zen-cpd-zen.apps.rumen.os.fyre.ibm.com:443
./uninstall.py
./install.py

In CP4D, the Hadoop integration should also be recreated.

The Hadoop proxy user dsxhi impersonates the CP4D user to launch a Yarn Spark job. The following steps should be performed for every CP4D user:

  • register the user in Kerberos/AD
  • create the HDFS /user/{CP4D user} directory and make the CP4D user its owner
  • if Ranger is installed, create a policy authorizing CP4D users to submit Yarn jobs

CP4D user mapping

CP4D users can be mapped to any other HDP/Cloudera user.

https://knox.apache.org/books/knox-0-12-0/user-guide.html#Default+Identity+Assertion+Provider

Modify the Hadoop Engine gateway topology file:

vi /opt/ibm/dsxhi/gateway/conf/topologies/zen-cpd-zen.xml

..........
  <provider>
         <role>identity-assertion</role>
         <name>Default</name>
         <enabled>true</enabled>
        <param>
            <name>principal.mapping</name>
            <value>admin=guest;</value>
        </param>
      </provider>
.........

Restart the gateway:

cd /opt/ibm/dsxhi/bin
./stop.py
./start.py

The CP4D admin user will reach the Hadoop cluster as the guest user. All prerequisites regarding Hadoop users (Kerberos registration, HDFS home directory and Yarn authorization) should be applied to the guest user as well.
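
The effective mapping can be checked from a notebook by asking WebHDFS for the home directory of the calling user. A minimal sketch reusing the h credentials and headers from the HDFS example above; GETHOMEDIRECTORY is a standard WebHDFS operation:

import requests

# the effective Hadoop user is visible in the returned home directory, e.g. /user/guest
r = requests.get(h['url'] + '/', headers=headers, params={'op': 'GETHOMEDIRECTORY'})
print(r.json()['Path'])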
