SparkSQL, Spark - stanislawbartkowski/hdpactivedirectory GitHub Wiki

SparkSQL, Thrift Server

Configure

Like Hive, Spark SQL / Thrift Server can be accessed through the beeline command line. Identify the host where the Spark2 Thrift Server is installed and the connection port (default: 10016). The beeline command line looks like:

kinit ...
beeline -u "jdbc:hive2://aa1.fyre.ibm.com:10016/;principal=hive/[email protected];transportMode=binary;httpPath=cliservice"

[perf@varlet1 ~]$ thr
Connecting to jdbc:hive2://aa1.fyre.ibm.com:10016/;principal=hive/[email protected];transportMode=binary;httpPath=cliservice
Connected to: Spark SQL (version 2.3.0.2.6.5.1050-37)
Driver: Hive JDBC (version 1.2.1000.2.6.5.1050-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.5.1050-37 by Apache Hive
0: jdbc:hive2://aa1.fyre.ibm.com:10016/> show databases;
+---------------+--+
| databaseName  |
+---------------+--+
| bigsql        |
| datalake      |
| default       |
| perfdb        |
| perfte        |
+---------------+--+
5 rows selected (0,129 seconds)
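The `thr` command at the prompt above suggests a local shell alias or wrapper hiding the long beeline invocation. A minimal sketch of such a wrapper, using the example host, port, and principal from this page (adjust all three for your cluster; the alias name `thr` itself is taken from the transcript):

```shell
# Hypothetical wrapper for the beeline connection shown above.
# Host, port and principal are the example values from this page.
THRIFT_HOST="aa1.fyre.ibm.com"
THRIFT_PORT="10016"
PRINCIPAL="hive/[email protected]"

# Compose the Spark Thrift Server JDBC URL.
JDBC_URL="jdbc:hive2://${THRIFT_HOST}:${THRIFT_PORT}/;principal=${PRINCIPAL};transportMode=binary;httpPath=cliservice"

echo "$JDBC_URL"

# Obtain a Kerberos ticket first, then connect:
#   kinit <user>
#   beeline -u "$JDBC_URL"
```

Defining `alias thr='beeline -u "$JDBC_URL"'` in the shell profile then reproduces the short command seen in the transcript.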

Ranger

Although SparkSQL runs on top of Hive tables, the Hive Ranger policies do not apply to SparkSQL: SparkSQL is a separate SQL engine and bypasses Hive. There is no dedicated Ranger plugin for SparkSQL, so protection has to be orchestrated by other means. See: https://hortonworks.com/blog/sparksql-ranger-llap-via-spark-thrift-server-bi-scenarios-provide-row-column-level-security-masking/

SparkSQL shell, Spark

Spark SQL can also be launched directly, without the Thrift Server. Expect a startup delay before the Spark shell is up and ready.

export SPARK_MAJOR_VERSION=2
kinit
spark-sql --master yarn --num-executors 2 -S

...............
spark-sql> show databases;
bigsql
datalake
default
perfdb
perfte
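For scripted use, `spark-sql` can also evaluate a single statement with `-e` instead of opening an interactive shell. A hedged sketch with the same options as above (`-S` suppresses the informational chatter; the command is composed into a variable here so it can be inspected before running):

```shell
export SPARK_MAJOR_VERSION=2

# Compose the non-interactive invocation: -e runs one statement and exits,
# -S keeps the output quiet.
SPARK_SQL_CMD="spark-sql --master yarn --num-executors 2 -S -e 'show databases'"

echo "$SPARK_SQL_CMD"

# On a cluster node with a valid Kerberos ticket this would print the
# database list shown above:
#   kinit <user>
#   eval "$SPARK_SQL_CMD"
```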

pyspark

The Python shell is started the same way (the first line of the transcript confirms SPARK_MAJOR_VERSION=2 is in effect):

SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.1050-37
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x7f89c94c57d0>
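From the `spark` SparkSession object shown above, the same "show databases" query can be issued with `spark.sql(...)`. A minimal sketch that feeds the one-line PySpark snippet into a non-interactive `pyspark` run (the snippet is composed into a variable so it can be inspected; running it requires the cluster setup and Kerberos ticket described earlier):

```shell
# One-line PySpark snippet: spark.sql(...) returns a DataFrame,
# and .show() prints it as a table.
PYSPARK_SNIPPET='spark.sql("show databases").show()'

echo "$PYSPARK_SNIPPET"

# On a cluster node with a valid Kerberos ticket:
#   export SPARK_MAJOR_VERSION=2
#   kinit <user>
#   echo "$PYSPARK_SNIPPET" | pyspark --master yarn
```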