Azure Databricks Guzzle Setup
- Create the following databases in SQL Server:
  - hive_metastore - to be used as the external Hive metastore
  - guzzle - to be used as the Guzzle repository where job audits, batch records, recon/DQ outputs etc. will be stored
  - guzzle_api_db - database for the Guzzle API server
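
  As a minimal sketch, the three databases can be created with sqlcmd (assuming sqlcmd is available; <sqlserver-host>, <admin-user> and <admin-password> are placeholders for your environment):

  ```bash
  # Hypothetical example: create the three databases used by Guzzle with sqlcmd.
  sqlcmd -S <sqlserver-host> -U <admin-user> -P <admin-password> \
    -Q "CREATE DATABASE hive_metastore; CREATE DATABASE guzzle; CREATE DATABASE guzzle_api_db;"
  ```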
- Create a Databricks workspace and follow the steps in https://docs.azuredatabricks.net/user-guide/advanced/external-hive-metastore.html to set up the external Hive metastore for the Databricks cluster. Add the following to the cluster Spark config:
  ```
  spark.sql.hive.metastore.version 1.2.1
  spark.sql.hive.metastore.jars builtin
  spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<sqlserver-host>;database=hive_metastore;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=30
  spark.hadoop.javax.jdo.option.ConnectionUserName <database-username>
  spark.hadoop.javax.jdo.option.ConnectionPassword <password>
  spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
  datanucleus.autoCreateSchema true
  datanucleus.fixedDatastore false
  ```
- Create an Azure Blob storage account.
- Create a container named guzzlehome, where files related to the Guzzle home (configs/binaries/libraries) will be stored.
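
  For example, the storage account and container can be created with the Azure CLI (a sketch; the resource group, location and SKU are assumptions to adjust for your environment):

  ```bash
  # Sketch: create the storage account and the guzzlehome container using the Azure CLI.
  az storage account create --name <storage-account-name> --resource-group <resource-group> --location <region> --sku Standard_LRS
  az storage container create --name guzzlehome --account-name <storage-account-name> --account-key <storage-account-key>
  ```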
- In the Databricks cluster Spark config, add the following configuration:
  ```
  fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-key>
  ```
- Upload the Guzzle home to the guzzlehome container.
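
  One way to do this is with the Azure CLI (a sketch, assuming the Guzzle home has been extracted to a local directory named ./guzzle_home, which is a hypothetical path):

  ```bash
  # Sketch: upload the local Guzzle home directory to the guzzlehome container.
  az storage blob upload-batch \
    --destination guzzlehome \
    --source ./guzzle_home \
    --account-name <storage-account-name> \
    --account-key <storage-account-key>
  ```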
- Generate an API authentication token as described in https://docs.azuredatabricks.net/api/latest/authentication.html.
- Mount the guzzlehome container into DBFS by following the steps in https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage:
  ```scala
  dbutils.fs.mount(
    source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
    mountPoint = "/mnt/<mount-name>",
    extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
  ```
- Set the environment variable GUZZLE_HOME=/dbfs/<directory where guzzlehome container is mounted> on the Spark cluster.
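
  For example, if the container were mounted at /mnt/guzzlehome (a hypothetical mount name), the cluster's environment variables would include:

  ```bash
  # Hypothetical example: GUZZLE_HOME points at the DBFS mount chosen in the previous step.
  GUZZLE_HOME=/dbfs/mnt/guzzlehome
  ```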
- Create guzzle-log4j.properties in the /dbfs directory with the following content:
  ```properties
  log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
  log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
  log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
  log4j.logger.com.justanalytics=INFO, RollingAppender
  ```
- Write an init script that appends the content of /dbfs/guzzle-log4j.properties to /databricks/spark/dbconf/log4j/driver/log4j.properties, as in the sketch below.
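
  A minimal sketch of such an init script (the script location dbfs:/init/guzzle-log4j.sh is an assumption; store it wherever your cluster init scripts live and register it as a cluster-scoped init script):

  ```bash
  #!/bin/bash
  # Sketch of a cluster init script (e.g. dbfs:/init/guzzle-log4j.sh, a hypothetical path).
  # Appends the Guzzle log4j settings to the driver log4j configuration at cluster start.
  cat /dbfs/guzzle-log4j.properties >> /databricks/spark/dbconf/log4j/driver/log4j.properties
  ```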
- Install the guzzle-azure-databricks utility jar on the cluster.
- Restart the cluster.
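
  This can be done from the cluster UI or, as a sketch, with the Databricks CLI (assuming the CLI is installed and configured against the workspace):

  ```bash
  # Sketch: restart the cluster via the Databricks CLI.
  databricks clusters restart --cluster-id <databricks-cluster-id>
  ```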
- In guzzle.yml, set the database, spark and guzzle configs as follows:
  ```yaml
  database:
    type: jdbc
    properties:
      jdbc_url: jdbc:sqlserver://<sqlserver-host>;database=guzzle;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=30
      username: <database-username>
      password: <password>
  ...
  spark:
    run_mode: azure-databricks
    properties:
      api_url: https://<region>.azuredatabricks.net
      auth_token: <api-token>
      cluster_id: <databricks-cluster-id>
      dbfs_guzzle_dir: dbfs:/<directory where guzzlehome container is mounted on dbfs>
  ...
  ```
- Mount the guzzlehome container into the Linux file system using the following steps:
  ```bash
  wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb
  sudo dpkg -i packages-microsoft-prod.deb
  sudo apt-get update
  sudo apt-get -y install blobfuse
  echo "user_allow_other" | sudo tee -a /etc/fuse.conf
  sudo mkdir /mnt/blobfusetmp
  sudo chown <username> /mnt/blobfusetmp
  echo "accountName <account-name>
  accountKey <account-key>
  containerName guzzlehome" > /home/<username>/fuse_connection.cfg
  chmod 777 /home/<username>/fuse_connection.cfg
  sudo mkdir /guzzle
  sudo chmod 777 /guzzle
  sudo -H -u <username> bash -c "blobfuse /guzzle --tmp-path=/mnt/blobfusetmp -o allow_other --config-file=/home/<username>/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --file-cache-timeout-in-seconds=10"
  ```
- Run the Guzzle database initializer to generate the raw schema content for the guzzle database using the following command:
  ```bash
  java -cp /guzzle/libs/*:/guzzle/libs:/guzzle/bin/common.jar com.justanalytics.guzzle.common.DatabaseInitializer generate
  ```
- Modify the raw schema content as per SQL Server syntax (along with any other necessary changes) and execute it in the guzzle database.
- Follow the steps in https://github.com/ja-guzzle/docs/wikis/design-/Guzzle-UI-deployment-runbook to deploy the API and UI applications.
- To create a database in ADLS:
  ```sql
  create database demo location 'adl://guzzletest.azuredatalakestore.net/hive-data/demodb';
  create table demo.users ( id int, first_name string, last_name string, age decimal(2,0), created_time timestamp) partitioned by (instance_id bigint, system string, location string) stored as parquet;
  ```
- Initialize the Guzzle utility from a Databricks notebook:
  ```scala
  import com.justanalytics.guzzle.util.databricks.azure.GuzzleUtils
  val guzzle = new GuzzleUtils(<api-url>, <api-username>, dbutils.secrets.get(scope = "demoscope", key = <api-password>), <cluster-name>)
  ```
- Run a Guzzle job using the Guzzle Databricks utility:
  ```scala
  guzzle.runJob(<job-config-name>, <environment-name>, Map(<job-params-including-business-date>))
  ```