Guzzle on Azure Databricks - ja-guzzle/guzzle_docs GitHub Wiki

Table of Contents

Overview

Azure resources setup

  1. Azure Blob storage to host the Guzzle home
  2. Azure ADLSv2 to store the target tables
  3. Databricks Workspace and mount Guzzle home
  4. Azure SQL Server DB and SQL Server Instance
  5. VM To run Guzzle API and UI application

Guzzle Setup

  1. Download Guzzle release bundle
  2. Download JDBC Driver for SQL Server
  3. Fuse for mounting blob to Oracle
  4. Guzzle configuration changes
  5. Test sample guzzle job

Azure resources setup

1. Create Azure Blob storage to host the Guzzle home

Storage Account Name: testguzzleblob

Create the container to store the Guzzle home in this storage account:

a. Go to the storage account and click on Blobs.

b. Enter the container name guzzlehome and press OK.

c. Retrieve the access key for storage account testguzzleblob.
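The access key can also be fetched from the command line instead of the portal. The following is a minimal sketch, assuming the Azure CLI (`az`) is installed and logged in. The helper name `get_storage_key` and the resource group argument are hypothetical, since this guide does not name a resource group.

```shell
# Hypothetical helper: print the first access key of a storage account.
# Assumes the Azure CLI is installed and `az login` has been run.
# The resource group is an assumption; this guide does not name one.
get_storage_key() {
  account="$1"
  resource_group="$2"
  az storage account keys list \
    --account-name "$account" \
    --resource-group "$resource_group" \
    --query "[0].value" \
    --output tsv
}

# Example (replace the resource group with your own):
#   get_storage_key testguzzleblob my-resource-group
```

The same helper applies to the testguzzleadlsv2 account used in step 2.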

2. Create Azure ADLSv2 to store the target tables

Storage Account Name: testguzzleadlsv2


Click Next to go to Advanced settings and enable "Hierarchical namespace".

Then click Review and Create.

Create the container to store the data in this storage account:

a. Go to storage account testguzzleadlsv2 and click on "Data Lake Gen2 file systems".

b. Enter the file system name data and press OK.

c. Retrieve the access key for storage account testguzzleadlsv2. This is required when creating the Databricks workspace cluster.

3. Create Databricks Workspace

a. Create the Databricks workspace. Enter all the details as highlighted below:

image

b. Go to the "testguzzle" workspace, click "Launch Workspace", and create the "test" cluster with the settings below:

image

Go to Advanced settings and enter the following. On the Spark tab, set the Spark config entry with the access key for testguzzleadlsv2 (see step 2.c) and the environment variables:

fs.azure.account.key.testguzzleadlsv2.dfs.core.windows.net qdYM4RrzuhPTx9AQ+vuRQO9+o3xOmmka/9cWuVUOwC+SBCA16hSF8H/xwemEIEvEGbshajm7Nt4Q1dfahzRoTQ==

PYSPARK_PYTHON=/databricks/python3/bin/python3
GUZZLE_HOME=/dbfs/mnt/guzzle

c. Click on Create Cluster. It will take up to 1 minute to create the cluster.

d. Launch a notebook and create the following cells:

dbutils.fs.mount(
  source = "wasbs://guzzlehome@testguzzleblob.blob.core.windows.net/",
  mountPoint = "/mnt/guzzle",
  extraConfigs = Map("fs.azure.account.key.testguzzleblob.blob.core.windows.net" -> "z7nxS6O2WVGeHfjMRvjVYeQ3P5t+qBSiA1r2gI0cOP3JiLQl03mX27ZPvuZBBJZOabNwUwfDSlE03uQeukgq9Q=="))

Test that the mounted directory is working:

%sh
ls -ltrh /dbfs/mnt/guzzle
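Listing the directory only shows that the path exists. A write and read-back check gives a stronger signal that the mount is usable. The following is a minimal sketch; the helper name `check_mount` and the probe file name are hypothetical and not part of Guzzle or Databricks.

```shell
# Hypothetical helper: verify a directory exists and supports write + read-back.
check_mount() {
  dir="$1"
  probe="$dir/.mount_probe_$$"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if ! echo "ok" > "$probe"; then
    echo "not writable: $dir"
    return 1
  fi
  content="$(cat "$probe")"
  rm -f "$probe"              # clean up the probe file
  if [ "$content" = "ok" ]; then
    echo "mount ok: $dir"
  else
    echo "read-back failed: $dir"
    return 1
  fi
}

# Example in a %sh cell:
#   check_mount /dbfs/mnt/guzzle
```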

Create the database to store the target tables. Its storage location shall be ADLSv2 (the keys were added to the Spark config in step 3.b above).

%sql
create database demo location 'abfss://data@testguzzleadlsv2.dfs.core.windows.net/hive/demo';
create table t1(i int);
insert into t1 values(1);

Create the init script file, which Databricks requires to correctly capture Guzzle logs for the driver program:

%sh
echo "log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
log4j.logger.com.justanalytics=INFO, RollingAppender
" > /dbfs/guzzle-log4j.properties

mkdir -p /dbfs/databricks/initscript
echo "#!/bin/bash
cat /dbfs/guzzle-log4j.properties >> /databricks/spark/dbconf/log4j/driver/log4j.properties
cat /dbfs/guzzle-log4j.properties >> /databricks/spark/dbconf/log4j/executor/log4j.properties
" > /dbfs/databricks/initscript/init.sh

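The init script above is produced with a multi-line `echo`. An equivalent way to generate it is a here-document, which keeps the script body readable and lets the paths be passed as parameters. This is a sketch under that assumption; the function name `write_init_script` is hypothetical.

```shell
# Hypothetical helper: generate the cluster init script with a here-document.
# $1 = target init script path, $2 = path of the log4j properties fragment.
write_init_script() {
  target="$1"
  props="$2"
  mkdir -p "$(dirname "$target")"
  # Unquoted EOF so that $props is expanded when the script is generated.
  cat > "$target" <<EOF
#!/bin/bash
cat $props >> /databricks/spark/dbconf/log4j/driver/log4j.properties
cat $props >> /databricks/spark/dbconf/log4j/executor/log4j.properties
EOF
}

# Example in a %sh cell, using the paths from this guide:
#   write_init_script /dbfs/databricks/initscript/init.sh /dbfs/guzzle-log4j.properties
```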

Update the cluster Init Scripts to include the init script generated by the notebook above. The script path has to be set to dbfs:/databricks/initscript/init.sh. The cluster has to be restarted after that.

4. Azure SQL Server DB and SQL Server Instance

Db name: testguzzledb
Server: testguzzledbserver (the full host name shall be: testguzzledbserver.database.windows.net)
User: demo
Password: Admin@123

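The server and database names above are what the SQL Server JDBC driver needs in its connection URL. The sketch below assembles such a URL in the standard `jdbc:sqlserver://host:port;databaseName=...` form with the default port 1433. The helper name `build_jdbc_url` is hypothetical, and where exactly Guzzle expects this URL is not stated in this guide.

```shell
# Hypothetical helper: build a SQL Server JDBC URL for an Azure SQL database.
# $1 = server name (without domain), $2 = database name, $3 = port (default 1433).
build_jdbc_url() {
  server="$1"
  database="$2"
  port="${3:-1433}"
  printf 'jdbc:sqlserver://%s.database.windows.net:%s;databaseName=%s\n' \
    "$server" "$port" "$database"
}

# Example with the names from this guide:
#   build_jdbc_url testguzzledbserver testguzzledb
```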

5. VM To run Guzzle API and UI application

a. Create VM

VM Name: testguzzlevm
Hostname: testguzzlevm (the full host name shall be: testguzzlevm.eastasia.cloudapp.azure.com)
User: demo
Password: Admin@123456

b. Configure the domain name for the VM:

image

image

c. Enable network access (open all ports so that they are accessible from the JA VM, i.e. the JA public IP).

Guzzle Setup

1. Fuse for mounting blob to Oracle

a. Login to the Guzzle VM with the demo user account:

Host: testguzzlevm.eastasia.cloudapp.azure.com
User: demo
Password: Admin@123456

When prompted for password, enter the demo account password above

b. Install the fuse package

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
sudo yum install blobfuse fuse -y
echo "user_allow_other" | sudo tee -a /etc/fuse.conf
sudo mkdir /mnt/blobfusetmp
sudo chown <username> /mnt/blobfusetmp

c. Mount the guzzle home

echo "accountName testguzzleblob
accountKey z7nxS6O2WVGeHfjMRvjVYeQ3P5t+qBSiA1r2gI0cOP3JiLQl03mX27ZPvuZBBJZOabNwUwfDSlE03uQeukgq9Q==
containerName guzzlehome" > /home/demo/fuse_connection.cfg


chmod 777 /home/demo/fuse_connection.cfg
sudo mkdir /guzzle
sudo chmod 777 /guzzle
sudo -H -u demo bash -c "blobfuse /guzzle --tmp-path=/mnt/blobfusetmp -o allow_other --config-file=/home/demo/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --file-cache-timeout-in-seconds=10"
cd /guzzle
echo "test" >a.log
ls -ltrh
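The connection file holds the storage account key, so it may be worth writing it with owner-only permissions and without putting the key inline in the command. The following is a minimal sketch of that variant; the helper name `write_fuse_config` and the `STORAGE_KEY` environment variable are hypothetical.

```shell
# Hypothetical helper: write a blobfuse connection file readable only by its owner.
# $1 = config path, $2 = storage account, $3 = access key, $4 = container name.
write_fuse_config() {
  cfg="$1"
  account="$2"
  key="$3"
  container="$4"
  (
    umask 077   # new file is created without group/other access
    printf 'accountName %s\naccountKey %s\ncontainerName %s\n' \
      "$account" "$key" "$container" > "$cfg"
  )
  chmod 600 "$cfg"   # also tighten permissions if the file already existed
}

# Example, assuming the key was exported beforehand as STORAGE_KEY:
#   write_fuse_config /home/demo/fuse_connection.cfg testguzzleblob "$STORAGE_KEY" guzzlehome
```

The blobfuse command shown above can then point to this file with `--config-file` unchanged, provided it runs as the user that owns the file.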

2. Download Guzzle release bundle

a. Login to the VM.

b. Download the package using the commands below:

cd /guzzle
wget -q https://guzzlesa.blob.core.windows.net/guzzle-release/guzzle-0.7.34.tar.gz
tar xzf guzzle-0.7.34.tar.gz --strip-components 1
cd libs
wget -O mssql-jdbc-6.1.0.jre8.jar "https://guzzlesa.blob.core.windows.net/guzzle-release/mssql-jdbc-6.1.0.jre8.jar?sv=2018-03-28&ss=bqtf&srt=sco&sp=rwdlacup&se=2019-05-22T12:48:38Z&sig=gUNYA948mxn5rXzZG2yyv6yLdXppXmZpkPpy7jO9%2Bb8%3D&_=1558500986434"
mkdir ../api/libs
cp mssql-jdbc-6.1.0.jre8.jar ../api/libs
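Because the download and the extraction must refer to the same file, it can help to derive both names from one variable and to fail early if the tarball is missing. This is a sketch under that assumption; `extract_release` and `GUZZLE_VERSION` are hypothetical names, not part of the Guzzle bundle.

```shell
# Hypothetical helper: extract a release tarball into a target directory,
# dropping the top-level folder of the archive (as --strip-components 1 does above).
extract_release() {
  tarball="$1"
  target="$2"
  if [ ! -f "$tarball" ]; then
    echo "tarball not found: $tarball" >&2
    return 1
  fi
  mkdir -p "$target"
  tar xzf "$tarball" -C "$target" --strip-components 1
}

# Example, keeping the version in a single place:
#   GUZZLE_VERSION=0.7.34
#   wget -q "https://guzzlesa.blob.core.windows.net/guzzle-release/guzzle-$GUZZLE_VERSION.tar.gz"
#   extract_release "guzzle-$GUZZLE_VERSION.tar.gz" /guzzle
```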

3. Guzzle configuration changes

  1. Login to the VM and export GUZZLE_HOME:
export GUZZLE_HOME=/guzzle
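An `export` typed in a session is lost at logout. If the variable should survive new logins, one option is to append it to a shell profile exactly once. The sketch below does that; persisting the variable is an assumption beyond what this guide states, and `persist_env_var` is a hypothetical helper name.

```shell
# Hypothetical helper: append "export NAME=VALUE" to a profile file
# unless an export for NAME is already present (safe to run repeatedly).
persist_env_var() {
  file="$1"
  name="$2"
  value="$3"
  touch "$file"
  if ! grep -q "^export $name=" "$file"; then
    echo "export $name=$value" >> "$file"
  fi
}

# Example:
#   persist_env_var "$HOME/.bashrc" GUZZLE_HOME /guzzle
```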



4. Test sample guzzle job