Deploying Azure Marketplace Offering Gen 1
- Overview
- Network_Architecture_for_Guzzle_on_Azure
- Pre-requisite
- Must to know
- Deployment Steps
- Basic Info
- VM Details and Managed Identity
- Guzzle Settings
- Databricks Settings
- Review and Create
- Upon completion of Deployment
- What will you see when its deploying Marketplace offer
- In Guzzle VM
- In Databricks
- In Storage Account
- In SQL Server databases
- Once everything is set
- What happens when we redo Marketplace deployment using same Azure resources multiple times
- Atlas Issues
- Update the Atlas startup command as per below
- Using ADLSv2 as Storage for Delta and Hive tables and usage of Databricks Secrets
- Securing Guzzle Deployment
- VM and SSO
- Stop all the services
- Verify no guzzle services are running except blobfuse using following
- Create a new user guzzle. Provide relevant details prompted including a complex password for this account.
- Change the ownership of /opt/apache-atlas-2.0.0 and /opt/elasticsearch-6.2.4 to guzzle:guzzle
- Delete spring.log
- Start Guzzle service
- Azure Databricks security
- The Guzzle Marketplace offer is out. The URL to the offer is here: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/justanalytics.guzzle-databricks?tab=Overview
- Here is a quick video of how to get this all working
Ensure you have the following resources. The list is also captured at: https://github.com/ja-guzzle/docs/-/wikis/Design/Azure-Marketplace-Offering-Gen-1#solution-template
- An empty resource group created in one of your existing subscriptions. This subscription is treated as the primary subscription.
- A managed identity created in the tenant (Azure Active Directory).
- A storage account and container - this has to be blob storage (and not ADLS/hierarchical namespace). The managed identity in step 2 should have full Owner permission on this storage account (not just the container). Also, the blob storage should be open to access from all networks during deployment (network restrictions can be applied later; see Securing Guzzle Deployment).
- A SQL Server database to host the Guzzle repository. Only native/local accounts are supported (Azure AD accounts are not supported).
- A separate SQL Server database to host the Databricks metastore. Only native/local accounts are supported (Azure AD accounts are not supported).
- Both SQL Server databases can be on the same or different servers.
- Databricks workspace name, region, organization ID and access token.
- A service principal / app registration to support single sign-on: https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app (this involves a step in Azure AD to update the redirect URL). **Note:** If you don't have the service principal / app registration, just put any arbitrary string for the client ID and client secret. You can always update this later with the correct values for SSO to work.
- Don't use special characters in the passwords, especially $ and |.
- Ensure the Azure SQL server's firewall rules allow all Azure resources to connect to it (connections to Azure SQL Server are made from both Databricks and the Guzzle VM); an Azure CLI sketch is shown after this list.
- The Guzzle storage account and Guzzle VM should be in the same subscription (the setup script takes the VM's subscription and uses that to mount the blob storage).
- The password for the external metastore for Databricks is stored as plain text in the Guzzle configs. It is recommended to switch to the internal metastore once the Guzzle Marketplace offer is up and running.
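If you prefer the Azure CLI over the portal for the SQL firewall setting mentioned above, here is a minimal sketch (the resource group and server name are placeholders you must substitute):

# The AllowAllWindowsAzureIps rule with the 0.0.0.0 range is the special rule that
# corresponds to "Allow Azure services and resources to access this server" in the portal
az sql server firewall-rule create \
  --resource-group <your-resource-group> \
  --server <your-sql-server-name> \
  --name AllowAllWindowsAzureIps \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 0.0.0.0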
- The deployment progresses. The setup script can take up to 10 minutes as it copies 500 MB+ of files over to the blob store.
- You will see the Python setup script running.
- All the logs go into this folder on the VM: /var/lib/waagent/custom-script/download/0/. stderr and stdout hold all the key information (see the tail example after the listing below).
demoadmin@guzzlemp2vm:~$ sudo bash
root@guzzlemp2vm:~# cd /var/lib/waagent/custom-script/download/0/
root@guzzlemp2vm:/var/lib/waagent/custom-script/download/0# ls -ltrh
total 28K
drwxr-xr-x 2 root root 4.0K Apr 10 07:41 scripts
drwxr-xr-x 2 root root 4.0K Apr 10 07:42 logs
-rw------- 1 root root 3.7K Apr 10 07:42 stderr
-rw------- 1 root root 16K Apr 10 07:46 stdout
root@guzzlemp2vm:/var/lib/waagent/custom-script/download/0# date
Fri Apr 10 07:51:56 UTC 2020
root@guzzlemp2vm:/var/lib/waagent/custom-script/download/0#
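To follow the setup progress live while the deployment is still running, you can tail these files (standard Linux commands, using the same paths as above):

sudo bash
cd /var/lib/waagent/custom-script/download/0/
# stdout shows the main setup steps; stderr surfaces any failures
tail -f stdout stderr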
- Finally, once setup is complete you should see the following processes running in the Guzzle VM (the sudo --preserve-env=PATH -HEu demoadmin wrapper processes may be missing and can be ignored):
- blobfuse to mount the Guzzle home blob container to the VM
- Elasticsearch used by Atlas
- Node server (the Guzzle web server)
- Elasticsearch used by Guzzle
- Atlas Java process (along with its embedded HBase and Solr processes)
- Guzzle API process
root@guzzlemp2vm:/guzzle/logs# ps -ef | grep "[g]uzzle\|[n]ode\|[e]lastic\|[a]tlas"
root 1323 1 1 04:10 ? 00:03:20 blobfuse /guzzle --tmp-path=/mnt/blobfusetmp -o allow_other --config-file=/root/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --file-cache-timeout-in-seconds=10
root 1326 1 0 04:10 ? 00:00:00 sudo --preserve-env=PATH -HEu demoadmin bash -c /opt/elasticsearch-6.2.4/bin/elasticsearch
root 1473 1 0 04:10 ? 00:00:00 sudo --preserve-env=PATH -HEu demoadmin bash -c java -Dloader.path=/guzzle/api/libs -jar api-0.0.1-SNAPSHOT.jar
demoadm+ 1478 1473 0 04:10 ? 00:02:02 java -Dloader.path=/guzzle/api/libs -jar api-0.0.1-SNAPSHOT.jar
demoadm+ 1543 1326 0 04:10 ? 00:01:49 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.1UkoCBcT -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.path.home=/opt/elasticsearch-6.2.4 -Des.path.conf=/opt/elasticsearch-6.2.4/config -cp /opt/elasticsearch-6.2.4/lib/* org.elasticsearch.bootstrap.Elasticsearch
root 1776 1 0 04:10 ? 00:00:00 node /opt/node-v6.14.2-linux-x64/bin/http-server -p 8082 --push-state --ssl --cert /certs/cert.pem --key /certs/privatekey_new.pem .
demoadm+ 1944 1 0 04:10 ? 00:00:00 bash /opt/apache-atlas-2.0.0/hbase/bin/hbase-daemon.sh --config /opt/apache-atlas-2.0.0/hbase/conf foreground_start master
demoadm+ 1961 1944 0 04:10 ? 00:01:36 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/opt/apache-atlas-2.0.0/hbase/bin/../logs -Dhbase.log.file=hbase-demoadmin-master-guzzlemp2vm.log -Dhbase.home.dir=/opt/apache-atlas-2.0.0/hbase/bin/.. -Dhbase.id.str=demoadmin -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
demoadm+ 2092 1 0 04:10 ? 00:00:43 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xms512m -Xmx512m -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:-OmitStackTraceInFastThrow -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/opt/apache-atlas-2.0.0/solr/server/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=15000 -DzkHost=localhost:2181 -Dsolr.log.dir=/opt/apache-atlas-2.0.0/solr/server/logs -Djetty.port=9838 -DSTOP.PORT=8838 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/apache-atlas-2.0.0/solr/server -Dsolr.solr.home=/opt/apache-atlas-2.0.0/solr/server/solr -Dsolr.data.home= -Dsolr.install.dir=/opt/apache-atlas-2.0.0/solr -Dsolr.default.confdir=/opt/apache-atlas-2.0.0/solr/server/solr/configsets/_default/conf -Xss256k -Dsolr.jetty.https.port=9838 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/opt/apache-atlas-2.0.0/solr/bin/oom_solr.sh 9838 /opt/apache-atlas-2.0.0/solr/server/logs -jar start.jar --module=http
demoadm+ 3029 1 1 04:11 ? 00:03:36 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Datlas.log.dir=/opt/apache-atlas-2.0.0/logs -Datlas.log.file=application.log -Datlas.home=/opt/apache-atlas-2.0.0 -Datlas.conf=/opt/apache-atlas-2.0.0/conf -Xmx1024m -Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server -classpath /opt/apache-atlas-2.0.0/conf:/opt/apache-atlas-2.0.0/server/webapp/atlas/WEB-INF/classes:/opt/apache-atlas-2.0.0/server/webapp/atlas/WEB-INF/lib/*:/opt/apache-atlas-2.0.0/libext/*:/opt/apache-atlas-2.0.0/hbase/conf org.apache.atlas.Atlas -app /opt/apache-atlas-2.0.0/server/webapp/atlas
- In the Databricks workspace you will see a new analytics cluster called guzzle-config created (if there is an existing one, it continues to remain, as clusters have a unique ID underneath and are not matched by name).
- You will also notice that the Guzzle home has been mounted in the Databricks workspace (mounts in Databricks are at the workspace level and NOT at the cluster level).
In the Databricks workspace you can create a sample notebook, run it against guzzle-config (or any analytics cluster), and ensure you see the Guzzle home mounted:
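For example, a minimal notebook cell (a %sh shell cell, assuming the default /mnt/guzzle mount point created by the deployment):

%sh
# List the Guzzle home through the DBFS FUSE path; you should see the folders
# copied by the setup script (api, conf, logs, web, etc.)
ls -l /dbfs/mnt/guzzle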
- In the storage account you will see new files added in the container which hosts the Guzzle home.
- The Guzzle repository and Guzzle API tables should show up in the Guzzle repository database. The ones below show up first, as the Guzzle repository gets initialized.
And the highlighted ones below show up when the API comes up:
- Hive metastore tables should show up in the external metastore.
Before, when there are no tables in this database:
After the job starts (the steps to run the job are listed below):
- Launch the Guzzle URL
- Then log in
- Run the sample job from here
- This brings up a job cluster. It also initializes the Hive metastore, as the first job runs using the external metastore.
- On successful completion the job should show up
- If the blob storage container contains existing Guzzle files, a fresh Marketplace deployment will simply overwrite them.
- If the Databricks workspace already contains the guzzle-config cluster, it is ignored and a fresh one is created. Any existing mount of the Guzzle blob storage container on /mnt/guzzle in that workspace is unmounted, and a new mount is done pointing to the details given in the Marketplace wizard.
- If Guzzle repository tables are already present in the Azure SQL database, the deployment of the Guzzle repository is skipped (no cleanup is required, unless the database contains tables from an older version of a Guzzle deployment).
- If Hive metastore tables are already present in the Azure SQL database used as the external metastore for Databricks, the step of deploying fresh metastore tables is skipped (you don't need any cleanup).
The following are fixes for issues in Apache Atlas.
1. By default the Marketplace deployment starts Atlas as root. We need to modify the Guzzle startup script to change this to a non-root account. We can use the account used during VM creation (demoadmin or the appropriate user). Once this is changed, restart the Guzzle VM.
sudo bash
chown demoadmin:demoadmin -R /opt/apache-atlas-2.0.0
vi /opt/guzzlescript/guzzle-startup-script.sh
# Update the Atlas startup command as per below
nohup sudo --preserve-env=PATH -HEu demoadmin bash -c "/opt/apache-atlas-2.0.0/bin/atlas_start.py" > /guzzle/logs/atlas.out &
2. Edit the /guzzle/conf/atlas.yml file to remove the section below entirely:
hive:
jdbc_url: jdbc:spark:....
The revised file should no longer contain the hive: section. Restart the Guzzle VM from the Azure Portal and wait 10 minutes for all the services to come up.
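If you prefer, the restart can also be done with the Azure CLI (the resource group and VM name are placeholders):

az vm restart --resource-group <your-resource-group> --name <guzzle-vm-name>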
3. Atlas sync only supports certain sources
- Atlas sync is currently only supported for local file system, Hive, Delta and JDBC sources. It does not support any of the cloud file systems.
- In Databricks the Guzzle home is mounted under /dbfs/mnt/guzzle, while in the Guzzle VM it is mounted under /guzzle.
- When referring to the Guzzle home and the source files in it, one has to use /dbfs/mnt/guzzle.
- Create a symbolic link in the Guzzle VM that points to the same LFS (local file system) path as Databricks by running the commands below (the first command switches to root):
demoadmin@guzzlemp2vm:/guzzle/test-data$ sudo bash
root@guzzlemp2vm:/guzzle/test-data# mkdir -p /dbfs/mnt/guzzle
root@guzzlemp2vm:/guzzle/test-data# cd /dbfs/mnt/guzzle/
root@guzzlemp2vm:/dbfs/mnt/guzzle# ln -s /guzzle/test-data
Note: This step is not required if the blob storage where the source files are present is mounted at the same directory on the Guzzle VM and Databricks.
4. Wildcards (multiple files) are not supported when retrieving the column list for Atlas sync
- In the job config, use specific file names like users2.csv instead of user*.csv.
5. Update the user and password for ph_hive (or the appropriate Hive or Delta physical connection) with the username "token" and the password set to an API access token retrieved from the Databricks workspace.
6. Run the script to create the Guzzle metadata types (only needed when you don't see Guzzle metatypes like guzzle_dataset, guzzle_process, etc. in Atlas):
sudo bash
/opt/guzzlescript/create-atlas-types.sh
- Create an ADLS storage account and file system.
- Grant the Storage Blob Data Contributor permission to one of the service principals (you can use the existing service principal that is being used for Azure AD SSO).
- Create an Azure Key Vault to store the secrets (in this case the service principal client ID and client secret); follow the link here: https://docs.microsoft.com/en-us/azure/key-vault/quick-create-portal or the video.
- Create a Databricks secret scope backed by Azure Key Vault (follow the video) or this link: https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope (a CLI sketch is shown below).
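If you prefer the CLI over the UI for this step, a sketch using the Databricks CLI (this flavour requires the CLI to be authenticated with an Azure AD token rather than a personal access token; the scope name matches the one used in the mount example below, and the Key Vault values are placeholders):

databricks secrets create-scope --scope guzzlevm2scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name> \
  --dns-name https://<vault-name>.vault.azure.net/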
- Mount this storage account on the Databricks workspace under /mnt/data. We will use a plain notebook to supply the service principal secrets; I have used a Python notebook to do it. Notice that I have used Databricks secrets backed by Azure Key Vault:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope = "guzzlevm2scope", key = "guzzlemv2spclient"),
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "guzzlevm2scope", key = "guzzlemv2spclientsec"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/bc5c7327-b12e-48db-a856-175591ecd2f0/oauth2/token"}
dbutils.fs.mount(
source = "abfss://datalake1@adlsv2guzzlecommon.dfs.core.windows.net/data",
mount_point = "/mnt/data",
extra_configs = configs)
- Verify the ADLS mount is working fine and you are able to list the files. Do take note that the DBFS mount is /mnt/data, which implies the mount on Unix (the local file system representation) will be /dbfs/mnt/data. For example:
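A quick check from a %sh notebook cell (assuming the /mnt/data mount point from the previous step):

%sh
# The DBFS mount /mnt/data is visible under /dbfs on the driver's local file system
ls -l /dbfs/mnt/data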
- Now, create a database "test" in Databricks and point to /mnt/data/test as the location (do take note that if the folder test is not present, Databricks will create it when creating the database). With this, all the tables in database "test" shall go into the folder /mnt/data/test in DBFS, which in turn goes into the datalake1 container (aka filesystem) in the ADLS Gen2 storage account adlsv2guzzlecommon. Also create a sample table and insert some data into it:
%sql
create database test location '/mnt/data/test';
use test;
create table t1(i int);
insert into t1 values(1);
When going back to the ADLS file explorer you should notice a new sub-folder test, with a child folder t1 inside it, confirming that the table has been correctly created in the ADLS storage.
- Go to Guzzle and define a new logical and physical endpoint pointing to this new database "test". You can create both Hive and Delta endpoints against the same database "test", and the tables will be stored as Delta or Hive tables based on which endpoint is used as the target.
- Enable Azure SSO for Guzzle
- Ensure the redirect URL is set in the Authentication tab of the App Registration: "https://<<hostname[.southeastasia.cloudapp.azure.com]>>:8082/oauth/microsoft"
- Add your Azure AD account with the admin role. Put a password which is complex or random, as you will not need it. Make sure you are able to log in through SSO.
- Delete the native admin account from the Guzzle repository. Log in to the SQL server and run the commands below against the Guzzle repository database:
delete from user_authorities where user_id = 1;
delete from users where id = 1;
- Enable Firewalls for Guzzle VM
- Guzzle VM: only open selected ports for external traffic (9090, 8082, 21000 and 22); an Azure CLI sketch is shown after this list.
- Outbound traffic will have the following rules; internet access will be required (until private endpoints can be used for all the Azure PaaS services).
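As a sketch of the inbound restriction referenced above, using the Azure CLI against the VM's network security group (resource group and NSG name are placeholders):

# Allow only the listed ports: 8082 (Guzzle web), 9090, 21000 (Atlas) and 22 (SSH)
az network nsg rule create \
  --resource-group <your-resource-group> \
  --nsg-name <guzzle-vm-nsg> \
  --name allow-guzzle-inbound \
  --priority 100 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 22 8082 9090 21000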
- Run the Guzzle services from a non-root account which has no sudo access. Here is what is required to use a new account "guzzle" to run all the Guzzle services. Do take note that the original account used for VM creation is what should be used to log in to the VM when required, so that the appropriate commands can be run with sudo permission:
sudo bash
# Stop all the services
kill -9 `ps aux | grep '[a]pi-0.0.1-SNAPSHOT.jar' | awk '{print $2}'`
kill -9 `ps aux | grep '[e]lasticsearch-6.2.4' | awk '{print $2}'`
kill -9 `ps aux | grep '[h]ttp-server' | awk '{print $2}'`
kill -9 `ps aux | grep '[a]tlas' | awk '{print $2}'`
# Verify no guzzle services are running except blobfuse using following
ps -ef | grep "[g]uzzle\|[n]ode\|[e]lastic\|[a]tlas"
# Create a new user guzzle. Provide relevant details prompted including a complex password for this account.
adduser guzzle
# Change the ownership of /opt/apache-atlas-2.0.0 and /opt/elasticsearch-6.2.4 to guzzle:guzzle
chown -R guzzle:guzzle /opt/apache-atlas-2.0.0
chown -R guzzle:guzzle /opt/elasticsearch-6.2.4
- Update /opt/guzzlescript/guzzle-startup-script.sh so that all the services are started using the guzzle account. You can take a backup using the command "cp /opt/guzzlescript/guzzle-startup-script.sh /opt/guzzlescript/guzzle-startup-script.sh.ori". The revised script should look like this:
echo "guzzle startup script execution started"
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GUZZLE_HOME=/guzzle
export PATH=/opt/guzzlescript/:/opt/elasticsearch-6.2.4/bin:/opt/node-v6.14.2-linux-x64/bin:/opt/spark-2.4.5-bin-hadoop2.7/bin:$PATH
export MANAGE_LOCAL_HBASE=true
export MANAGE_LOCAL_SOLR=true
blobfuse /guzzle --tmp-path=/mnt/blobfusetmp -o allow_other --config-file=/root/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --file-cache-timeout-in-seconds=10
nohup sudo --preserve-env=PATH -HEu guzzle bash -c "/opt/elasticsearch-6.2.4/bin/elasticsearch" > /guzzle/logs/elasticsearch.out &
cd /guzzle/api/
nohup sudo --preserve-env=PATH -HEu guzzle bash -c "java -Dloader.path=/guzzle/api/libs -jar api-0.0.1-SNAPSHOT.jar" > /dev/null &
cd /guzzle/web/
nohup sudo --preserve-env=PATH -HEu guzzle bash -c "http-server -p 8082 --push-state --ssl --cert /certs/cert.pem --key /certs/privatekey_new.pem" . > /guzzle/logs/web.out &
if [[ "yes" == "yes" ]]; then
nohup sudo --preserve-env=PATH -HEu guzzle bash -c "/opt/apache-atlas-2.0.0/bin/atlas_start.py" > /guzzle/logs/atlas.out &
fi
echo "guzzle startup script execution completed"
sudo bash
# Delete spring.log
rm /tmp/spring.log
# Start Guzzle service
/opt/guzzlescript/guzzle-startup-script.sh
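Once the script finishes, you can confirm the services are now running under the guzzle account (same check as earlier; blobfuse still runs as root):

ps -ef | grep "[g]uzzle\|[n]ode\|[e]lastic\|[a]tlas"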
- Change the permission of the fuse config to 400
sudo bash
cd /root/
chmod 400 fuse_connection.cfg
ls -ltrh
- It is recommended to restart the Guzzle VM, which should restart all the services using the newly created guzzle account.
- Enable network security for PaaS services - Azure SQL Server.
- Enable network security for PaaS services - storage accounts (blob, and ADLS if you are using it); a sketch is shown below.
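For example, restricting the storage account to selected networks can be sketched with the Azure CLI (resource group, account, VNet and subnet names are placeholders; the subnet needs the Microsoft.Storage service endpoint enabled, and the Guzzle VM's subnet must be allowed before denying public access, otherwise blobfuse will lose access to the Guzzle home):

# Allow the Guzzle VM's subnet, then deny everything else by default
az storage account network-rule add \
  --resource-group <your-resource-group> \
  --account-name <your-storage-account> \
  --vnet-name <guzzle-vnet> --subnet <guzzle-subnet>
az storage account update \
  --resource-group <your-resource-group> \
  --name <your-storage-account> \
  --default-action Deny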
- Working with Databricks Secrets