Deploying Guzzle on Azure - ja-guzzle/guzzle_docs GitHub Wiki

Overview

Guzzle is deployed on an Azure VM. The deployment sets up the Guzzle API app and web app, Guzzle core (binaries, config folders and third-party libs) and third-party components, namely: Atlas, Elasticsearch and a Node server.

Guzzle provides two broad mechanisms to set up a fresh Guzzle environment on Azure:

  1. Marketplace offer (still in the testing phase)
  2. Manual install of Guzzle on a VM

For Guzzle architecture and overview, refer to: https://github.com/ja-guzzle/docs/wikis/Documentation/Guzzle-Overview

**Note:** This page contains HDInsight details for some items and will be updated further if a customer uses HDInsight as the Spark environment instead of Azure Databricks. For now, the HDInsight details can be ignored.

Reference Architecture for deployment

The attached deck (DaiChi_Architecture.pptx) can be used to discuss the deployment architecture for Guzzle and how to secure the network for the various resources.

It is important to agree the approach with the customer before embarking on the deployment.

Pre-Requisites

| Sr. No. | Resource type | Sub type | Optional | Purpose | Access credential | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Storage Account | Blob Storage (create a dedicated container) | No | Hosting Guzzle Home, which is then mounted on the Guzzle VM and the Spark environment (Databricks or HDInsight) | SAS keys, managed identity, access keys (preferred) or service principal | This blob is mounted in two places: on the Guzzle VM using blobfuse, and on Databricks using a DBFS mount (https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html). On HDInsight it has to be blobfuse again. |
| 2 | Storage Account | ADLS Gen2 Storage (create a dedicated filesystem) | Yes | Hosting Hive and Delta tables for the target data mart | Service principal | ADLS provides the optimal storage layer for big data analytics use cases. The hierarchical namespace is enabled at the storage account level – an existing or a new account can be used. Grant the service principal the Storage Blob Data Contributor role on this storage account or container. |
| 3 | Storage Account | Blob Storage (create a dedicated container) | Yes | Hosting the landing area for incoming files | Access keys | This can be the same or a separate storage account. ADLS Gen2 can also be used for the landing area, but it has known issues that make it a less preferred choice (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues). |
| 4 | Virtual Machine | DSv3 (prefer a dedicated VM) | Yes | Guzzle VM for running Guzzle services | User/password or SSH keys | VNet, public IP, NSG etc. can be as per customer requirements (see Reference Architecture for deployment above). Storage and OS can be as per customer expectation. Recommended: Ubuntu 18.04 LTS as the OS image; the disk can be standard SSD or premium. |
| 5 | Active Directory | Service principal | Yes | Service principal to access both Blob and ADLS storage and to configure SSO for the Guzzle web UI | Client ID and client secret | See https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-create-service-principals?view=azs-1908. Leave the redirect URL blank for now (to be updated later when SSO is integrated). |
| 6 | Active Directory | Managed identity | Yes | Managed identity to securely access the storage account from the VM | Managed identity name and resource group | Optional; used to avoid access keys when accessing the storage account. |
| 7 | Network | VNet | Yes | Securing the Guzzle VM and other PaaS services (Databricks, storage account and metadata repository) | N/A | The customer can use an existing VNet or create a new one. For Databricks, see https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html and https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview. More on securing PaaS services: https://github.com/ja-guzzle/docs/wikis/Azure/Network-Security. At minimum, keep the Guzzle VM accessible only from within the VNet via a jumphost, or use an NSG to secure it from outside. |
| 8 | Databricks | Standard workspace | No | Spark environment for running Guzzle jobs | Access tokens (https://docs.databricks.com/dev-tools/api/latest/authentication.html) | The customer can use an existing workspace or a new one. A Standard tier workspace does not allow securing access to tables via SQL only (https://docs.databricks.com/administration-guide/access-control/table-acls/index.html); if the customer plans to expose Databricks directly via notebooks and wants RBAC to grant users access to selected tables, the Premium tier is required. For Azure costing, size up a Data Engineering cluster for the expected hours: https://azure.microsoft.com/en-us/pricing/details/databricks/ |
| 9 | Database | Azure SQL | Yes | Hosting the Guzzle repository database | Native user/password; AAD accounts can also be used for Azure SQL | Optional: the customer can instead use the bundled MySQL Community Edition for the Guzzle repository. However, if the customer wants to use this repository for reporting via Power BI – crucial for customers using the Recon and DQ monitoring features – Azure SQL is recommended. When using the bundled MySQL, it has to be securely opened up to Power BI. |
| 10 | Database | Azure SQL | Yes | Hosting reporting cache or aggregate tables for the DirectQuery model of Power BI, given that Hive/Delta tables on Databricks currently provide slower interactive SQL performance | Native user/password; AAD accounts can also be used for Azure SQL | Optional: the customer can add an extra hop in the data architecture where the final aggregated tables are hosted on Azure SQL or SQL Data Warehouse and exposed through DirectQuery data sources in Power BI. |
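
As a concrete illustration of the blobfuse mount described in the first row above, the sketch below writes a blobfuse credentials file and shows the mount command. All names (storage account, container, key and paths) are hypothetical placeholders, and the mount itself is left commented out because it requires blobfuse to be installed and a real storage account.

```shell
# Hypothetical names -- substitute real values from the storage account.
STORAGE_ACCOUNT="guzzlestore"
CONTAINER="guzzle-home"
ACCOUNT_KEY="<access-key-from-portal>"   # left as a placeholder on purpose

# blobfuse reads its credentials from a config file; keep it readable
# only by the mounting user.
mkdir -p /tmp/guzzle-demo
cat > /tmp/guzzle-demo/fuse_connection.cfg <<EOF
accountName ${STORAGE_ACCOUNT}
accountKey ${ACCOUNT_KEY}
containerName ${CONTAINER}
EOF
chmod 600 /tmp/guzzle-demo/fuse_connection.cfg

# Actual mount (requires blobfuse and a real account; run as root):
# mkdir -p /mnt/guzzle-home /mnt/blobfusetmp
# blobfuse /mnt/guzzle-home --tmp-path=/mnt/blobfusetmp \
#   --config-file=/tmp/guzzle-demo/fuse_connection.cfg \
#   -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
echo "wrote /tmp/guzzle-demo/fuse_connection.cfg"
```

On Databricks the same container is instead attached with a DBFS mount, per the documentation linked in the table.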

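The service principal pre-requisite above can be created with the Azure CLI. The sketch below is a hedged example: the subscription ID, resource group and storage account names are placeholders, and the `az` command is shown commented because it needs a real subscription; the script only assembles the role-assignment scope.

```shell
# Hypothetical values -- replace with the customer's real identifiers.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="guzzle-rg"
STORAGE_ACCOUNT="guzzlestore"
SP_NAME="guzzle-sp"

# Scope the role assignment to the storage account only.
SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.Storage/storageAccounts/${STORAGE_ACCOUNT}"

# Create the service principal and grant it Storage Blob Data Contributor
# on that scope (uncomment against a real subscription):
# az ad sp create-for-rbac --name "${SP_NAME}" \
#   --role "Storage Blob Data Contributor" \
#   --scopes "${SCOPE}"
echo "role-assignment scope: ${SCOPE}"
```

The `az ad sp create-for-rbac` command prints a client ID and client secret, which are the credentials listed in the table; the redirect URL for SSO is configured later on the app registration.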
Other aspects

  1. Keep separate resources and storage accounts for non-prod and prod
  2. Have a naming convention for resources. A wiki page with links to standard recommendations has been created at https://github.com/ja-guzzle/docs/wikis/Azure/Best-Practices-for-Creating-Resources; the VN team is requested to fill it in with a proposal.
  3. Enforce access control and keep resources secured.
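
For point 3, one common pattern (consistent with the VNet row in the pre-requisites table) is an NSG rule that allows SSH to the Guzzle VM only from a jumphost. The sketch below uses hypothetical resource names, and the `az` command is commented out since it requires a real subscription.

```shell
# Hypothetical names -- replace with the real resource group, NSG and jumphost IP.
RESOURCE_GROUP="guzzle-rg"
NSG_NAME="guzzle-vm-nsg"
JUMPHOST_IP="10.0.1.4"
RULE_NAME="allow-ssh-from-jumphost"

# Allow SSH only from the jumphost; other inbound traffic is blocked by
# the default NSG deny rules (uncomment against a real subscription):
# az network nsg rule create \
#   --resource-group "${RESOURCE_GROUP}" \
#   --nsg-name "${NSG_NAME}" \
#   --name "${RULE_NAME}" \
#   --priority 100 \
#   --access Allow --protocol Tcp --direction Inbound \
#   --source-address-prefixes "${JUMPHOST_IP}" \
#   --destination-port-ranges 22
echo "rule ${RULE_NAME}: allow tcp/22 from ${JUMPHOST_IP}"
```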