Deploying Guzzle on Azure - ja-guzzle/guzzle_docs GitHub Wiki

Overview

Guzzle is deployed on an Azure VM. The deployment sets up the Guzzle API app and web app, Guzzle core (binaries, config folders and third-party libs) and third-party components, namely: Atlas, Elasticsearch and a Node server.

Guzzle provides two broad mechanisms to set up a fresh Guzzle environment on Azure:

  1. Marketplace offer (still in the testing phase)
  2. Manual install of Guzzle on a VM

For Guzzle architecture and overview, refer to: https://github.com/ja-guzzle/docs/wikis/Documentation/Guzzle-Overview

**Note:** This page contains HDInsight details for some items and will be updated further if a customer uses HDInsight as the Spark environment instead of Azure Databricks. For now, the HDInsight details can be ignored.

Reference Architecture for deployment

The attached deck (DaiChi_Architecture.pptx) can be used to discuss the deployment architecture for Guzzle and how to secure the network for the various resources.

It is important to agree the approach with the customer before embarking on the deployment.

Pre-Requisites

| Sr. No. | Resource type | Sub type | Optional | Purpose | Access credential | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Storage Account | Blob Storage (create a dedicated container) | No | Hosting Guzzle Home, which is then mounted on the Guzzle VM and the Spark environment (Databricks or HDInsight) | SAS keys, managed identity, access keys (preferred) or service principal | This blob is mounted in two places: on the Guzzle VM using blobfuse, and on Databricks using a DBFS mount (https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html). On HDInsight it has to be blobfuse again. |
| 2 | Storage Account | ADLS Gen2 Storage (create a dedicated filesystem) | Yes | Hosting Hive and Delta tables for the target data mart | Service principal | ADLS provides the optimal storage layer for big data analytics use cases. The hierarchical namespace is enabled at the storage account level – an existing or a new account can be used. Grant the service principal the Storage Blob Data Contributor role on this storage account or container. |
| 3 | Storage Account | Blob Storage (create a dedicated container) | Yes | Hosting the landing area for incoming files | Access keys | This can be the same or a separate storage account. ADLS Gen2 can also be used for the landing area, but it has known issues that make it a less preferred choice (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues). |
| 4 | Virtual Machine | DSv3 (prefer a dedicated VM) | Yes | Guzzle VM for running Guzzle services | User/password or SSH keys | VNet, public IP, NSG etc. can be as per customer requirements (see Reference Architecture for deployment above). Storage and OS can be as per customer expectation. Recommended: Ubuntu 18.04 LTS as the OS image; the disk can be standard SSD or premium. |
| 5 | Active Directory | Service principal | Yes | Service principal to access both Blob and ADLS storage and to configure SSO for the Guzzle web UI | Client ID and client secret | See https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-create-service-principals?view=azs-1908. Leave the redirect URL blank for now (to be updated later when SSO is integrated). |
| 6 | Active Directory | Managed identity | Yes | Managed identity to securely access the storage account from the VM | Managed identity name and resource group | Optional; used to avoid access keys when accessing the storage account. |
| 7 | Network | VNet | Yes | Securing the Guzzle VM and other PaaS services (Databricks, storage account and metadata repository) | N/A | The customer can use an existing VNet or create a new one. For Databricks, see https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html and https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview. More on securing PaaS services: https://github.com/ja-guzzle/docs/wikis/Azure/Network-Security. At minimum, keep the Guzzle VM accessible only from within the VNet via a jumphost, or use an NSG to secure it from outside. |
| 8 | Databricks | Standard workspace | No | Spark environment for running Guzzle jobs | Access tokens (https://docs.databricks.com/dev-tools/api/latest/authentication.html) | The customer can use an existing workspace or a new one. A Standard tier workspace does not allow securing access to tables via SQL only (https://docs.databricks.com/administration-guide/access-control/table-acls/index.html); if the customer plans to expose Databricks directly via notebooks and wants RBAC to grant users access to selected tables, the Premium tier is required. For Azure costing, size up a Data Engineering cluster for the expected hours: https://azure.microsoft.com/en-us/pricing/details/databricks/ |
| 9 | Database | Azure SQL | Yes | Hosting the Guzzle repository database | Native user/password; AAD accounts can also be used for Azure SQL | Optional: the customer can instead use the bundled MySQL Community Edition for the Guzzle repository. However, if the customer wants to use this repository for reporting via Power BI – crucial for customers using the Recon and DQ monitoring features – Azure SQL is recommended. When using the bundled MySQL, it has to be securely opened up to Power BI. |
| 10 | Database | Azure SQL | Yes | Hosting reporting cache or aggregate tables for the DirectQuery model of Power BI, given that Hive/Delta tables on Databricks currently provide slower interactive SQL performance | Native user/password; AAD accounts can also be used for Azure SQL | Optional: the customer can add an extra hop in the data architecture where the final aggregated tables are hosted on Azure SQL or SQL Data Warehouse and exposed through DirectQuery data sources in Power BI. |
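
As a concrete illustration of the blobfuse mount described in the first row above, the sketch below writes a blobfuse credentials file and shows the mount command. All names (storage account, container, key and paths) are hypothetical placeholders, and the mount itself is left commented out because it requires blobfuse to be installed and a real storage account.

```shell
# Hypothetical names -- substitute real values from the storage account.
STORAGE_ACCOUNT="guzzlestore"
CONTAINER="guzzle-home"
ACCOUNT_KEY="<access-key-from-portal>"   # left as a placeholder on purpose

# blobfuse reads its credentials from a config file; keep it readable
# only by the mounting user.
mkdir -p /tmp/guzzle-demo
cat > /tmp/guzzle-demo/fuse_connection.cfg <<EOF
accountName ${STORAGE_ACCOUNT}
accountKey ${ACCOUNT_KEY}
containerName ${CONTAINER}
EOF
chmod 600 /tmp/guzzle-demo/fuse_connection.cfg

# Actual mount (requires blobfuse and a real account; run as root):
# mkdir -p /mnt/guzzle-home /mnt/blobfusetmp
# blobfuse /mnt/guzzle-home --tmp-path=/mnt/blobfusetmp \
#   --config-file=/tmp/guzzle-demo/fuse_connection.cfg \
#   -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
echo "wrote /tmp/guzzle-demo/fuse_connection.cfg"
```

On Databricks the same container is instead attached with a DBFS mount, per the documentation linked in the table.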

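The service principal pre-requisite above can be created with the Azure CLI. The sketch below is a hedged example: the subscription ID, resource group and storage account names are placeholders, and the `az` command is shown commented because it needs a real subscription; the script only assembles the role-assignment scope.

```shell
# Hypothetical values -- replace with the customer's real identifiers.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="guzzle-rg"
STORAGE_ACCOUNT="guzzlestore"
SP_NAME="guzzle-sp"

# Scope the role assignment to the storage account only.
SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.Storage/storageAccounts/${STORAGE_ACCOUNT}"

# Create the service principal and grant it Storage Blob Data Contributor
# on that scope (uncomment against a real subscription):
# az ad sp create-for-rbac --name "${SP_NAME}" \
#   --role "Storage Blob Data Contributor" \
#   --scopes "${SCOPE}"
echo "role-assignment scope: ${SCOPE}"
```

The `az ad sp create-for-rbac` command prints a client ID and client secret, which are the credentials listed in the table; the redirect URL for SSO is configured later on the app registration.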
Other aspects

  1. Keep separate resources and storage accounts for non-prod and prod
  2. Have a naming convention for resources. A wiki page with links to standard recommendations has been created at https://github.com/ja-guzzle/docs/wikis/Azure/Best-Practices-for-Creating-Resources; the VN team is requested to fill it in with a proposal.
  3. Enforce access control and keep resources secured.
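
For point 3, one common pattern (consistent with the VNet row in the pre-requisites table) is an NSG rule that allows SSH to the Guzzle VM only from a jumphost. The sketch below uses hypothetical resource names, and the `az` command is commented out since it requires a real subscription.

```shell
# Hypothetical names -- replace with the real resource group, NSG and jumphost IP.
RESOURCE_GROUP="guzzle-rg"
NSG_NAME="guzzle-vm-nsg"
JUMPHOST_IP="10.0.1.4"
RULE_NAME="allow-ssh-from-jumphost"

# Allow SSH only from the jumphost; other inbound traffic is blocked by
# the default NSG deny rules (uncomment against a real subscription):
# az network nsg rule create \
#   --resource-group "${RESOURCE_GROUP}" \
#   --nsg-name "${NSG_NAME}" \
#   --name "${RULE_NAME}" \
#   --priority 100 \
#   --access Allow --protocol Tcp --direction Inbound \
#   --source-address-prefixes "${JUMPHOST_IP}" \
#   --destination-port-ranges 22
echo "rule ${RULE_NAME}: allow tcp/22 from ${JUMPHOST_IP}"
```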