Azure Databricks - amitbhilagude/userfullinks GitHub Wiki
Overview
Control Plane
Managed in the Azure Databricks cloud
It contains the web app, notebooks, jobs, metastore, and cluster manager
Data Plane
Data sources
Cluster
Secure Integrations
Recommended Architecture
Azure Data Lake Storage Gen2, as it supports a hierarchical namespace and the Delta format
Power BI connects to Databricks for ad hoc analysis. The alternative is Data Lake to Azure Synapse Data Warehouse to Power BI, which is not recommended because you already have Databricks to Power BI
Azure Databricks writes to SQL DB or Cosmos DB, which is consumed by the web app
MLflow is natively available in Azure Databricks as well as in Azure Machine Learning.
Model serving has two options.
Serve from Databricks
Serve from AKS. This is a good option when Databricks should not be exposed to customers
Azure Data Lake Storage
Data Lake Storage holds raw-format data ingested by ADF or Fivetran
Databricks converts it into Delta format in a Bronze table. It is a good idea to keep the Bronze table separate from the raw data: other sources, e.g. real-time data, can also be written to Bronze by Databricks, which keeps them separate from the ADF raw data.
Databricks then refines tables from Bronze to Silver and from Silver to Gold. ML workloads and reports read the Gold data directly or through Unity Catalog
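The raw to Bronze to Silver to Gold flow above could look roughly like this in PySpark. This is a minimal sketch, not a definitive pipeline: the storage account `examplelake`, the container `lake`, the `orders` dataset, the column names, and all helper names are hypothetical, and it assumes a `spark` session such as the one a Databricks notebook provides.

```python
# Hypothetical names throughout: storage account, container, dataset, columns.
ACCOUNT = "examplelake"
CONTAINER = "lake"


def layer_path(layer: str, dataset: str) -> str:
    """Build an ADLS Gen2 (abfss) path for one layer of the lake."""
    return f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/{layer}/{dataset}"


def raw_to_bronze(spark, dataset: str) -> None:
    """Convert raw JSON files (landed by ADF or Fivetran) to Delta in Bronze."""
    raw = spark.read.format("json").load(layer_path("raw", dataset))
    raw.write.format("delta").mode("append").save(layer_path("bronze", dataset))


def bronze_to_silver(spark, dataset: str) -> None:
    """Clean Bronze data: drop duplicate orders and rows without an order id."""
    bronze = spark.read.format("delta").load(layer_path("bronze", dataset))
    silver = bronze.dropDuplicates(["order_id"]).na.drop(subset=["order_id"])
    silver.write.format("delta").mode("overwrite").save(layer_path("silver", dataset))


def silver_to_gold(spark, dataset: str) -> None:
    """Aggregate Silver data into a report-ready Gold table (orders per customer)."""
    silver = spark.read.format("delta").load(layer_path("silver", dataset))
    gold = silver.groupBy("customer_id").count()
    gold.write.format("delta").mode("overwrite").save(layer_path("gold", dataset))
```

In a notebook the three steps would be called in order, e.g. `raw_to_bronze(spark, "orders")`, then `bronze_to_silver(spark, "orders")`, then `silver_to_gold(spark, "orders")`.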
Workspace Setup
Subscription, resource group, and region should be the same
Name of the Workspace
Storage account for Data
Either use an existing VNet or a newly created VNet
The managed resource group created by Azure Databricks with VNet injection contains a managed identity and a storage account. It is managed by Databricks, so you will not be able to browse its resources or containers
Lakehouse
Stores data in Delta format, which has many benefits such as ACID transactions, time travel, and an audit log.
Photon: a query engine that speeds up queries on the data lake and can save a lot of cost, up to about 80%.
Databricks Serverless SQL is a newer offering where the data plane is managed by Databricks. It is a cost-optimisation option similar to a consumption plan
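The Delta time travel and audit log features noted above are exposed through SQL. Below is a small sketch that builds those statements; the helper names and the `sales.orders` table are hypothetical, and in a notebook each string would be passed to `spark.sql(...)`.

```python
def version_as_of(table: str, version: int) -> str:
    """Query a Delta table as it was at a given version number."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def timestamp_as_of(table: str, timestamp: str) -> str:
    """Query a Delta table as it was at a given timestamp."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def history(table: str) -> str:
    """List the table's commit history (the audit log of operations)."""
    return f"DESCRIBE HISTORY {table}"


# Example usage inside a notebook (table name is hypothetical):
#   spark.sql(history("sales.orders")).show()
#   spark.sql(version_as_of("sales.orders", 3)).show()
```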
Unity Catalog
A metastore that governs access to data stored in cloud storage.
It consists of
Catalog -> Schema -> Table/View/Functions
External location: pairs a cloud storage path with a storage credential so data can be stored in external locations
Storage credential: manages the credentials used to access cloud storage
Delta Sharing: a feature for sharing data outside the organisation
Permission management
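To make the Catalog -> Schema -> Table hierarchy and permission management concrete, here is a sketch that builds the three-level name and the grants a reader would typically need. The catalog `main`, schema `sales`, table `orders`, and group `analysts` are hypothetical, as are the helper names; each statement would be run with `spark.sql(...)` or in a SQL cell.

```python
def full_name(catalog: str, schema: str, obj: str) -> str:
    """Three-level Unity Catalog name: catalog.schema.object."""
    return f"{catalog}.{schema}.{obj}"


def read_grants(catalog: str, schema: str, table: str, principal: str) -> list:
    """Grants that let a principal read one table: access is needed at each level."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {full_name(catalog, schema, table)} TO `{principal}`",
    ]


for statement in read_grants("main", "sales", "orders", "analysts"):
    print(statement)
```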
Databricks Features
Data Engineering
Delta Live Tables
Workflows
Auto Loader
COPY INTO
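As an illustration of the two ingestion options in this list, the sketch below shows Auto Loader (the `cloudFiles` streaming source) next to the `COPY INTO` SQL command. Paths, table names, and helper names are hypothetical, JSON input is assumed, and a `spark` session is assumed to be passed in from a notebook or job.

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Options for the Auto Loader source; schemaLocation stores the inferred schema."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_with_auto_loader(spark, source_path: str, target_table: str, checkpoint_path: str):
    """Incrementally load new files from source_path into a Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options("json", checkpoint_path))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable(target_table)
    )


def copy_into_sql(target_table: str, source_path: str, file_format: str = "JSON") -> str:
    """Build a COPY INTO statement, the SQL alternative for batch file loads."""
    return f"COPY INTO {target_table} FROM '{source_path}' FILEFORMAT = {file_format}"
```

Auto Loader tracks already-processed files through the checkpoint, so it suits continuously arriving files, while `COPY INTO` is a simpler fit for scheduled batch loads.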
Data warehouse
dbt: the data build tool transforms data through models written as SELECT statements
Data streaming
Data Science and Machine Learning
MLflow
AutoML: no-code
MLOps
Tools
DBUtils: utilities available in Python, R, and Scala notebooks
Databricks CLI: Command line interface
Databricks APIs
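As an example of the Databricks APIs entry, the sketch below prepares a call to the Jobs API list endpoint using only the Python standard library. The workspace URL and token are placeholders you would supply yourself, and the helper names are hypothetical.

```python
import json
import urllib.request


def jobs_list_request(workspace_url: str, token: str, limit: int = 20) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Jobs API list endpoint."""
    url = f"{workspace_url.rstrip('/')}/api/2.1/jobs/list?limit={int(limit)}"
    headers = {"Authorization": f"Bearer {token}"}  # personal access token
    return urllib.request.Request(url, headers=headers, method="GET")


def list_jobs(workspace_url: str, token: str) -> list:
    """Send the request and return the jobs array from the JSON response."""
    with urllib.request.urlopen(jobs_list_request(workspace_url, token)) as response:
        return json.load(response).get("jobs", [])
```

The Databricks CLI wraps the same REST endpoints, so the choice between the two is mostly about whether the caller is a script or a person at a terminal.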
Databricks notebook
Similar to a Jupyter notebook; exported notebook archives use the .dbc extension.
Magic commands start with %. They let you switch the language of a cell (for example %sql) and run another notebook (%run)
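To show how the % commands look in practice, here is a sketch of a Python notebook exported as a source file, where each magic line is stored behind a `# MAGIC` comment and cells are separated by `# COMMAND ----------`. The `./shared/setup` notebook path and the cell contents are hypothetical.

```python
# Databricks notebook source
# MAGIC %md
# MAGIC Notes for this notebook, rendered as Markdown.

# COMMAND ----------

# MAGIC %run ./shared/setup

# COMMAND ----------

# MAGIC %sql
# MAGIC SELECT current_date()

# COMMAND ----------

# A plain cell in the notebook's default language (Python here).
greeting = "hello from the default language"
print(greeting)
```

Inside the notebook UI the same cells simply start with `%md`, `%run`, or `%sql`; the `# MAGIC` prefix only appears in the exported file.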