Azure Databricks - amitbhilagude/userfullinks GitHub Wiki

  1. Overview
    1. Control Plane
      1. Managed by Azure Databricks in its own cloud subscription
      2. Contains the web app, notebooks, jobs, metastore and cluster manager
    2. Data Plane
      1. Data sources
      2. Cluster
      3. Secure Integrations
    3. Recommended Architecture
      1. Azure Data Lake Storage Gen2, as it supports a hierarchical namespace and the Delta format
      2. Power BI connects to Databricks for ad hoc analysis. The alternative, Data Lake to Azure Synapse data warehouse to Power BI, is not recommended because Databricks already connects to Power BI directly
      3. Azure Databricks writes to SQL DB or Cosmos DB, which a web app then consumes
      4. MLflow is natively available in Azure Databricks as well as in Azure Machine Learning.
      5. Model serving has two options.
        1. Serve from Databricks
        2. Serve from AKS. This is a good option when Databricks should not be exposed to customers
    4. Azure Data Lake Storage
      1. Data Lake Storage holds raw-format data ingested by ADF or Fivetran
      2. Databricks converts it into Delta format in a Bronze table. It is a good idea to keep the Bronze table separate from the raw data: Databricks can also write other sources, e.g. real-time data, into Bronze, which keeps them separate from the raw data landed by ADF.
      3. Databricks then transforms the tables from Bronze to Silver and from Silver to Gold. ML workloads and reports read the Gold data directly or through Unity Catalog
    5. Workspace Setup
      1. Subscription, Resource Group and Region should be the same
      2. Name of the Workspace
      3. Storage account for Data
      4. Either use an existing VNet or create a new one
      5. With VNet injection, Azure Databricks creates a managed resource group that contains a managed identity and a storage account. Because it is managed by Databricks, you will not be able to view the resources or storage containers inside it
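
The raw to Bronze to Silver to Gold flow described under Azure Data Lake Storage can be sketched in plain Python. This is only an in-memory illustration: in Databricks each layer would be a Delta table processed with Spark, and the record fields, the `adf` source tag and the function names here are hypothetical.

```python
# Hypothetical sample of raw records as an ingestion tool might land them.
RAW_RECORDS = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "not-a-number"},
    {"order_id": "3", "customer": "alice", "amount": "4.50"},
]


def to_bronze(raw_rows, source):
    """Bronze: keep every record unchanged and tag it with its source."""
    return [{**row, "_source": source} for row in raw_rows]


def to_silver(bronze_rows):
    """Silver: clean and type the data, dropping rows that fail validation."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip(),
            "amount": amount,
        })
    return silver


def to_gold(silver_rows):
    """Gold: aggregate into a report-ready shape (total amount per customer)."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


bronze = to_bronze(RAW_RECORDS, source="adf")
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping Bronze as an untouched, source-tagged copy means a second source (for example a streaming feed) could be appended with a different tag without changing the Silver and Gold steps.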
  2. Lakehouse
    1. Stores data in Delta format, which has many benefits such as ACID transactions, time travel and an audit log.
    2. Photon: a query engine that speeds up queries on the data lake, with reported cost savings of up to about 80%.
    3. Databricks Serverless SQL is a newer offering where the data plane is managed by Databricks. It is a cost-optimisation option, similar to a consumption plan
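
As a rough illustration of Delta time travel and the audit log, the helper below builds the SQL that Delta Lake accepts for reading an older snapshot (`VERSION AS OF` / `TIMESTAMP AS OF`) and for listing a table's change history (`DESCRIBE HISTORY`). The table name `sales.orders_bronze` and the helper names are hypothetical; in a notebook the resulting strings would typically be passed to `spark.sql(...)`.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a SELECT that reads an earlier snapshot of a Delta table."""
    # Exactly one of version / timestamp must be supplied.
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def history_query(table):
    """Build the statement that lists the table's commit history (audit log)."""
    return f"DESCRIBE HISTORY {table}"


# Hypothetical table name, used only for illustration.
by_version = time_travel_query("sales.orders_bronze", version=3)
by_time = time_travel_query("sales.orders_bronze", timestamp="2024-01-01")
history = history_query("sales.orders_bronze")
print(by_version)
print(by_time)
print(history)
```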
  3. Unity Catalog
    1. A metastore that governs access to data stored in cloud storage.
    2. It consists of
      1. Catalog -> Schema -> Table/View/Functions
      2. External location: lets data be stored in external cloud storage locations
      3. Storage credential: manages the credentials used to access the cloud storage
      4. Delta Sharing: allows sharing data outside the organisation
      5. Permission management
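
The Catalog -> Schema -> Table hierarchy and permission management above can be sketched as follows. The helpers build a three-level table name and the `GRANT` statements a reader typically needs (usage on the catalog and schema, plus `SELECT` on the table). The catalog, schema, table and group names are hypothetical, and the helpers are only an illustration, not part of any Databricks library.

```python
def full_table_name(catalog, schema, table):
    """Unity Catalog addresses tables with a three-level name."""
    return f"{catalog}.{schema}.{table}"


def read_grants(catalog, schema, table, principal):
    """Statements that let a principal query a single table."""
    name = full_table_name(catalog, schema, table)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {name} TO `{principal}`",
    ]


# Hypothetical object and group names, used only for illustration.
statements = read_grants("main", "sales", "orders_gold", "analysts")
for statement in statements:
    print(statement)
```

In a notebook each statement could be run with `spark.sql(statement)` or pasted into a SQL cell.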
  4. Databricks Features
    1. Data Engineering
      1. Delta Live Tables
      2. Workflows
      3. Auto Loader
      4. COPY INTO
    2. Data warehouse
      1. dbt (data build tool): transforms data by means of SQL SELECT statements
    3. Data streaming
    4. Data Science and Machine Learning
      1. MLflow
      2. AutoML: no-code
      3. MLOps
  5. Tools
    1. DBUtils (dbutils): available in Python, R and Scala notebooks
    2. Databricks CLI: Command line interface
    3. Databricks APIs
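
As a small illustration of the Databricks REST APIs, the sketch below builds (but does not send) a request to the clusters list endpoint, authenticated with a personal access token as a bearer token. The workspace URL and token are placeholders, and the helper name is hypothetical.

```python
import urllib.request


def build_list_clusters_request(workspace_url, token):
    """Prepare a GET request for the clusters list endpoint."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder workspace URL and token, used only for illustration.
request = build_list_clusters_request(
    "https://adb-1234567890123456.7.azuredatabricks.net/",
    "dapi-example-token",
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; the CLI counterpart is `databricks clusters list`.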
  6. Databricks notebook
    1. Similar to a Jupyter notebook; notebooks can be exported as a .dbc archive.
    2. Databricks magic commands start with %. They let you change the language of a cell (e.g. %sql, %python) or run another notebook (%run)
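
To show how the % commands look in practice, here is a sketch of a Python notebook as it appears when exported as a source file: cells are separated by `# COMMAND ----------` lines and magic commands are carried as `# MAGIC` comments. The variable, the query and the `./shared_setup` notebook path are hypothetical.

```python
# Databricks notebook source
# Cell 1: runs in the notebook's default language (Python here).
row_count = 3
print(f"rows: {row_count}")

# COMMAND ----------

# MAGIC %sql
# MAGIC -- Cell 2: the %sql magic switches this cell to SQL.
# MAGIC SELECT 1 AS example_value

# COMMAND ----------

# MAGIC %run ./shared_setup
```

Outside Databricks the magic lines are plain comments, so the file still runs as ordinary Python; inside a notebook the second cell runs as SQL and the third runs the referenced notebook.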