# Securing Data Lakes underpinned by Databricks and ADLS Gen2
- Objective
- Test 1: Using two workspaces (premium tier for analytics and standard tier) sharing the same catalog
  - Details
  - Verification
- Test 2
  - Details
  - Verification
- Test 3
  - Details
## Objective

Secure the data lake built on ADLS Gen2, with Databricks serving as the SQL engine, the notebook environment for users to explore the data, and the data-engineering compute that runs the data pipelines.

The requirements are:
- Data access control goes through a single channel: SQL/table access.
- Direct access to ADLS is not provided at all; where it is provided, it is granted granularly via the SQL layer.
- General-purpose file sharing should be done via a separate storage container in ADLS or Blob Storage, using the standard access controls provided by Azure (for ADLS this means ACLs plus RBAC at the storage-account and container level).
- Flexibility to use a "premium tier" workspace to run SQL and analytics workloads, and a standard-tier workspace with job clusters for direct access.
Advanced security scenarios include:

- Network security to restrict access to ADLS Gen2, the JDBC connection to Databricks, and the Databricks workspace (DB notebooks) to specific network zones.
- Preventing data download from selected databases/tables.
- Column- and row-level security, for now assumed to be implemented using views (a sketch follows this list).
- Opening up ADLS to other channels such as SQL Warehouse has to be restricted via a service principal with a read-only account; the SQL warehouse in turn applies its own security mechanisms for data access control.
- Compare and contrast accessing data in the data lake from Power BI using pass-through authentication versus a service principal configured on the connection, and the implications of each.
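As a rough illustration of the view-based approach mentioned above, the sketch below exposes only selected columns and rows through a view and grants users the view instead of the underlying table. The table, column, and user names are hypothetical placeholders, not objects from this test setup.

```python
# Hypothetical sketch: column- and row-level security via a view.
# Table, column and user names are placeholders.
spark.sql("""
    CREATE VIEW IF NOT EXISTS securedb.sales_restricted AS
    SELECT order_id, region, amount   -- column-level: sensitive columns simply omitted
    FROM securedb.sales
    WHERE region = 'APAC'             -- row-level: only APAC rows are exposed
""")

# Users receive SELECT on the view only, never on the underlying table.
spark.sql("GRANT SELECT ON VIEW securedb.sales_restricted TO `analyst@example.com`")
```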
## Test 1: Using two workspaces (premium tier for analytics and standard tier) sharing the same catalog

Two workspaces, one premium and one standard, point to the same external metastore; verify that this works for both Hive and Delta tables. The idea is to use job clusters on the standard workspace to run the jobs, and to have an analytics/interactive cluster running on the premium workspace so end users can go to a notebook and query.
### Details

- Databricks workspace (standard): guzzledemows; cluster (Standard, no restrictions): test-ac-external-metastore
- Databricks workspace (premium): accesscontroltest; cluster (High Concurrency, table ACLs enabled, SQL/PySpark only): test-py-restrictions

Both clusters are configured with the same external Hive metastore (via the clusters' Spark configuration), and the same ADLS Gen2 storage is mounted on the same DBFS path in both workspaces, along the lines of the sketch below.
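A minimal sketch of the mount, assuming the service principal's credentials are kept in a secret scope; the storage account, container, tenant, scope, and mount-point names are placeholders rather than the actual values used in this test:

```python
# Sketch: mount an ADLS Gen2 container on DBFS using a service principal.
# Storage account, container, tenant ID and secret scope/keys are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="adls-sp", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="adls-sp", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Run once per workspace so that table LOCATIONs resolve to the same path everywhere.
dbutils.fs.mount(
    source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```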
### Verification

Using the test-ac-external-metastore cluster and a Scala notebook (the equivalent statements are sketched after this list):

- Created two databases (demodb, demodb2) on ADLS Gen2.
- Created a Delta table delta1 and a Hive table hive1 in both databases.
- Inserted 2 records into every table from the notebook (INSERT statements).
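A sketch of those statements, issued here through spark.sql from Python for consistency with the rest of this page (the original used a Scala notebook); the database locations and column schema are assumptions:

```python
# Sketch of the setup statements; the database locations and column schema are assumptions.
for db in ("demodb", "demodb2"):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db} LOCATION '/mnt/datalake/{db}'")

    # One Delta table and one plain (non-Delta) Hive table per database.
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db}.delta1 (id INT, name STRING) USING DELTA")
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db}.hive1 (id INT, name STRING) USING PARQUET")

    # Two sample records in each table.
    spark.sql(f"INSERT INTO {db}.delta1 VALUES (1, 'alpha'), (2, 'beta')")
    spark.sql(f"INSERT INTO {db}.hive1 VALUES (1, 'alpha'), (2, 'beta')")
```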
Using the test-py-restrictions cluster and a Python notebook (a sketch of the grants follows this list):

- Added a non-admin user who does not have permission to create new clusters and is only allowed to attach Python notebooks to permitted clusters.
- Granted SELECT on only the demodb database to the non-admin user.
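A minimal sketch of the grant, run as an admin on the table-ACL-enabled cluster; the user name is a placeholder:

```python
# Run as an admin on the table-ACL-enabled cluster; the user name is a placeholder.
user = "`nonadmin@example.com`"

# SELECT on demodb only; nothing is granted on demodb2, so its tables stay inaccessible.
spark.sql(f"GRANT SELECT ON DATABASE demodb TO {user}")

# On newer runtimes the USAGE privilege on the database is also required.
spark.sql(f"GRANT USAGE ON DATABASE demodb TO {user}")

# Confirm what the user has been granted.
spark.sql(f"SHOW GRANT {user} ON DATABASE demodb").show(truncate=False)
```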
## Test 2

Is there a better way to manage access control between ADLS and the tables? One option is credential pass-through: when a user accesses a table, his credentials are passed through to read the data from ADLS (which implies he must have access in ADLS to the files of those tables). The alternative is to use a fixed service principal to mount ADLS as DBFS and use that mount when creating the databases. A sketch of a pass-through mount follows.
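For comparison with the service-principal mount shown under Test 1, this is roughly how an ADLS Gen2 mount with credential pass-through is created (container and mount point are placeholders). It requires a premium workspace and a pass-through-enabled cluster, and each reader's own Azure AD identity is evaluated against the ADLS ACLs:

```python
# Sketch: ADLS Gen2 mount with Azure AD credential pass-through.
# Each user's own identity is used against ADLS, so users need ACLs on the data they read.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

dbutils.fs.mount(
    source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake-passthrough",
    extra_configs=configs,
)
```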
## Test 3

For applying Table Access Control so that all the Delta Lake and Hive tables on ADLS are restricted using GRANT/REVOKE commands (https://docs.databricks.com/administration-guide/admin-settings/table-acls/table-acl.html), we are planning to use the approach below:

- Create an ADLS Gen2 mount that is readable only via a service principal (client ID and secret).
- The service principal's credentials are not visible to users.
- The service principal is used to restrict direct access to the storage, so the only way to reach the data is through tables governed by GRANT/REVOKE.
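A minimal sketch of that approach, assuming an admin-only secret scope named adls-sp (a placeholder): the mount is created with the service principal's credentials exactly as in the Test 1 sketch, users cannot read the scope (and secret values are redacted in notebook output), and table access is then governed purely by GRANT/REVOKE/DENY.

```python
# Sketch, assuming an admin-only secret scope "adls-sp"; all names are placeholders.
# Users cannot read the scope, so the service principal's credentials stay hidden
# even though every cluster shares the resulting mount.
client_id = dbutils.secrets.get(scope="adls-sp", key="client-id")
client_secret = dbutils.secrets.get(scope="adls-sp", key="client-secret")
# ...mount /mnt/datalake with these credentials as shown under Test 1...

# With no direct ADLS access, the only path to the data is through the metastore,
# where table ACLs apply.
user = "`nonadmin@example.com`"
spark.sql(f"GRANT SELECT ON DATABASE demodb TO {user}")
spark.sql(f"REVOKE ALL PRIVILEGES ON DATABASE demodb2 FROM {user}")
spark.sql(f"DENY SELECT ON DATABASE demodb2 TO {user}")
```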