# Securing Data Lakes underpinned by Databricks and ADLS Gen2
- Objective
- Test 1: Using two workspaces (premium tier for analytics and standard tier) sharing the same catalog
  - Details
  - Verification
- Test 2
  - Details
  - Verification
- Test 3
  - Details
## Objective

Secure the data lake built on ADLS Gen2, with Databricks serving as the SQL engine, the notebook environment for users to explore the data, and the data-engineering compute that runs the data pipelines.

The requirements are:
- Data access control goes through a single channel: SQL/table access.
- Direct access to ADLS is not provided at all; where it is provided, it is granted granularly via the SQL layer.
- General-purpose file sharing should be done via a separate storage container in ADLS or Blob Storage, using the standard access controls provided by Azure (for ADLS this means ACLs plus RBAC at the storage-account and container level).
- Flexibility to use a "premium tier" workspace to run SQL and analytics workloads, and a standard-tier workspace with job clusters for direct access.
Advanced security scenarios include:

- Network security to restrict access to ADLS Gen2, the JDBC connection to Databricks, and the Databricks workspace (DB notebooks) to specific network zones.
- Preventing data download from selected databases/tables.
- Column- and row-level security, for now assumed to be implemented using views (a sketch follows this list).
- Opening up ADLS to other channels such as SQL Warehouse has to be restricted via a service principal with a read-only account; the SQL warehouse in turn applies its own security mechanisms for data access control.
- Compare and contrast accessing data in the data lake from Power BI using pass-through authentication versus a service principal configured on the connection, and the implications of each.
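As a rough illustration of the view-based approach mentioned above, the sketch below exposes only selected columns and rows through a view and grants users the view instead of the underlying table. The table, column, and user names are hypothetical placeholders, not objects from this test setup.

```python
# Hypothetical sketch: column- and row-level security via a view.
# Table, column and user names are placeholders.
spark.sql("""
    CREATE VIEW IF NOT EXISTS securedb.sales_restricted AS
    SELECT order_id, region, amount   -- column-level: sensitive columns simply omitted
    FROM securedb.sales
    WHERE region = 'APAC'             -- row-level: only APAC rows are exposed
""")

# Users receive SELECT on the view only, never on the underlying table.
spark.sql("GRANT SELECT ON VIEW securedb.sales_restricted TO `analyst@example.com`")
```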
## Test 1: Using two workspaces (premium tier for analytics and standard tier) sharing the same catalog

Two workspaces, one premium and one standard, point to the same external metastore; verify that this works for both Hive and Delta tables. The idea is to use job clusters on the standard workspace to run the jobs, and to have an analytics/interactive cluster running on the premium workspace so end users can go to a notebook and query.
### Details

- Databricks workspace (standard): guzzledemows; cluster (Standard, no restrictions): test-ac-external-metastore
- Databricks workspace (premium): accesscontroltest; cluster (High Concurrency, table ACLs enabled, SQL/PySpark only): test-py-restrictions

Both clusters are configured with the same external Hive metastore (via the clusters' Spark configuration), and the same ADLS Gen2 storage is mounted on the same DBFS path in both workspaces, along the lines of the sketch below.
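A minimal sketch of the mount, assuming the service principal's credentials are kept in a secret scope; the storage account, container, tenant, scope, and mount-point names are placeholders rather than the actual values used in this test:

```python
# Sketch: mount an ADLS Gen2 container on DBFS using a service principal.
# Storage account, container, tenant ID and secret scope/keys are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="adls-sp", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="adls-sp", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Run once per workspace so that table LOCATIONs resolve to the same path everywhere.
dbutils.fs.mount(
    source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```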
### Verification

Using the test-ac-external-metastore cluster and a Scala notebook (the equivalent statements are sketched after this list):

- Created two databases (demodb, demodb2) on ADLS Gen2.
- Created a Delta table delta1 and a Hive table hive1 in both databases.
- Inserted 2 records into every table from the notebook (INSERT statements).
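A sketch of those statements, issued here through spark.sql from Python for consistency with the rest of this page (the original used a Scala notebook); the database locations and column schema are assumptions:

```python
# Sketch of the setup statements; the database locations and column schema are assumptions.
for db in ("demodb", "demodb2"):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db} LOCATION '/mnt/datalake/{db}'")

    # One Delta table and one plain (non-Delta) Hive table per database.
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db}.delta1 (id INT, name STRING) USING DELTA")
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db}.hive1 (id INT, name STRING) USING PARQUET")

    # Two sample records in each table.
    spark.sql(f"INSERT INTO {db}.delta1 VALUES (1, 'alpha'), (2, 'beta')")
    spark.sql(f"INSERT INTO {db}.hive1 VALUES (1, 'alpha'), (2, 'beta')")
```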
Using the test-py-restrictions cluster and a Python notebook (a sketch of the grants follows this list):

- Added a non-admin user who does not have permission to create new clusters and is only allowed to attach Python notebooks to permitted clusters.
- Granted SELECT on only the demodb database to the non-admin user.
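A minimal sketch of the grant, run as an admin on the table-ACL-enabled cluster; the user name is a placeholder:

```python
# Run as an admin on the table-ACL-enabled cluster; the user name is a placeholder.
user = "`nonadmin@example.com`"

# SELECT on demodb only; nothing is granted on demodb2, so its tables stay inaccessible.
spark.sql(f"GRANT SELECT ON DATABASE demodb TO {user}")

# On newer runtimes the USAGE privilege on the database is also required.
spark.sql(f"GRANT USAGE ON DATABASE demodb TO {user}")

# Confirm what the user has been granted.
spark.sql(f"SHOW GRANT {user} ON DATABASE demodb").show(truncate=False)
```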
## Test 2

Is there a better way to manage access control between ADLS and the tables? One option is credential pass-through: when a user accesses a table, his credentials are passed through to read the data from ADLS (which implies he must have access in ADLS to the files of those tables). The alternative is to use a fixed service principal to mount ADLS as DBFS and use that mount when creating the databases. A sketch of a pass-through mount follows.
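For comparison with the service-principal mount shown under Test 1, this is roughly how an ADLS Gen2 mount with credential pass-through is created (container and mount point are placeholders). It requires a premium workspace and a pass-through-enabled cluster, and each reader's own Azure AD identity is evaluated against the ADLS ACLs:

```python
# Sketch: ADLS Gen2 mount with Azure AD credential pass-through.
# Each user's own identity is used against ADLS, so users need ACLs on the data they read.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

dbutils.fs.mount(
    source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake-passthrough",
    extra_configs=configs,
)
```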
## Test 3

For applying Table Access Control so that all the Delta Lake and Hive tables on ADLS are restricted using GRANT/REVOKE commands (https://docs.databricks.com/administration-guide/admin-settings/table-acls/table-acl.html), we are planning to use the approach below:

- Create an ADLS Gen2 mount that is readable only via a service principal (client ID and secret).
- The service principal's credentials are not visible to users.
- The service principal is used to restrict direct access to the storage, so the only way to reach the data is through tables governed by GRANT/REVOKE.
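A minimal sketch of that approach, assuming an admin-only secret scope named adls-sp (a placeholder): the mount is created with the service principal's credentials exactly as in the Test 1 sketch, users cannot read the scope (and secret values are redacted in notebook output), and table access is then governed purely by GRANT/REVOKE/DENY.

```python
# Sketch, assuming an admin-only secret scope "adls-sp"; all names are placeholders.
# Users cannot read the scope, so the service principal's credentials stay hidden
# even though every cluster shares the resulting mount.
client_id = dbutils.secrets.get(scope="adls-sp", key="client-id")
client_secret = dbutils.secrets.get(scope="adls-sp", key="client-secret")
# ...mount /mnt/datalake with these credentials as shown under Test 1...

# With no direct ADLS access, the only path to the data is through the metastore,
# where table ACLs apply.
user = "`nonadmin@example.com`"
spark.sql(f"GRANT SELECT ON DATABASE demodb TO {user}")
spark.sql(f"REVOKE ALL PRIVILEGES ON DATABASE demodb2 FROM {user}")
spark.sql(f"DENY SELECT ON DATABASE demodb2 TO {user}")
```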