# Performance Tuning in Databricks Environment
- Databricks Cluster Tuning
  - Using Cluster Pool
    - Background
    - Cluster pool explained
    - Using Cluster Pool
    - Results
    - Cost Saving with cluster pools
- Using DBFS
- Using Delta
- Configuration tuning
- SQL Tuning
- Benchmark Results
## Databricks Cluster Tuning

### Using Cluster Pool

#### Background

- When submitting jobs from Guzzle using a Spark environment configured with a Data Engineering or Data Engineering Light cluster, a new cluster is started for every job (see the sketch after this list).
- More on the difference between Job clusters (also called Automated or Data Engineering / Data Engineering Light clusters; they are all the same as far as Guzzle is concerned, Light is just cheaper) and Interactive clusters is here: https://docs.databricks.com/clusters/index.html
- The time taken to run a job breaks down as follows:
  1. Databricks procuring the VMs (with OS images) to bring up the cluster as per the workers specified when defining the Spark environment
  2. Loading the Databricks Runtime (DBR)
  3. Guzzle installing all the libraries it needs
  4. Actually running the job
- The Guzzle runtime audit only shows the start and end time of the actual job (item 4 above).
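To illustrate why steps 1-3 repeat on every submission, here is a minimal sketch (not Guzzle's actual code) of a one-off run submitted through the Databricks Runs Submit REST API with a `new_cluster` block; the workspace URL, token, notebook path and node type below are placeholders:

```python
import requests

# Hypothetical workspace URL and token -- replace with your own.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXX"

# A one-off run that declares a `new_cluster`: Databricks provisions the VMs,
# loads the DBR, and only then starts the task -- which is why steps 1-3 above
# happen for every submission.
payload = {
    "run_name": "one-off job (illustrative)",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/example_job"},  # placeholder task
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(resp.json())  # returns a run_id for the submitted run
```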
#### Cluster pool explained

- Using a cluster pool can significantly improve performance, so we strongly suggest using one. More on cluster pools at: https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/. A pool keeps the VMs, with the DBR already loaded, alive for xx minutes before destroying them.
- A pool is a logical group of VMs with an idle time of xx minutes; an individual VM is released once it reaches the xx-minute idle time.
- It is similar to connection pools, which have a timeout before an idle connection is freed up (since an idle connection holds resources on the database side if kept open).
- When a request to create a new cluster comes in (with the pool id specified when creating the cluster), the available idle VMs are used to create the new cluster.
- If there are not enough idle VMs, new ones are procured, which takes up to five minutes including setting up the DBR on them.
- Cluster pools are similar to "connection pools" (or any pooling design pattern you are familiar with); a minimal sketch of the pattern follows this list.
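As a plain-language illustration of the pooling pattern itself (generic code, not the Databricks implementation), the sketch below reuses idle resources when it can, creates new ones on a miss, and destroys anything that has sat idle past the timeout:

```python
import time

class Pool:
    """Toy resource pool: reuse idle resources, create on miss, expire after an idle timeout."""

    def __init__(self, create, destroy, idle_timeout_s=300):
        self.create = create            # how to procure a new resource (slow, e.g. a VM)
        self.destroy = destroy          # how to release it
        self.idle_timeout_s = idle_timeout_s
        self.idle = []                  # list of (resource, time it became idle)

    def acquire(self):
        self._expire_idle()
        if self.idle:
            resource, _ = self.idle.pop()   # fast path: reuse an idle resource
            return resource
        return self.create()                # slow path: procure a new one

    def release(self, resource):
        self.idle.append((resource, time.monotonic()))

    def _expire_idle(self):
        now = time.monotonic()
        still_idle = []
        for resource, since in self.idle:
            if now - since < self.idle_timeout_s:
                still_idle.append((resource, since))
            else:
                self.destroy(resource)      # free resources that sat idle too long
        self.idle = still_idle

# Usage (illustrative): pool = Pool(create=start_vm, destroy=stop_vm, idle_timeout_s=900)
# vm = pool.acquire(); ... run job ...; pool.release(vm)
```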
#### Using Cluster Pool

- Define a cluster pool in the Databricks workspace. You basically specify a few important things: the minimum number of idle instances to keep even after they hit the idle time, and the maximum number of instances the pool can hold (we suggest 0 for min and blank for max for now), since keeping instances alive at all times keeps the VM meter running.
- On the Guzzle side it is quite simple: just update the pool id. A sketch of both steps follows.
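For reference, a pool can also be created and used programmatically. The sketch below uses the Databricks Instance Pools and Clusters REST APIs; the workspace URL, token, node type, names and numbers are placeholders, and Guzzle itself only needs the resulting pool id:

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace
HEADERS = {"Authorization": "Bearer dapiXXXXXXXXXXXX"}         # placeholder token

# 1. Define the pool: minimum idle instances, idle timeout, and (optionally) max capacity.
pool_spec = {
    "instance_pool_name": "guzzle-pool",          # illustrative name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,                      # keep 0 warm instances (no always-on VM cost)
    "idle_instance_autotermination_minutes": 15,  # release a VM after this much idle time
    # "max_capacity" omitted = unlimited, matching the "blank for max" suggestion above
}
pool_id = requests.post(
    f"{HOST}/api/2.0/instance-pools/create", headers=HEADERS, json=pool_spec
).json()["instance_pool_id"]

# 2. Any cluster that should draw its VMs from the pool references the pool id
#    instead of a node type -- this is the id you paste into the Guzzle Spark environment.
cluster_spec = {
    "cluster_name": "guzzle-job-cluster",         # illustrative name
    "spark_version": "7.3.x-scala2.12",
    "instance_pool_id": pool_id,
    "num_workers": 2,
}
requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=cluster_spec)
```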
#### Results

Without a pool, the time it takes to bring up a cluster is high (4 minutes 12 seconds).
However, when using a pool it takes 32 seconds to bring up the cluster.
The Guzzle job still spends time loading the libraries, which is around 80 seconds (we are working on optimizing this). The remaining 20-25 seconds go to running the actual job.
#### Cost Saving with cluster pools

Without a cluster pool: 50 jobs x (5 min of cluster start + 1 min of job time) = 300 min of VM cost, plus 50 min of DBU.
With a cluster pool: 1st job (5 min of cluster start + 1 min of job time) + 49 jobs x (1 min of cluster start + 1 min of job time) + 5 min of idle time for the VMs = 109 min of VM cost (of which 5 min is idle time), plus 50 min of DBU cost.
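The same arithmetic spelled out, as a back-of-the-envelope sketch using the assumed 5-minute cold start, 1-minute warm start, 1-minute job time and 5-minute idle window from above:

```python
# Back-of-the-envelope VM-minute comparison for 50 one-minute jobs.
jobs = 50
cold_start, warm_start, job_time, pool_idle = 5, 1, 1, 5  # minutes (assumptions above)

without_pool = jobs * (cold_start + job_time)
# The first job still pays the cold start (the pool is empty); the rest reuse warm VMs.
with_pool = (cold_start + job_time) + (jobs - 1) * (warm_start + job_time) + pool_idle

print(without_pool)  # 300 VM-minutes
print(with_pool)     # 109 VM-minutes (of which 5 are idle time)
```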
## Using DBFS

- DBFS provides exceptionally good performance.
- You can refer to the DBFS file system as a local file system endpoint in Guzzle using /dbfs/xxx, or as a DBFS file system endpoint: /mnt/data... (see the sketch after this list).
- The files are cached (and, we presume, even kept unzipped) when accessed again. On an Analytics cluster this works exceptionally fast.
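For example, the same (hypothetical) mounted file can be addressed both ways from a Databricks notebook or job, where `spark` is the session Databricks provides; the mount and file names are placeholders:

```python
# 1. DBFS path, as Spark sees it:
df = spark.read.csv("dbfs:/mnt/data/example/customers.csv", header=True)

# 2. Local-file-system view of the same file (the DBFS FUSE mount under /dbfs),
#    usable by plain Python on the driver:
with open("/dbfs/mnt/data/example/customers.csv") as f:
    first_line = f.readline()

print(df.count(), first_line)
```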
## Benchmark Results

- Benchmark using Data Engineering (DE) and Data Analytics (DA) clusters, with and without cluster pools.