Performance Tuning in Databricks Environment


Databricks Cluster Tuning

Using Cluster Pool

Background

  1. When submitting jobs from Guzzle using a Spark environment configured as Data Engineering or Data Engineering Light, a new cluster is started for every job (a sketch of the equivalent API call follows this list):

image

  2. More on the difference between a Job cluster (also called Automated, or Data Engineering / Data Engineering Light - from the way Guzzle handles them they all mean the same, Light is just cheaper) and an Interactive cluster is here: https://docs.databricks.com/clusters/index.html
  3. The total time taken to run a job breaks down as follows:
    1. Databricks securing VMs with OS images to bring up the cluster as per the workers specified when defining the Spark environment
    2. Loading the Databricks Runtime (DBR)
    3. Guzzle installing all the libraries it needs
    4. Actually running the job
  4. The Guzzle runtime audit only shows the start and end time of the actual job (item 4 above)

image

  5. Using a cluster pool can significantly improve performance, so it is strongly suggested. More on cluster pools at: https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/. A pool keeps VMs with DBR already loaded alive for a configurable number of idle minutes before destroying them.
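As a rough sketch of what happens on each submission without a pool, a per-job cluster request through the Databricks Runs Submit API looks roughly like the following. This is illustrative only and not Guzzle's actual payload; the host, token, notebook path, node type and DBR version are placeholder assumptions.

```python
import requests

# Placeholder workspace URL and token - replace with real values.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

# A "new_cluster" spec means fresh VMs are provisioned for this run only:
# the multi-minute startup (VMs + DBR + libraries) is paid on every job.
payload = {
    "run_name": "guzzle-job-example",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",   # illustrative DBR version
        "node_type_id": "Standard_DS3_v2",    # illustrative worker type
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/example-job"},  # placeholder task
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(resp.json())  # returns a run_id; the cluster lives only for this run
```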

Cluster pool explained

  1. A pool is a logical group of VMs with a configurable idle time; an individual VM is released once it has been idle for that many minutes
  2. It is similar to a connection pool, which has a timeout before an idle connection is freed up (since an idle connection holds resources on the database side if kept open)
  3. When a request to create a new cluster comes in (with the pool id specified in the cluster definition), available idle VMs are used to create the new cluster (see the sketch after this list for inspecting a pool's idle count)
  4. If there are not enough idle VMs, new ones are procured, which can take up to five minutes including setting up DBR on them
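As a hedged illustration of the reuse behaviour described above, the Databricks Instance Pools API can be queried to see how many pooled VMs are currently idle (and therefore available to the next cluster request) versus in use. The host, token, pool id and the exact response field names are assumptions based on the public API, not something Guzzle exposes.

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                           # placeholder
POOL_ID = "0101-120000-pool-abcdefgh"                                    # placeholder

# Fetch the pool definition together with its current usage statistics.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"instance_pool_id": POOL_ID},
)
pool = resp.json()

stats = pool.get("stats", {})
print("idle VMs ready for the next cluster:", stats.get("idle_count", 0))
print("VMs currently used by clusters:", stats.get("used_count", 0))
print("idle timeout (minutes):", pool.get("idle_instance_autotermination_minutes"))
```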

Using Cluster Pool

  1. Cluster pools are similar to "connection pools" (or any pooling design pattern you are aware of)
  2. Define the cluster pool in the Databricks workspace. You basically specify a few important things, such as the minimum number of instances to keep even after they hit the idle time and the maximum number of instances it should keep (suggest keeping 0 for min and blank for max for now), since instances that are kept alive permanently keep the VM meter running. A hedged API sketch of the same settings follows the screenshot below.

image
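The same settings can also be applied through the Instance Pools API. A minimal sketch, assuming an illustrative 10-minute idle timeout, node type and DBR version, with min kept at 0 and no max capacity set, as suggested above:

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                           # placeholder

# min_idle_instances = 0 and no max_capacity mirrors the "0 for min, blank for
# max" suggestion above; idle VMs are released after the autotermination window.
payload = {
    "instance_pool_name": "guzzle-job-pool",
    "node_type_id": "Standard_DS3_v2",                 # illustrative worker type
    "min_idle_instances": 0,                           # keep nothing warm when fully idle
    "idle_instance_autotermination_minutes": 10,       # illustrative idle timeout
    "preloaded_spark_versions": ["9.1.x-scala2.12"],   # pre-load DBR on pooled VMs
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
print(resp.json())  # returns the instance_pool_id to reference from Guzzle
```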

  3. On the Guzzle side it is quite simple: just update the pool id (the sketch after the screenshot below shows the equivalent cluster spec)

image
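Under the hood, pointing a Spark environment at a pool amounts to the cluster spec referencing the pool instead of a node type. A minimal sketch of the equivalent Databricks cluster spec (the pool id is a placeholder; this is not Guzzle's internal format):

```python
# With instance_pool_id set, Databricks draws VMs from the pool instead of
# provisioning new ones; the node type comes from the pool definition.
new_cluster_with_pool = {
    "spark_version": "9.1.x-scala2.12",              # illustrative DBR version
    "instance_pool_id": "0101-120000-pool-abcdefgh", # placeholder pool id
    "num_workers": 2,
}
```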

Results

Without a pool, the time it takes to bring up the cluster is high (4 min 12 sec):

image

However, when using a pool it takes 32 seconds to bring up the cluster:

image

The Guzzle job still spends time loading the libraries, which is around 80 seconds (we are working on optimizing this). The remaining 20-25 seconds go to running the actual job.

Cost Saving with cluster pools

Without a cluster pool: 50 jobs x (5 min of cluster start + 1 min of job time) = 300 min of VM cost, plus 50 min of DBU

With a cluster pool: the 1st job takes 5 min of cluster start + 1 min of job time, the remaining 49 jobs each take 1 min of cluster start + 1 min of job time, plus 5 min of idle time for the VMs = 109 min of VM cost (of which 5 min is idle time), plus 50 min of DBU cost. The same arithmetic is spelled out below.
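This assumes the 50 jobs run back-to-back so the pool never sits idle between jobs, and that DBUs accrue only while the job itself runs:

```python
# Back-of-the-envelope check of the figures above (all values in minutes).
jobs = 50
cold_start, warm_start, job_time, pool_idle = 5, 1, 1, 5

# Without a pool every job pays the full cold start.
vm_no_pool = jobs * (cold_start + job_time)       # 300 min of VM time
dbu_no_pool = jobs * job_time                     # 50 min of DBU

# With a pool only the first job pays the cold start; the final idle window
# before the pooled VMs terminate is also billed as VM time.
vm_pool = (cold_start + job_time) + (jobs - 1) * (warm_start + job_time) + pool_idle  # 109
dbu_pool = jobs * job_time                        # 50 min of DBU

print(vm_no_pool, dbu_no_pool)  # 300 50
print(vm_pool, dbu_pool)        # 109 50
```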

Using DBFS

  1. DBFS provides exceptionally good performance
  2. You can refer to the DBFS file system as a local file system endpoint in Guzzle using /dbfs/xxx, or as a DBFS file system endpoint: /mnt/data... (see the sketch after this list)
  3. The files are cached (and, presumably, kept unzipped) when accessed again. On an Analytics cluster this works exceptionally fast
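For example, the same mounted location can be addressed either way from a Databricks cluster (the mount name and file name are placeholders; `spark` is the session Databricks provides in notebooks and jobs):

```python
# 1. Local file system endpoint via the DBFS FUSE mount - plain Python I/O works:
with open("/dbfs/mnt/data/sample.csv") as f:
    print(f.readline())

# 2. DBFS endpoint via the dbfs:/ scheme - used with Spark APIs:
df = spark.read.csv("dbfs:/mnt/data/sample.csv", header=True)
df.show(5)
```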

Using Delta

Configuration tuning

SQL Tuning

Benchmark Results

  1. Benchmark using Data Engineering (DE) and Data Analytics (DA) clusters, with and without cluster pools

test-results_summarized.xlsx
