Apache Spark on Databricks (AWS, Azure, or GCP) - synccomputingcode/user_documentation GitHub Wiki

Collecting Spark Event Logs

Obtaining the Spark History Server event logs requires that cluster log delivery be enabled. If it is not turned on, you will need to update the settings on your jobs/clusters to enable it.

There are two ways to do so, as described in the Databricks documentation. See the Databricks guidance for AWS, Azure, and GCP.

  1. Through the console, by setting the Cluster Log Path under Advanced Options to either a DBFS location or a destination in the platform's cloud storage (AWS S3, Google Cloud Storage, or Azure Blob Storage).

  2. By adding a cluster_log_conf entry to the cluster configuration JSON used to create clusters. For DBFS, it would look something like the following.

"cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster-logs"
        }
    }

For S3, it may look like the following (Databricks also expects a region or endpoint to be specified for S3 destinations):

"cluster_log_conf": {
    "s3": {
        "destination": "s3://cluster-logs",
        "region": "us-west-2"
    }
}

For Azure Blob Storage, it may look like:

"cluster_log_conf": {
        "wasb": {
            "destination": "wasbs://cluster-logs"
        }
    }

For Google Cloud Storage, it may look like:

"cluster_log_conf": {
        "gs": {
            "destination": "gs://cluster-logs"
        }
    }

For example, the relevant event log file(s) can be found in the path below. The parts in {} vary from job to job, and there may be more than one event log file associated with a single job run.

dbfs:/cluster-logs/{cluster_job_identifier}/eventlog/{another_cluster_job_identifier}/{numeric_identifier}/
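Once the logs have been copied to a local directory (for example, with the Databricks CLI), a small sketch like the following can gather the event log files. The `eventlog` directory name comes from the path layout above; treating everything under it as an event log file is an assumption, since exact file names vary per run:

```python
from pathlib import Path

def find_event_logs(log_root):
    """Return all files under any 'eventlog' directory beneath log_root.

    Mirrors the layout {cluster_job_identifier}/eventlog/.../..., where the
    identifiers vary per job, so we glob rather than hard-code them.
    """
    root = Path(log_root)
    return sorted(p for p in root.glob("*/eventlog/**/*") if p.is_file())

# Usage: print every event log file found under a local copy of the logs.
for path in find_event_logs("cluster-logs"):
    print(path)
```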

Keep in mind that when storing logs in cloud storage, the destination must be in the same region as the cluster, and the Databricks IAM role must be granted the appropriate permissions on the destination.
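For an S3 destination, the permissions in question are standard S3 read/write/list actions for the role the cluster runs under. A sketch of such an IAM policy (the bucket name is a placeholder; check the Databricks documentation for the exact set of actions your setup needs) might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::cluster-logs"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::cluster-logs/*"
    }
  ]
}
```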