Apache Spark on GCP Dataproc

Obtaining cluster and Apache Spark configuration recommendations requires two inputs: the cluster information and the Apache Spark event logs.

Collecting cluster information

The GCP gcloud command-line utility can provide the necessary Dataproc cluster information. See the example command below:

gcloud dataproc clusters describe {my_cluster} --region=us-central1
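If the cluster description needs to be saved as a file for upload, the command output can be redirected. The sketch below is illustrative: the output filename is a placeholder, and it uses gcloud's --format flag to request JSON instead of the default YAML.

# Save the cluster description as JSON (filename is illustrative)
gcloud dataproc clusters describe {my_cluster} \
    --region=us-central1 \
    --format=json > my_cluster_config.json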

See the GCP documentation for additional guidance.

Enabling event log storage

Storing the event logs in an accessible Google Cloud Storage location requires adding specific properties during cluster creation.

The example below uses the gcloud command line:

gcloud dataproc clusters create {cluster-name} \
    --region={region} \
    --image-version={version} \
    --enable-component-gateway \
    --properties='dataproc:job.history.to-gcs.enabled=true,spark:spark.history.fs.logDirectory=gs://{bucket-name}/{directory}/spark-job-history,spark:spark.eventLog.dir=gs://{bucket-name}/{directory}/spark-job-history'
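To confirm that the properties were applied, the cluster description from the previous section can be inspected. The command below is a sketch that assumes gcloud's value() format projection and the standard Cluster resource field path for software properties.

# Print only the software properties of an existing cluster (projection path assumed)
gcloud dataproc clusters describe {cluster-name} \
    --region={region} \
    --format="value(config.softwareConfig.properties)"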

If the Dataproc REST API is used instead, the same properties are set under "softwareConfig", with each key carrying its file prefix (for example "spark:" or "dataproc:"). See the example snippet below.

    "softwareConfig": {
        "properties": {
            "spark.history.fs.logDirectory": "gs://{bucket-name}/{directory}/spark-job-history",
            "spark.eventLog.dir": "gs://{bucket-name}/{directory}/spark-job-history"
        }
    }
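For reference, a minimal REST call carrying this softwareConfig might look like the curl sketch below. The project, region, and cluster name are placeholders, the request body is trimmed to the fields relevant here, and a real cluster would normally need additional configuration (machine types, networking, and so on).

# Create a cluster via the Dataproc REST API (minimal sketch; body is incomplete for a real cluster)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1/projects/{project-id}/regions/{region}/clusters" \
  -d '{
    "clusterName": "{cluster-name}",
    "config": {
      "endpointConfig": {"enableHttpPortAccess": true},
      "softwareConfig": {
        "properties": {
          "dataproc:job.history.to-gcs.enabled": "true",
          "spark:spark.history.fs.logDirectory": "gs://{bucket-name}/{directory}/spark-job-history",
          "spark:spark.eventLog.dir": "gs://{bucket-name}/{directory}/spark-job-history"
        }
      }
    }
  }'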

After an Apache Spark job completes on a GCP Dataproc cluster, its event log can be found in the spark.eventLog.dir Google Cloud Storage location. The log file is typically named application_{numeric identifier}.
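The event logs can be listed and downloaded with gsutil. The commands below reuse the bucket and directory placeholders from the cluster creation example; the application ID shown is illustrative.

# List the event logs written by completed Spark jobs
gsutil ls gs://{bucket-name}/{directory}/spark-job-history/

# Copy a specific event log locally (application ID is illustrative)
gsutil cp gs://{bucket-name}/{directory}/spark-job-history/application_{numeric identifier} .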

See the GCP Dataproc documentation for additional guidance.