GCP Dataproc - dennisholee/notes GitHub Wiki

Web portals

  • Hadoop cluster http://{MASTER_VM_IP}:8088
  • Hadoop Administration interface http://{MASTER_VM_IP}:9870

Command Lines

  • List clusters cloud dataproc clusters list --region {REGION_NAME}

  • Create Dataproc cluster

gcloud dataproc clusters create {CLUSTER_NAME} \ 
  --region ${MYREGION} \
  --zone ${MYZONE} \
  --num-masters 1 \ # 1-Standard, 3-High availability 
  --master-machine-type n1-standard-2 \
  --master-boot-disk-size 100GB 
  --worker-machine-type n1-standard-2 \
  --worker-boot-disk-size 50GB \
  --num-workers 3 \
  --bucket ${BUCKET} \
  --tags {NETWORK_TAG} \ # Network firewall for restricted access
  --image-version {IMAGE_VERSION}
  --scopes https://www.googleapis.com/auth/cloud-platform # enables all API scopes

Dataproc HA (High Availability)

# Additional resiliency on the master node
gcloud dataproc clusters create cluster-name --num-masters 3

Note: For list of scopes see https://cloud.google.com/sdk/gcloud/reference/compute/instances/create.

Image version: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions

  • Get attached bucket name gcloud dataproc clusters describe dataproc-cluster --region us-central1 --format "value(config.configBucket)"

  • Get cluster's Cloud Storage and Compute instance details

# List bucket content (Refer to get bucket name command above)
gsutil ls -r gs://{BUCKET_NAME}

# List compute instances
gcloud compute instances list
  • Submit PySpark job
gcloud dataproc jobs submit pyspark  gs://${BUCKET_NAME}/{python_script} --cluster ${CLUSTER} --region {REGION} 

Secure access to master

Add firewall rule to restrict access

  1. Add network tag
gcloud compute instances add-tags {CLUSTER_MASTER_VM} --tags {TAG_NAME} --zone {ZONE}
  1. Verify tag applied successfully
gcloud compute instances describe {CLUSTER_MASTER_VM} --zone {ZONE} --format "value(tags)"
  1. Add firewall rule
gcloud compute firewall-rules create {FIREWALL_RULE_NAME} --direction ingress --action allow --target-tags {TAG_NAME} --source-ranges "{RESTRICTED IP}/32" --rules "tcp:9870,tcp:8088"
  1. Verify configuration
gcloud compute firewall-rules describe {FIREWALL_RULE_NAME}

Useful Links