GCP Dataproc - dennisholee/notes GitHub Wiki
Web portals
- Hadoop cluster
http://{MASTER_VM_IP}:8088
- Hadoop Administration interface
http://{MASTER_VM_IP}:9870
Command Lines
-
List clusters
cloud dataproc clusters list --region {REGION_NAME}
-
Create Dataproc cluster
gcloud dataproc clusters create {CLUSTER_NAME} \
--region ${MYREGION} \
--zone ${MYZONE} \
--num-masters 1 \ # 1-Standard, 3-High availability
--master-machine-type n1-standard-2 \
--master-boot-disk-size 100GB
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 50GB \
--num-workers 3 \
--bucket ${BUCKET} \
--tags {NETWORK_TAG} \ # Network firewall for restricted access
--image-version {IMAGE_VERSION}
--scopes https://www.googleapis.com/auth/cloud-platform # enables all API scopes
Dataproc HA (High Availability)
# Additional resiliency on the master node
gcloud dataproc clusters create cluster-name --num-masters 3
Note: For list of scopes see https://cloud.google.com/sdk/gcloud/reference/compute/instances/create.
Image version: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions
-
Get attached bucket name
gcloud dataproc clusters describe dataproc-cluster --region us-central1 --format "value(config.configBucket)"
-
Get cluster's Cloud Storage and Compute instance details
# List bucket content (Refer to get bucket name command above)
gsutil ls -r gs://{BUCKET_NAME}
# List compute instances
gcloud compute instances list
- Submit PySpark job
gcloud dataproc jobs submit pyspark gs://${BUCKET_NAME}/{python_script} --cluster ${CLUSTER} --region {REGION}
Secure access to master
Add firewall rule to restrict access
- Add network tag
gcloud compute instances add-tags {CLUSTER_MASTER_VM} --tags {TAG_NAME} --zone {ZONE}
- Verify tag applied successfully
gcloud compute instances describe {CLUSTER_MASTER_VM} --zone {ZONE} --format "value(tags)"
- Add firewall rule
gcloud compute firewall-rules create {FIREWALL_RULE_NAME} --direction ingress --action allow --target-tags {TAG_NAME} --source-ranges "{RESTRICTED IP}/32" --rules "tcp:9870,tcp:8088"
- Verify configuration
gcloud compute firewall-rules describe {FIREWALL_RULE_NAME}