Dataproc
Dataproc is a managed service for running Apache Spark, Apache Hadoop, Apache Flink, Presto, and 30+ other open source tools and frameworks. It supports batch processing, querying, streaming, data lake modernization, ETL, secure data science, and machine learning.
Quickstarts
https://cloud.google.com/dataproc/docs/quickstarts
https://codelabs.developers.google.com/codelabs/cloud-dataproc-gcloud
Migrating Hadoop to Dataproc
Enterprises migrate existing on-premises Apache Hadoop and Spark clusters to Dataproc to control costs and scale capacity elastically.
https://www.youtube.com/watch?v=h1LvACJWjKc
Dataproc Serverless
Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster.
https://cloud.google.com/dataproc-serverless/docs
https://medium.com/geekculture/creating-serverless-spark-jobs-with-google-cloud-dd84c375947d
Fast export large database tables — using GCP Serverless Dataproc
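As a rough illustration of submitting a Spark batch workload without a cluster, the sketch below assembles a `gcloud dataproc batches submit pyspark` command line in Python. The helper name `build_batch_submit_command`, the bucket, the script path, and the batch ID are hypothetical placeholders, not taken from the linked guides; only the `--region` and `--batch` flags and the `--` argument separator are assumed from the gcloud CLI.

```python
import shlex


def build_batch_submit_command(main_python_file, region, batch_id=None, job_args=()):
    """Return the argv list for a Dataproc Serverless PySpark batch submission.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file,
        f"--region={region}",
    ]
    if batch_id:
        # Optional user-chosen ID for the batch workload.
        argv.append(f"--batch={batch_id}")
    if job_args:
        # Everything after "--" is passed through to the PySpark script.
        argv.append("--")
        argv.extend(job_args)
    return argv


if __name__ == "__main__":
    cmd = build_batch_submit_command(
        "gs://example-bucket/jobs/wordcount.py",  # placeholder script location
        region="us-central1",
        batch_id="wordcount-demo",
        job_args=["gs://example-bucket/input/", "gs://example-bucket/output/"],
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the sketch testable on a machine without the Google Cloud SDK installed.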
Using Spark on Dataproc
Tuning Spark on Dataproc
https://mkuthan.github.io/blog/2022/03/24/gcp-dataproc-spark-tuning/
Running PySpark jobs on Serverless Dataproc
Ingesting Google Cloud Storage files to BigQuery using Cloud Functions and Serverless Spark
Cloud Spanner export query results using Dataproc Serverless
Serverless Spark components for Vertex AI Pipelines
https://cloud.google.com/vertex-ai/docs/pipelines/dataproc-component
Metastore
Dataproc Metastore is a managed Hive metastore that serves as a centralized metadata repository, shared among ephemeral Dataproc clusters running different open source components.
Component Gateway
Component Gateway provides secure access to web endpoints for Dataproc default and optional components.
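Component Gateway is enabled when a cluster is created. A minimal sketch, assuming the gcloud CLI flags `--enable-component-gateway` and `--optional-components`; the helper name `build_cluster_create_command`, the cluster name, and the region are hypothetical placeholders.

```python
import shlex


def build_cluster_create_command(cluster_name, region, optional_components=()):
    """Return the argv list for creating a cluster with Component Gateway enabled.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--enable-component-gateway",
    ]
    if optional_components:
        # Optional components (for example JUPYTER) are passed as a comma-separated list.
        argv.append("--optional-components=" + ",".join(optional_components))
    return argv


if __name__ == "__main__":
    cmd = build_cluster_create_command(
        "example-cluster",  # placeholder cluster name
        region="us-central1",
        optional_components=["JUPYTER"],
    )
    print(shlex.join(cmd))
```

Once such a cluster is running, the web interfaces of its components are reachable through the gateway links on the cluster's page in the Google Cloud console.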
Big Data analytics with Dataproc
Migrating Hadoop to Dataproc
https://www.youtube.com/watch?v=YK_-yS9y_0k
AWS EMR vs Dataproc
Dataproc Hadoop Data Storage
Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS).
https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs
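Dataproc clusters also ship with the Cloud Storage connector, so jobs can usually read `gs://` paths where they previously read `hdfs://` paths. The sketch below shows one way to rewrite paths during a migration; the function name `hdfs_to_gcs` and the bucket name are hypothetical, and it assumes the HDFS directory layout is mirrored unchanged inside a single bucket.

```python
from urllib.parse import urlsplit


def hdfs_to_gcs(uri, bucket):
    """Map an hdfs:// URI to a gs:// URI, keeping the path unchanged.

    Assumes (hypothetically) that the HDFS tree was copied as-is into `bucket`.
    """
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {uri!r}")
    # The namenode host and port in the URI are dropped; only the path is kept.
    return f"gs://{bucket}/{parts.path.lstrip('/')}"


if __name__ == "__main__":
    print(hdfs_to_gcs("hdfs://namenode:8020/data/events/part-0000.parquet", "example-bucket"))
```

HDFS on the cluster's disks disappears when an ephemeral cluster is deleted, which is the usual reason for keeping long-lived data in Cloud Storage instead.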
Creating Dataproc cluster
Dataproc and Spark on Kubernetes
Examples
Hive and HBase
https://medium.com/@datacouch/big-data-processing-using-google-dataproc-d911d0b05313
Use the BigQuery connector with Dataproc Serverless for Spark
https://cloud.google.com/dataproc-serverless/docs/guides/bigquery-connector-spark-example
Dataproc serverless Spark for Delta Lake by Databricks
Tutorials
https://cloud.google.com/dataproc/docs/tutorials
Write Spark Scala Jobs (From Spark to DataProc)
Machine Learning with Spark on Google Cloud Dataproc
Distributed Image Processing in Cloud Dataproc
Using Apache Spark DStreams with Dataproc and Pub/Sub
Cloud Bigtable map reduce word count example with Dataproc
Install and run a Jupyter notebook on a Dataproc cluster
Apache Spark and Jupyter Notebooks made easy with Dataproc component gateway