DataProc - bobbae/gcp GitHub Wiki

Dataproc is a managed service that can run Apache Spark, Apache Hadoop, Apache Flink, Presto, and 30+ open source tools and frameworks for batch processing, querying, streaming, data lake modernization, ETL, secure data science, and machine learning.

Quickstarts

https://cloud.google.com/dataproc/docs/quickstarts

https://codelabs.developers.google.com/codelabs/cloud-dataproc-gcloud

Migrating Hadoop to Dataproc

Enterprises migrate their on-premises Apache Hadoop and Spark clusters to Dataproc to control costs and scale elastically.

https://www.youtube.com/watch?v=h1LvACJWjKc

Dataproc Serverless

Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster.
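As a rough illustration of the "no cluster" workflow, the sketch below only assembles the argument list for `gcloud dataproc batches submit pyspark`; it does not call gcloud itself. The function name, bucket, file path, batch ID, and region are hypothetical placeholders, and the flags shown (`--region`, `--batch`, `--deps-bucket`) are a small subset chosen for the example.

```python
import shlex


def serverless_pyspark_batch_command(main_file, region, batch_id=None,
                                     deps_bucket=None, job_args=()):
    """Build the argv for submitting a PySpark batch to Dataproc Serverless.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    cmd = ["gcloud", "dataproc", "batches", "submit", "pyspark",
           main_file, f"--region={region}"]
    if batch_id:
        # Optional ID for the batch; gcloud generates one when omitted.
        cmd.append(f"--batch={batch_id}")
    if deps_bucket:
        # Bucket used to stage local dependencies (placeholder value below).
        cmd.append(f"--deps-bucket={deps_bucket}")
    if job_args:
        # Everything after "--" is passed through to the Spark job.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # All values here are assumptions made up for the example.
    command = serverless_pyspark_batch_command(
        "gs://example-bucket/jobs/wordcount.py",
        region="us-central1",
        batch_id="wordcount-demo",
        deps_bucket="example-bucket",
        job_args=["--input", "gs://example-bucket/input/"],
    )
    print(shlex.join(command))
```

Note that no cluster name appears anywhere in the command: the batch is submitted to a region, and the service provisions the Spark resources. To actually submit, the list could be passed to `subprocess.run(command, check=True)` on a machine with an authenticated gcloud installation.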

https://cloud.google.com/dataproc-serverless/docs

https://medium.com/geekculture/creating-serverless-spark-jobs-with-google-cloud-dd84c375947d

Fast export large database tables — using GCP Serverless Dataproc

https://medium.com/google-cloud/fast-export-large-database-tables-using-gcp-serverless-dataproc-spark-bb32b1260268

Using Spark on Dataproc

Tuning Spark on Dataproc

https://mkuthan.github.io/blog/2022/03/24/gcp-dataproc-spark-tuning/

Running PySpark jobs on Serverless Dataproc

https://medium.com/cts-technologies/running-pyspark-jobs-on-google-cloud-using-serverless-dataproc-f16cef5ec6b9

Ingesting Google Cloud Storage files to BigQuery using Cloud Functions and Serverless Spark

https://cloud.google.com/blog/products/data-analytics/ingesting-data-into-bigquery-using-serverless-spark

Cloud Spanner export query results using Dataproc Serverless

https://medium.com/google-cloud/cloud-spanner-export-query-results-using-dataproc-serverless-6f2f65b583a4

Serverless Spark components for Vertex AI Pipelines

https://cloud.google.com/vertex-ai/docs/pipelines/dataproc-component

Metastore

Dataproc Metastore is a managed Hive metastore. It serves as a centralized metadata repository shared among ephemeral Dataproc clusters, even when those clusters run different open source components.
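The sharing pattern described above can be sketched as follows: several short-lived clusters are created with the same metastore service attached through the `--dataproc-metastore` flag of `gcloud dataproc clusters create`. This is a minimal sketch that only builds the commands; the helper names, project, region, service, and cluster names are all hypothetical.

```python
import shlex


def metastore_service_name(project, location, service):
    """Return the full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{location}/services/{service}"


def cluster_with_metastore_command(cluster, region, metastore):
    """Build the argv that creates a cluster attached to an existing metastore.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            f"--region={region}",
            f"--dataproc-metastore={metastore}"]


if __name__ == "__main__":
    # Placeholder identifiers, assumed for illustration only.
    shared = metastore_service_name("example-project", "us-central1", "shared-hms")
    # Two ephemeral clusters point at the same metadata repository.
    for name in ("etl-nightly", "adhoc-analytics"):
        print(shlex.join(cluster_with_metastore_command(name, "us-central1", shared)))
```

Because table definitions live in the metastore service rather than on any one cluster, either cluster can be deleted after its job finishes without losing the metadata.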

Component Gateway

Component Gateway provides secure access to web endpoints for Dataproc default and optional components.
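Component Gateway is turned on when a cluster is created. The sketch below builds a `gcloud dataproc clusters create` command with `--enable-component-gateway` and, optionally, `--optional-components`; it does not run gcloud. The helper name, cluster name, and region are hypothetical, and Jupyter is used only as an example of an optional component.

```python
import shlex


def component_gateway_cluster_command(cluster, region, optional_components=()):
    """Build the argv that creates a cluster with Component Gateway enabled.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    cmd = ["gcloud", "dataproc", "clusters", "create", cluster,
           f"--region={region}",
           "--enable-component-gateway"]
    if optional_components:
        # gcloud expects a comma-separated list of component names.
        names = ",".join(c.upper() for c in optional_components)
        cmd.append(f"--optional-components={names}")
    return cmd


if __name__ == "__main__":
    # Placeholder cluster name and region, assumed for illustration only.
    command = component_gateway_cluster_command(
        "notebook-cluster", "us-central1", optional_components=["jupyter"])
    print(shlex.join(command))
```

Once such a cluster is running, the web interfaces of its components are reachable through links on the cluster's page in the Google Cloud console, without opening firewall ports or setting up an SSH tunnel.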

Big Data analytics with Dataproc

https://programmaticponderings.com/2018/12/11/big-data-analytics-with-java-and-python-using-cloud-dataproc-googles-fully-managed-spark-and-hadoop-service/

Migrating Hadoop to Dataproc

https://www.youtube.com/watch?v=YK_-yS9y_0k

AWS EMR vs Dataproc

https://hackernoon.com/why-dataproc-googles-managed-hadoop-and-spark-offering-is-a-game-changer-9f0ed183fda3

Dataproc Hadoop Data Storage

Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS).

https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs

Creating Dataproc cluster

https://medium.com/google-cloud/creating-a-dataproc-cluster-considerations-gotchas-resources-930294ddb7f3

Dataproc and Spark on Kubernetes

https://cloud.google.com/blog/products/infrastructure-modernization/running-spark-on-kubernetes-with-dataproc

Examples

Hive and HBase

https://medium.com/@datacouch/big-data-processing-using-google-dataproc-d911d0b05313

Use the BigQuery connector with Dataproc Serverless for Spark

https://cloud.google.com/dataproc-serverless/docs/guides/bigquery-connector-spark-example

Dataproc Serverless for Spark with Delta Lake by Databricks

https://medium.com/google-cloud/processing-databricks-delta-lake-data-in-google-cloud-dataproc-serverless-for-spark-1cc1405a3ee4

Tutorials

https://cloud.google.com/dataproc/docs/tutorials

Write Spark Scala Jobs (From Spark to DataProc)

Hadoop to GCP

Machine Learning with Spark on Google Cloud Dataproc

Distributed Image Processing in Cloud Dataproc

Using Apache Spark DStreams with Dataproc and Pub/Sub

Cloud Bigtable map reduce word count example with Dataproc

Install and run a Jupyter notebook on a Dataproc cluster

Apache Spark and Jupyter Notebooks made easy with Dataproc component gateway

Cloud Dataproc example repository

Use Presto with Dataproc
