Dataflow - bobbae/gcp GitHub Wiki

Dataflow is a managed service for executing a wide variety of data processing patterns.

Google Cloud Dataflow makes it easy to process and analyze real-time streaming data so that you can derive insights and react to new information in real time.

https://www.youtube.com/watch?v=7lJyq1hw_KI

Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms.
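
For example, each labeled step in a simple word-count pipeline becomes a node in that execution graph, and each intermediate result is a PCollection. A minimal sketch (the gs:// paths are placeholders):

```python
import apache_beam as beam

# Each labeled transform below becomes a step in the execution graph that
# Dataflow builds from the pipeline code; every intermediate result is a
# PCollection. The gs:// paths are placeholders.
with beam.Pipeline() as pipeline:
    lines = pipeline | "ReadLines" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
    counts = (
        lines
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
    )
    counts | "WriteCounts" >> beam.io.WriteToText("gs://my-bucket/output/counts")
```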

https://www.youtube.com/watch?v=cqDBnOaS6O4

Apache Beam

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines.

You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
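
The runner is just a pipeline option, so the same code can be tested locally and then submitted to the Dataflow service. A hedged sketch (project, region, and bucket names are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# runner="DataflowRunner" submits the job to the Dataflow service;
# runner="DirectRunner" executes the same pipeline locally.
# Project, region, and bucket names are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | beam.Create(["hello", "world"])
        | beam.Map(str.upper)
        | beam.Map(print)
    )
```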

The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.

Apache Beam SQL

https://medium.com/syntio/data-processing-with-dataflow-sql-part-1-2-fe57e47f4bb0
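
Beam SQL lets you query PCollections with SQL. A minimal sketch using the Python SDK's SqlTransform, which is a cross-language transform and therefore needs a Java runtime (or an expansion service) available when it runs; the sample rows are illustrative:

```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# SqlTransform queries the input PCollection under the default table
# name PCOLLECTION. The sample rows are illustrative.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([
            beam.Row(product="pixel", sales=7),
            beam.Row(product="nest", sales=3),
            beam.Row(product="pixel", sales=5),
        ])
        | SqlTransform(
            "SELECT product, SUM(sales) AS total_sales "
            "FROM PCOLLECTION GROUP BY product")
        | beam.Map(print)
    )
```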

Apache Beam stream processing

https://mkuthan.github.io/blog/2022/01/28/stream-processing-part1/

https://mkuthan.github.io/blog/2022/03/08/stream-processing-part2/

SCIO Scala API

https://cloud.google.com/blog/products/data-analytics/developing-beam-pipelines-using-scala/

Alternatives

Cloud Dataflow is not the only big data processing engine on Google Cloud. Alternatives include Apache Spark running on the Dataproc service, while Cloud Composer (managed Airflow) and Cloud Workflows cover pipeline orchestration.

https://github.com/manuzhang/awesome-streaming

Why Dataflow?

  • Serverless: you don't have to manage computing resources
  • Processing code is separate from the execution environment
  • Batch and streaming data are processed with the same programming model (see the sketch below)
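
As an illustration of the last point, a hedged sketch in which the same transform is applied to a bounded (batch) source and, unchanged, to an unbounded (streaming) source; the bucket and Pub/Sub names are placeholders:

```python
import apache_beam as beam

# The same element-wise logic works for batch and streaming inputs.
def parse_and_filter(pcoll):
    return (
        pcoll
        | "Parse" >> beam.Map(lambda line: line.strip().lower())
        | "DropEmpty" >> beam.Filter(bool)
    )

# Batch: read a bounded source from Cloud Storage.
with beam.Pipeline() as pipeline:
    lines = pipeline | beam.io.ReadFromText("gs://my-bucket/events-*.txt")
    parse_and_filter(lines) | beam.io.WriteToText("gs://my-bucket/out/batch")

# Streaming: the identical transform applied to an unbounded Pub/Sub
# source (run with the streaming option enabled on Dataflow):
#   lines = pipeline | beam.io.ReadFromPubSub(
#       subscription="projects/my-project/subscriptions/events-sub")
#   parse_and_filter(lines) | ...
```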

https://beam.apache.org/documentation/programming-guide/

Firestore native connector

https://cloud.google.com/blog/products/databases/apache-beam-firestore-connector-released

Getting Started Tutorial

https://medium.com/bb-tutorials-and-thoughts/how-to-get-started-with-gcp-dataflow-822295dce7b4

Data pipeline

https://doppelfelix.medium.com/pipeline-in-the-cloud-6edb007c4d52

Tag Manager

https://cloud.google.com/blog/products/data-analytics/learn-beam-patterns-with-clickstream-processing-of-google-tag-manager-data

Python support

https://cloud.google.com/blog/products/data-analytics/debunking-myths-about-python-on-dataflow

Examples

Word Count Example

https://beam.apache.org/get-started/wordcount-example/

Dataflow pipeline example: PubSub to GCS

https://jtaras.medium.com/building-a-simple-google-cloud-dataflow-pipeline-pubsub-to-google-cloud-storage-9bbf170e8bad

Preprocessing data

https://doppelfelix.medium.com/using-apache-beam-to-automate-your-preprocessing-in-data-science-144a89392f15

Dataflow Mobile Gaming Example

https://beam.apache.org/get-started/mobile-gaming-example/

Streaming into BigQuery using Dataflow

https://medium.com/antvoice-tech/how-we-are-streaming-thousands-of-rows-per-second-into-bigquery-part-i-google-cloud-dataflow-9465fdcd436d
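
A hedged sketch of such a pipeline: read messages from Pub/Sub and stream them into a BigQuery table. The topic, table, and schema are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming pipeline: Pub/Sub -> transform -> BigQuery.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```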

Example Apache Beam Notebook

This example notebook sets up a development environment and works through a simple example using the DirectRunner. You can explore other runners with the Beam Capability Matrix.

Creating a pipeline using Apache Beam interactive runner

Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks.
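
A sketch of what a notebook cell might look like with the interactive runner; the sample data is illustrative:

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Build a pipeline with the interactive runner so intermediate
# PCollections can be materialized and inspected in the notebook.
pipeline = beam.Pipeline(InteractiveRunner())

words = pipeline | beam.Create(["dataflow", "beam", "dataflow"])
counts = words | beam.combiners.Count.PerElement()

# Display the PCollection inline.
ib.show(counts)
# Or pull it into a pandas DataFrame for further exploration.
df = ib.collect(counts)
```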

Interactive Pipeline development using Apache Beam notebooks

https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development

Interactive Beam Pipeline Notebook ML inference at scale

https://cloud.google.com/blog/products/data-analytics/interactive-beam-pipeline-ml-inference-at-scale-in-notebook/

Dataflow Templates

The Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

https://cloud.google.com/blog/products/data-analytics/dataflow-templates-gets-your-data-into-motion

Using provided Dataflow templates

Google provides a set of open-source Dataflow templates.

Creating Dataflow templates

Dataflow templates use runtime parameters to accept values that are only available during pipeline execution. To customize the execution of a templated pipeline, you can pass these parameters to functions that run within the pipeline (such as a DoFn).

To create a template from your Apache Beam pipeline, you must modify your pipeline code to support runtime parameters.
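
A hedged sketch of a classic template pipeline with a runtime parameter; the option names and paths are illustrative:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider argument: resolved when the template is executed.
        parser.add_value_provider_argument(
            "--input", type=str, help="GCS path to read from")
        # Regular argument: fixed when the template is created.
        parser.add_argument(
            "--output", type=str, help="GCS path prefix to write to")

options = MyTemplateOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # I/O connectors such as ReadFromText accept ValueProvider values.
        | beam.io.ReadFromText(options.input)
        | beam.Map(lambda line: line.upper())
        | beam.io.WriteToText(options.output)
    )
```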

https://cloud.google.com/dataflow/docs/guides/templates/creating-templates

Running Dataflow templates

https://cloud.google.com/dataflow/docs/guides/templates/running-templates
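
Besides the console and gcloud, a template can be launched programmatically through the Dataflow REST API. A hedged sketch using the Google API client to launch the provided Word_Count template; project, region, bucket, and job names are placeholders:

```python
from googleapiclient.discovery import build

# Launch a Google-provided template via the Dataflow REST API.
dataflow = build("dataflow", "v1b3")

request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/Word_Count",
    body={
        "jobName": "wordcount-from-template",
        "parameters": {
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": "gs://my-bucket/wordcount/output",
        },
    },
)
response = request.execute()
print(response["job"]["id"])
```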

Templates for Elastic Stack

https://cloud.google.com/blog/products/data-analytics/dataflow-templates-for-elastic-cloud

UDF

A UDF is a JavaScript snippet that implements simple element-processing logic and is provided as an input parameter to the Dataflow pipeline. The UDF JavaScript code runs on the Nashorn JavaScript engine included in the Dataflow worker's Java runtime (applicable to Java pipelines such as the Google-provided Dataflow templates). The code is invoked locally by a Dataflow worker for each element separately, with element payloads serialized and passed back and forth as JSON strings.

Using Dataflow Flex templates

Dataflow Flex templates package the pipeline and its dependencies as a Docker container image, instead of staging a serialized execution graph the way classic templates do.

https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#python

Migrating from App Engine MapReduce to Dataflow

https://cloud.google.com/dataflow/docs/guides/gae-mapreduce-migration

ETL from a relational DB into BigQuery using Dataflow

Logs at scale

https://cloud.google.com/architecture/processing-logs-at-scale-using-dataflow

Streaming data to BigQuery with Dataflow

https://alexfragotsis.medium.com/streaming-data-to-bigquery-with-dataflow-and-real-time-schema-updating-c7a3deba3bad

PubSub to BigQuery

https://towardsdatascience.com/pubsub-to-bigquery-how-to-build-a-data-pipeline-using-dataflow-apache-beam-and-java-afdeb8e711c2

Export data to Elastic Search via Dataflow templates

https://cloud.google.com/blog/products/data-analytics/export-google-cloud-data-into-elasticsearch-with-dataflow-templates

Dataflow CICD with Github actions

https://medium.com/everything-full-stack/dataflow-ci-cd-with-github-actions-65765f09713f

Import data into Firestore

https://medium.com/@larry_nguyen/how-to-import-data-into-google-firestore-2c31614c567c

Linear programming on streaming data within an Apache Beam pipeline

https://lakshmanok.medium.com/how-to-do-product-mix-optimization-in-real-time-d79ac1bf1c97

Wordle

https://medium.com/@inigosj/how-to-properly-play-wordle-using-dataflow-and-bigquery-825d2f4099ac

Apache Beam error handling with Kotlin and Asgarde

https://medium.com/@mazlum.tosun/error-handling-with-apache-beam-asgarde-with-kotlin-8b742fca120e

Splunk

https://cloud.google.com/architecture/deploying-production-ready-log-exports-to-splunk-using-dataflow

https://cloud.google.com/blog/products/data-analytics/simplify-your-splunk-dataflow-ops-with-improved-pipeline-observability

Twitter

https://cloud.google.com/blog/products/data-analytics/how-twitter-modernized-its-data-processing-with-google-cloud

Data Masking with Tokenization using Google Cloud DLP and Google Cloud Dataflow

https://medium.com/@ardhanyoga/data-masking-with-tokenization-using-google-cloud-dlp-and-google-cloud-dataflow-8bba3cc76ef6

Stopping

You cannot delete a Dataflow job; you can only stop it, either by cancelling it (the job stops immediately and buffered, unprocessed data may be lost) or, for streaming jobs, by draining it (the job stops ingesting new data but finishes processing what is already in flight).
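
A hedged sketch of stopping a job through the Dataflow REST API by updating its requested state; project, region, and job ID are placeholders:

```python
from googleapiclient.discovery import build

# Request JOB_STATE_CANCELLED to cancel immediately, or
# JOB_STATE_DRAINED to drain a streaming job.
dataflow = build("dataflow", "v1b3")

dataflow.projects().locations().jobs().update(
    projectId="my-project",
    location="us-central1",
    jobId="2024-01-01_00_00_00-1234567890123456789",
    body={"requestedState": "JOB_STATE_DRAINED"},
).execute()
```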

https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline