Dataflow - bobbae/gcp GitHub Wiki
Dataflow is a managed service for executing a wide variety of data processing patterns.
Google Cloud Dataflow makes it easy to process and analyze real-time streaming data so that you can derive insights and react to new information in real time.
https://www.youtube.com/watch?v=7lJyq1hw_KI
Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms.
https://www.youtube.com/watch?v=cqDBnOaS6O4
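As a minimal sketch of what that looks like in the Beam Python SDK: each transform application yields a new PCollection, and Dataflow builds the execution graph from this chain of transforms.

```python
import apache_beam as beam

# Every "|" step applies a transform and yields a new PCollection;
# Dataflow turns this chain into the pipeline's execution graph.
with beam.Pipeline() as pipeline:
    numbers = pipeline | 'Create' >> beam.Create([1, 2, 3, 4])
    squares = numbers | 'Square' >> beam.Map(lambda x: x * x)
    squares | 'Print' >> beam.Map(print)
```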
Apache Beam
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines.
You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
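As a rough illustration, a Beam pipeline is typically pointed at the Dataflow service through pipeline options; the project, region, and bucket names below are placeholders, and dropping the Dataflow-specific options runs the same code locally with the DirectRunner.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; swap in your own to submit to Dataflow.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    job_name='beam-on-dataflow-example',
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
```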
Apache Beam SQL
https://medium.com/syntio/data-processing-with-dataflow-sql-part-1-2-fe57e47f4bb0
Apache Beam stream processing
https://mkuthan.github.io/blog/2022/01/28/stream-processing-part1/
https://mkuthan.github.io/blog/2022/03/08/stream-processing-part2/
SCIO Scala API
https://cloud.google.com/blog/products/data-analytics/developing-beam-pipelines-using-scala/
Alternatives
Cloud Dataflow is not the only big data processing engine on Google Cloud. Alternatives include Apache Spark on Cloud Dataproc, Cloud Composer (based on Apache Airflow), and Cloud Workflows.
https://github.com/manuzhang/awesome-streaming
Why Dataflow?
- Serverless: no need to manage compute resources
- Processing code is separate from the execution environment
- Batch and streaming processing with the same programming model (a sketch follows below)
https://beam.apache.org/documentation/programming-guide/
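To illustrate the last point, here is a rough sketch of the same transform applied to a bounded (batch) source and to an unbounded Pub/Sub source; the topic name is a placeholder.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same transform works on a bounded (batch) and an unbounded (streaming)
# PCollection; only the source changes.
def extract_user(event):
    return event.split(',')[0]

# Batch: a small in-memory bounded source.
with beam.Pipeline() as batch_pipeline:
    (batch_pipeline
     | beam.Create(['alice,login', 'bob,click'])
     | 'UsersBatch' >> beam.Map(extract_user)
     | beam.Map(print))

# Streaming: the identical transform applied to Pub/Sub messages.
streaming_pipeline = beam.Pipeline(options=PipelineOptions(streaming=True))
(streaming_pipeline
 | beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
 | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
 | 'UsersStream' >> beam.Map(extract_user))
# streaming_pipeline.run() would start the unbounded job.
```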
Firestore native connector
https://cloud.google.com/blog/products/databases/apache-beam-firestore-connector-released
Getting Started Tutorial
https://medium.com/bb-tutorials-and-thoughts/how-to-get-started-with-gcp-dataflow-822295dce7b4
Data pipeline
https://doppelfelix.medium.com/pipeline-in-the-cloud-6edb007c4d52
Tag Manager
Python support
https://cloud.google.com/blog/products/data-analytics/debunking-myths-about-python-on-dataflow
Examples
Word Count Example
https://beam.apache.org/get-started/wordcount-example/
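A condensed word count along the lines of the linked example; the input is the public Shakespeare sample file and the output path is a placeholder (use a gs:// path to run on Dataflow).

```python
import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

# Read text, split into words, count occurrences, write the results.
with beam.Pipeline() as pipeline:
    (pipeline
     | 'Read' >> ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
     | 'ExtractWords' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | 'CountWords' >> beam.combiners.Count.PerElement()
     | 'Format' >> beam.MapTuple(lambda word, count: f'{word}: {count}')
     | 'Write' >> WriteToText('/tmp/wordcount', file_name_suffix='.txt'))
```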
Dataflow pipeline example: PubSub to GCS
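A rough sketch of such a pipeline, assuming fileio.WriteToFiles with its default text sink; the subscription and bucket names are placeholders. Messages are grouped into fixed windows so each window's data is written out as files on GCS.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder subscription and bucket; streaming mode is required for Pub/Sub.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/my-sub')
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Window' >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | 'Write' >> fileio.WriteToFiles(path='gs://my-bucket/output/'))
```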
Preprocessing data
Dataflow Mobile Gaming Example
https://beam.apache.org/get-started/mobile-gaming-example/
Streaming into BigQuery using Dataflow
Example Apache Beam Notebook
This notebook sets up a development environment and works through a simple example using the DirectRunner. You can explore other runners with the Beam Capability Matrix.
Creating a pipeline using Apache Beam interactive runner
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks.
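A small sketch of that workflow in a notebook cell: build the pipeline with the interactive runner, then inspect an intermediate PCollection without submitting a full job.

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Build a pipeline against the interactive runner inside a notebook.
pipeline = beam.Pipeline(InteractiveRunner())
words = pipeline | beam.Create(['beam', 'dataflow', 'beam'])
counts = words | beam.combiners.Count.PerElement()

ib.show(counts)           # render the PCollection in the notebook
df = ib.collect(counts)   # or materialize it as a pandas DataFrame
```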
Interactive Pipeline development using Apache Beam notebooks
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development
Interactive Beam Pipeline Notebook ML inference at scale
Dataflow Templates
The Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-gets-your-data-into-motion
Using provided Dataflow templates
Google provides a set of open-source Dataflow templates.
Creating Dataflow templates
Dataflow templates use runtime parameters to accept values that are only available during pipeline execution. To customize the execution of a templated pipeline, you can pass these parameters to functions that run within the pipeline (such as a DoFn).
To create a template from your Apache Beam pipeline, you must modify your pipeline code to support runtime parameters.
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
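A minimal sketch of the classic-template pattern: the parameter is declared with add_value_provider_argument and only resolved with .get() at execution time. The --greeting parameter is made up for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameter: its value is supplied when the template is launched.
        parser.add_value_provider_argument(
            '--greeting', type=str, default='hello',
            help='Value supplied at template launch time')

class AddGreeting(beam.DoFn):
    def __init__(self, greeting):
        self.greeting = greeting  # a ValueProvider, not a plain string

    def process(self, element):
        yield f'{self.greeting.get()} {element}'  # resolved at runtime

pipeline_options = PipelineOptions()
template_options = pipeline_options.view_as(TemplateOptions)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (pipeline
     | beam.Create(['world'])
     | beam.ParDo(AddGreeting(template_options.greeting))
     | beam.Map(print))
```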
Running Dataflow templates
https://cloud.google.com/dataflow/docs/guides/templates/running-templates
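One way to launch a template programmatically is the Dataflow REST API (v1b3) through the Google API discovery client; gcloud or the Cloud Console work as well. In this hedged sketch the project, region, bucket, and job name are placeholders, and the template path points at the Google-provided Word Count template.

```python
from googleapiclient.discovery import build

# Launch a classic template job via the Dataflow REST API (v1b3).
dataflow = build('dataflow', 'v1b3')

request = dataflow.projects().locations().templates().launch(
    projectId='my-project-id',
    location='us-central1',
    gcsPath='gs://dataflow-templates/latest/Word_Count',
    body={
        'jobName': 'wordcount-from-template',
        'parameters': {
            'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
            'output': 'gs://my-bucket/wordcount/output',
        },
    },
)
response = request.execute()
print(response['job']['id'])
```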
Templates for Elastic Stack
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-for-elastic-cloud
UDF
A UDF is a JavaScript snippet that implements simple element-processing logic and is provided as an input parameter to the Dataflow pipeline. The UDF JavaScript code runs on the Nashorn JavaScript engine included in the Dataflow worker's Java runtime (applicable to Java pipelines such as the Google-provided Dataflow templates). The code is invoked locally by a Dataflow worker for each element separately; element payloads are serialized and passed back and forth as JSON strings.
Using Dataflow Flex templates
Dataflow Flex templates package the pipeline and its dependencies as a Docker container image.
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#python
Migrating from App Engine Map/reduce to Dataflow
https://cloud.google.com/dataflow/docs/guides/gae-mapreduce-migration
ETL from a relational DB into BigQuery using Dataflow
Logs at scale
https://cloud.google.com/architecture/processing-logs-at-scale-using-dataflow
Streaming data to BigQuery with Dataflow
PubSub to BigQuery
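A rough sketch of such a pipeline, assuming the Pub/Sub messages are JSON objects; the subscription, table, and schema are placeholders.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder subscription, table, and schema.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/events-sub')
     | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.events',
           schema='user:STRING,action:STRING,ts:TIMESTAMP',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```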
Export data to Elastic Search via Dataflow templates
Dataflow CICD with Github actions
https://medium.com/everything-full-stack/dataflow-ci-cd-with-github-actions-65765f09713f
Import data into Firestore
https://medium.com/@larry_nguyen/how-to-import-data-into-google-firestore-2c31614c567c
Linear programming on streaming data within an Apache Beam pipeline
https://lakshmanok.medium.com/how-to-do-product-mix-optimization-in-real-time-d79ac1bf1c97
Wordle
https://medium.com/@inigosj/how-to-properly-play-wordle-using-dataflow-and-bigquery-825d2f4099ac
Apache Beam error handling with Kotlin and Asgarde
https://medium.com/@mazlum.tosun/error-handling-with-apache-beam-asgarde-with-kotlin-8b742fca120e
Splunk
Data Masking with Tokenization using Google Cloud DLP and Google Cloud Dataflow
Stopping
You cannot delete a Dataflow job; you can only stop it.
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline