Dataflow - bobbae/gcp GitHub Wiki
Dataflow is a managed service for executing a wide variety of data processing patterns.
Google Cloud Dataflow makes it easy to process and analyze real-time streaming data so that you can derive insights and react to new information in real time.
https://www.youtube.com/watch?v=7lJyq1hw_KI
Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms.
https://www.youtube.com/watch?v=cqDBnOaS6O4
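As a minimal sketch of what that looks like in the Beam Python SDK: each transform application yields a new PCollection, and Dataflow builds the execution graph from this chain of transforms.

```python
import apache_beam as beam

# Every "|" step applies a transform and yields a new PCollection;
# Dataflow turns this chain into the pipeline's execution graph.
with beam.Pipeline() as pipeline:
    numbers = pipeline | 'Create' >> beam.Create([1, 2, 3, 4])
    squares = numbers | 'Square' >> beam.Map(lambda x: x * x)
    squares | 'Print' >> beam.Map(print)
```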
Apache Beam
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines.
You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
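As a rough illustration, a Beam pipeline is typically pointed at the Dataflow service through pipeline options; the project, region, and bucket names below are placeholders, and dropping the Dataflow-specific options runs the same code locally with the DirectRunner.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; swap in your own to submit to Dataflow.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    job_name='beam-on-dataflow-example',
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
```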
Apache Beam SQL
https://medium.com/syntio/data-processing-with-dataflow-sql-part-1-2-fe57e47f4bb0
Apache Beam stream processing
https://mkuthan.github.io/blog/2022/01/28/stream-processing-part1/
https://mkuthan.github.io/blog/2022/03/08/stream-processing-part2/
SCIO Scala API
https://cloud.google.com/blog/products/data-analytics/developing-beam-pipelines-using-scala/
Alternatives
Cloud Dataflow is not the only big data processing engine on Google Cloud. Alternatives include Apache Spark on Cloud Dataproc, Cloud Composer (based on Apache Airflow), and Cloud Workflows.
https://github.com/manuzhang/awesome-streaming
Why Dataflow?
- Serverless: no need to manage compute resources
- Processing code is separate from the execution environment
- Batch and streaming processing with the same programming model (a sketch follows below)
https://beam.apache.org/documentation/programming-guide/
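To illustrate the last point, here is a rough sketch of the same transform applied to a bounded (batch) source and to an unbounded Pub/Sub source; the topic name is a placeholder.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same transform works on a bounded (batch) and an unbounded (streaming)
# PCollection; only the source changes.
def extract_user(event):
    return event.split(',')[0]

# Batch: a small in-memory bounded source.
with beam.Pipeline() as batch_pipeline:
    (batch_pipeline
     | beam.Create(['alice,login', 'bob,click'])
     | 'UsersBatch' >> beam.Map(extract_user)
     | beam.Map(print))

# Streaming: the identical transform applied to Pub/Sub messages.
streaming_pipeline = beam.Pipeline(options=PipelineOptions(streaming=True))
(streaming_pipeline
 | beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
 | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
 | 'UsersStream' >> beam.Map(extract_user))
# streaming_pipeline.run() would start the unbounded job.
```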
Firestore native connector
https://cloud.google.com/blog/products/databases/apache-beam-firestore-connector-released
Getting Started Tutorial
https://medium.com/bb-tutorials-and-thoughts/how-to-get-started-with-gcp-dataflow-822295dce7b4
Data pipeline
https://doppelfelix.medium.com/pipeline-in-the-cloud-6edb007c4d52
Tag Manager
Python support
https://cloud.google.com/blog/products/data-analytics/debunking-myths-about-python-on-dataflow
Examples
Word Count Example
https://beam.apache.org/get-started/wordcount-example/
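A condensed word count along the lines of the linked example; the input is the public Shakespeare sample file and the output path is a placeholder (use a gs:// path to run on Dataflow).

```python
import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

# Read text, split into words, count occurrences, write the results.
with beam.Pipeline() as pipeline:
    (pipeline
     | 'Read' >> ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
     | 'ExtractWords' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | 'CountWords' >> beam.combiners.Count.PerElement()
     | 'Format' >> beam.MapTuple(lambda word, count: f'{word}: {count}')
     | 'Write' >> WriteToText('/tmp/wordcount', file_name_suffix='.txt'))
```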
Dataflow pipeline example: PubSub to GCS
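A rough sketch of such a pipeline, assuming fileio.WriteToFiles with its default text sink; the subscription and bucket names are placeholders. Messages are grouped into fixed windows so each window's data is written out as files on GCS.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder subscription and bucket; streaming mode is required for Pub/Sub.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/my-sub')
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Window' >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | 'Write' >> fileio.WriteToFiles(path='gs://my-bucket/output/'))
```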
Preprocessing data
Dataflow Mobile Gaming Example
https://beam.apache.org/get-started/mobile-gaming-example/
Streaming into BigQuery using Dataflow
Example Apache Beam Notebook
This notebook sets up a development environment and works through a simple example using the DirectRunner. You can explore other runners with the Beam Capability Matrix.
Creating a pipeline using Apache Beam interactive runner
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks.
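A small sketch of that workflow in a notebook cell: build the pipeline with the interactive runner, then inspect an intermediate PCollection without submitting a full job.

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Build a pipeline against the interactive runner inside a notebook.
pipeline = beam.Pipeline(InteractiveRunner())
words = pipeline | beam.Create(['beam', 'dataflow', 'beam'])
counts = words | beam.combiners.Count.PerElement()

ib.show(counts)           # render the PCollection in the notebook
df = ib.collect(counts)   # or materialize it as a pandas DataFrame
```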
Interactive Pipeline development using Apache Beam notebooks
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development
Interactive Beam Pipeline Notebook ML inference at scale
Dataflow Templates
The Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-gets-your-data-into-motion
Using provided Dataflow templates
Google provides a set of open-source Dataflow templates.
Creating Dataflow templates
Dataflow templates use runtime parameters to accept values that are only available during pipeline execution. To customize the execution of a templated pipeline, you can pass these parameters to functions that run within the pipeline (such as a DoFn).
To create a template from your Apache Beam pipeline, you must modify your pipeline code to support runtime parameters.
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
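A minimal sketch of the classic-template pattern: the parameter is declared with add_value_provider_argument and only resolved with .get() at execution time. The --greeting parameter is made up for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameter: its value is supplied when the template is launched.
        parser.add_value_provider_argument(
            '--greeting', type=str, default='hello',
            help='Value supplied at template launch time')

class AddGreeting(beam.DoFn):
    def __init__(self, greeting):
        self.greeting = greeting  # a ValueProvider, not a plain string

    def process(self, element):
        yield f'{self.greeting.get()} {element}'  # resolved at runtime

pipeline_options = PipelineOptions()
template_options = pipeline_options.view_as(TemplateOptions)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (pipeline
     | beam.Create(['world'])
     | beam.ParDo(AddGreeting(template_options.greeting))
     | beam.Map(print))
```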
Running Dataflow templates
https://cloud.google.com/dataflow/docs/guides/templates/running-templates
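One way to launch a template programmatically is the Dataflow REST API (v1b3) through the Google API discovery client; gcloud or the Cloud Console work as well. In this hedged sketch the project, region, bucket, and job name are placeholders, and the template path points at the Google-provided Word Count template.

```python
from googleapiclient.discovery import build

# Launch a classic template job via the Dataflow REST API (v1b3).
dataflow = build('dataflow', 'v1b3')

request = dataflow.projects().locations().templates().launch(
    projectId='my-project-id',
    location='us-central1',
    gcsPath='gs://dataflow-templates/latest/Word_Count',
    body={
        'jobName': 'wordcount-from-template',
        'parameters': {
            'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
            'output': 'gs://my-bucket/wordcount/output',
        },
    },
)
response = request.execute()
print(response['job']['id'])
```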
Templates for Elastic Stack
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-for-elastic-cloud
UDF
A UDF is a JavaScript snippet that implements simple element-processing logic and is provided as an input parameter to the Dataflow pipeline. The UDF JavaScript code runs on the Nashorn JavaScript engine included in the Dataflow worker's Java runtime (applicable to Java pipelines such as the Google-provided Dataflow templates). The code is invoked locally by a Dataflow worker for each element separately; element payloads are serialized and passed back and forth as JSON strings.
Using Dataflow Flex templates
Dataflow Flex templates package the pipeline and its dependencies as a Docker container image.
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#python
Migrating from App Engine Map/reduce to Dataflow
https://cloud.google.com/dataflow/docs/guides/gae-mapreduce-migration
ETL from a relational DB into BigQuery using Dataflow
Logs at scale
https://cloud.google.com/architecture/processing-logs-at-scale-using-dataflow
Streaming data to BigQuery with Dataflow
PubSub to BigQuery
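A rough sketch of such a pipeline, assuming the Pub/Sub messages are JSON objects; the subscription, table, and schema are placeholders.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder subscription, table, and schema.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/events-sub')
     | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.events',
           schema='user:STRING,action:STRING,ts:TIMESTAMP',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```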
Export data to Elastic Search via Dataflow templates
Dataflow CICD with Github actions
https://medium.com/everything-full-stack/dataflow-ci-cd-with-github-actions-65765f09713f
Import data into Firestore
https://medium.com/@larry_nguyen/how-to-import-data-into-google-firestore-2c31614c567c
Linear programming on streaming data within an Apache Beam pipeline
https://lakshmanok.medium.com/how-to-do-product-mix-optimization-in-real-time-d79ac1bf1c97
Wordle
https://medium.com/@inigosj/how-to-properly-play-wordle-using-dataflow-and-bigquery-825d2f4099ac
Apache Beam error handling with Kotlin and Asgarde
https://medium.com/@mazlum.tosun/error-handling-with-apache-beam-asgarde-with-kotlin-8b742fca120e
Splunk
Data Masking with Tokenization using Google Cloud DLP and Google Cloud Dataflow
Stopping
You cannot delete a Dataflow job; you can only stop it.
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline