GCP Dataflow

Upgrade packages and install Apache Beam

sudo apt-get install -y python-pip
pip install -U pip
pip -V

pip install 'apache-beam[gcp]' oauth2client==3.0.0
# downgrade six, as 1.11 breaks apitools
sudo pip install --force-reinstall six==1.10.0

Runners

  • DirectRunner - runs the pipeline locally, which is useful for development and testing.
  • DataflowRunner - runs the pipeline on the Google Cloud Dataflow service (selected via --runner=DataflowRunner).
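
A minimal sketch of selecting the runner through PipelineOptions; the project ID, region, and bucket names are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; DataflowRunner submits the job to the
# Dataflow service and needs a project, region, and staging location.
options = PipelineOptions(
    runner='DirectRunner',               # or 'DataflowRunner'
    project='my-gcp-project',            # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['hello', 'world'])
     | beam.Map(lambda word: word.upper()))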

Concepts

  • Event time, which is the time at which events actually occurred (see the timestamping sketch after this list).

  • Processing time, which is the time at which events are observed in the system.

  • Windowing (i.e., partitioning a data set along temporal boundaries), which is a common approach used to cope with the fact that unbounded data sources technically may never end.

  • Triggers: a trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal.

  • Accumulation: an accumulation mode specifies the relationship between multiple results that are observed for the same window.
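
A short sketch of the event-time vs. processing-time distinction: each record is assigned its own occurrence time, so downstream windowing groups by when events happened rather than when they arrived. The 'event_time' field and the events PCollection are hypothetical:

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

def with_event_time(record):
    # Assume each record carries its occurrence time as Unix seconds
    # in a hypothetical 'event_time' field.
    return TimestampedValue(record, record['event_time'])

# 'events' stands in for an upstream PCollection of dicts.
timestamped = events | beam.Map(with_event_time)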

The Google Dataflow model encourages users to ask four questions in order to understand the approach required when processing data:

  1. What are you computing?
  2. Where in event time?
  3. When in processing time?
  4. How do refinements relate?

Src: https://www.infoq.com/articles/dataflow-apache-beam

What results are calculated? This question is answered by the types of transformations within the pipeline. This includes things like computing sums, building histograms, training machine learning models, etc. It’s also essentially the question answered by classic batch processing.

Where in event time are results calculated? This question is answered by the use of event-time windowing within the pipeline. This includes the common examples of windowing from Streaming 101 (fixed, sliding, and sessions), use cases that seem to have no notion of windowing (e.g., time-agnostic processing as described in Streaming 101; classic batch processing also generally falls into this category), and other, more complex types of windowing, such as time-limited auctions. Also note that it can include processing-time windowing as well, if one assigns ingress times as event times for records as they arrive at the system.

When in processing time are results materialized? This question is answered by the use of watermarks and triggers. There are infinite variations on this theme, but the most common pattern uses the watermark to delineate when input for a given window is complete, with triggers allowing the specification of early results (for speculative, partial results emitted before the window is complete) and late results (for cases where the watermark is only an estimate of completeness, and more input data may arrive after the watermark claims the input for a given window is complete).

How do refinements of results relate? This question is answered by the type of accumulation used: discarding (where results are all independent and distinct), accumulating (where later results build upon prior ones), or accumulating and retracting (where both the accumulating value plus a retraction for the previously triggered value(s) are emitted).

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
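
A sketch mapping the four questions onto Beam primitives, assuming a hypothetical keyed, timestamped PCollection named events: the transform answers what, the window answers where in event time, the trigger answers when in processing time, and the accumulation mode answers how refinements relate:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode)

scores = (
    events
    | beam.WindowInto(
        window.FixedWindows(60),             # Where: 1-minute event-time windows
        trigger=AfterWatermark(              # When: at the watermark...
            early=AfterProcessingTime(30),   # ...with speculative early panes
            late=AfterCount(1)),             # ...and a pane per late element
        accumulation_mode=AccumulationMode.ACCUMULATING,  # How: panes build on earlier ones
        allowed_lateness=120)                # keep window state 2 min past the watermark
    | beam.CombinePerKey(sum))               # What: a per-key sum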

Types of Windows

  • Fixed (tumbling)
  • Sliding (hopping)
  • Data-dependent (e.g. sessions); see the sketch below
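
A minimal sketch of the three window types in the Beam Python SDK, applied to a hypothetical PCollection named events:

import apache_beam as beam
from apache_beam import window

# Fixed (tumbling) windows: non-overlapping, 60 seconds each.
fixed = events | 'Fixed' >> beam.WindowInto(window.FixedWindows(60))

# Sliding windows: 60 seconds long, with a new window starting every 15 seconds.
sliding = events | 'Sliding' >> beam.WindowInto(window.SlidingWindows(60, 15))

# Data-dependent windows: sessions that close after a 10-minute gap in activity.
sessions = events | 'Sessions' >> beam.WindowInto(window.Sessions(10 * 60))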

Out-of-Order Processing (OOP)

Watermark

  • A watermark is a threshold that specifies how long the system waits for late events.
  • It constrains the system so that aggregates are not kept indefinitely, which would otherwise cause memory usage to grow without bound.
  • It addresses the issue of late-arriving data.
  • When the data-processing system receives a watermark timestamp, it assumes it will not see any message older than that timestamp (see the sketch after this list).
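
A sketch of watermark behaviour using Beam's TestStream, which lets a test advance the watermark explicitly; the keys, values, and timestamps are made up:

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode
from apache_beam.transforms.window import TimestampedValue

stream = (TestStream()
          .add_elements([TimestampedValue(('user', 1), 10)])  # on-time: t=10
          .advance_watermark_to(70)  # system now assumes no data older than t=70
          .add_elements([TimestampedValue(('user', 2), 15)])  # late: t=15 < watermark
          .advance_watermark_to_infinity())

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | stream
     | beam.WindowInto(
         window.FixedWindows(60),
         trigger=AfterWatermark(late=AfterCount(1)),
         accumulation_mode=AccumulationMode.ACCUMULATING,
         allowed_lateness=30)  # the late element at t=15 is still within bounds
     | beam.CombinePerKey(sum))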

Low Watermark

Src: https://www.youtube.com/watch?v=TWxSLmkWPm4

  • Refers to the oldest work not yet complete, i.e. the "oldest unprocessed data" point

  • Input Low Watermark: oldest work not yet sent to the streaming stage

  • Output Low Watermark: oldest work not yet completed by the streaming stage

Apache Beam

  • Pipeline - encapsulates your entire data processing task, from start to finish
  • PCollection - represents a distributed data set that your Beam pipeline operates on
  • PTransform - represents a data processing operation, or a step, in your pipeline
  • ParDo - considers each element in the input PCollection, performs some processing function on that element, and emits zero, one, or multiple elements (see the sketch below)
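
A minimal sketch tying the four abstractions together; the input strings are made up:

import apache_beam as beam

class SplitWords(beam.DoFn):
    # A ParDo's DoFn may emit zero, one, or many elements per input element.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:                                # Pipeline: the whole job
    lines = p | beam.Create(['the quick brown fox', ''])  # PCollection: distributed data set
    words = lines | beam.ParDo(SplitWords())              # PTransform: here, a ParDo step
    # '' splits into no words, so ParDo emits zero elements for that input.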