GCS DataFlow
MillWheel, Google's internal stream-processing system and a predecessor of Dataflow, was started in 2008, led by Paul Nordstrom.
Dataflow is serverless.
Consider the Lambda Architecture, where users run parallel copies of a pipeline (one streaming, one batch) in order to have a "fast" copy of (often partial) results as well as a correct one. There are a number of drawbacks to this approach, including the obvious costs (computational and operational, as well as development costs) of running two systems instead of one. It's also important to note that Lambda Architectures often use systems with very different software ecosystems, making it challenging to replicate complex application logic across both. Finally, it's non-trivial to reconcile the outputs of the two pipelines at the end. This is a key problem that we've solved with Dataflow—users write their application logic once, and can choose whether they would like fast (but potentially incomplete) results, slow (but correct) results, or both.
To help demonstrate Dataflow's advantage over Lambda Architectures, let’s consider the use case of a large retailer with online and in-store sales. These retailers would benefit from in-store BI dashboards, used by in-store employees, that could show regional and global inventory to help shoppers find what they’re looking for, and to let the retailers know what’s been popular with their customers. The dashboards could also be used to drive inventory distribution decisions from a central or regional team. In a Lambda Architecture, these systems would likely have delays in updates that are corrected later by batch processes, but before those corrections are made, they could misrepresent availability for low-inventory items, particularly during high-volume times like the holidays. Poor results in retail can lead to bad customer experiences, but in other fields like cybersecurity, they can lead to complacency and ignored intrusion alerts. With Dataflow, this data would always be up-to-date, ensuring a better experience for customers by avoiding promises of inventory that’s not available—or in cybersecurity, an alerting system that can be trusted.
Google Cloud Dataflow, the fully-managed service for stream processing of data on Google Cloud Platform (GCP), provides exactly-once processing of every record.
The Apache Beam SDK is an open source programming model for data pipelines. You define these pipelines with an Apache Beam program and can choose a runner, such as Dataflow, to execute your pipeline.
Apache Beam is a portable data processing programming model. It's open source, and it can be run in a highly distributed fashion. It's unified, that is, a single model, meaning your pipeline code can work for both batch and streaming data. You're not locked into a vendor, because the code is portable and can work on multiple execution environments, like Cloud Dataflow or Apache Spark, among others. It's extensible and open source: you can browse and write your own connectors, and build transformation libraries too if you want. Apache Beam pipelines are written in Java, Python, or Go. The SDK provides a host of libraries for transformations and existing data connectors to sources and sinks. Apache Beam creates a model representation of your code, which is portable across many runners. Runners pass your model off to an execution environment, which could be one of many different possible engines; Cloud Dataflow is one of the popular choices for running Apache Beam.
The Cloud Dataflow service optimizes the execution graph of your model to remove inefficiencies. It schedules out work in a distributed fashion to workers and scales as needed. It will auto-heal in the event of faults with those workers. It will re-balance automatically to best utilize the underlying workers, and it will connect to a variety of data sinks to produce a result. BigQuery is just one of the many sinks that you can output data to. What you won't see, by design, is all the compute and storage resources that Cloud Dataflow is managing for you elastically to fit the demand of your streaming data pipeline. That's fully managed. And even if you're a seasoned Java or Python developer, it's still a good idea to start from one of the many existing Dataflow templates that cover common use cases across Google Cloud Platform products.
beam-programming-model
A pipeline encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data. Apache Beam programs start by constructing a Pipeline object, and then using that object as the basis for creating the pipeline's datasets. Each pipeline represents a single, repeatable job.
A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline's data. Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline. A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source.
A transform represents a processing operation that transforms data. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces one or more PCollections as output.
ParDo is the core parallel processing operation in the Apache Beam SDKs, invoking a user-specified function on each of the elements of the input PCollection. ParDo collects the zero or more output elements into an output PCollection. The ParDo transform processes elements independently and possibly in parallel.
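A minimal sketch of a ParDo, assuming an upstream PCollection<String> of text lines named lines; the DoFn name ExtractWordsFn is illustrative:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Illustrative DoFn: splits each input line into words.
static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    for (String word : line.split("[^\\p{L}]+")) {
      if (!word.isEmpty()) {
        out.output(word); // zero or more outputs per input element
      }
    }
  }
}

// Invoke the DoFn on every element of the input PCollection, possibly in parallel.
PCollection<String> words = lines.apply("ExtractWords", ParDo.of(new ExtractWordsFn()));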
Apache Beam I/O connectors let you read data into your pipeline and write output data from your pipeline. An I/O connector consists of a source and a sink. All Apache Beam sources and sinks are transforms that let your pipeline work with data from several different data storage formats. You can also write a custom I/O connector.
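A hedged sketch of a streaming source and a file sink using connectors from the Beam Java SDK, assuming a Pipeline object p; the project, subscription, and bucket names are placeholders, and windowedResults stands in for a windowed collection of output strings:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// Source: read an unbounded stream of messages from a Pub/Sub subscription.
PCollection<String> messages =
    p.apply("ReadFromPubSub",
        PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"));

// Sink: write a windowed collection of results out as text files on Cloud Storage.
windowedResults.apply("WriteToGcs",
    TextIO.write().to("gs://my-bucket/output").withWindowedWrites());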
Aggregation is the process of computing some value from multiple input elements. The primary computational pattern for aggregation in Apache Beam is to group all elements with a common key and window. Then, it combines each group of elements using an associative and commutative operation.
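A sketch of two common aggregations, assuming the words collection from the ParDo sketch above and a hypothetical salesByStore collection of KV<String, Integer> pairs:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Count occurrences of each word: group by key, then combine each group.
PCollection<KV<String, Long>> wordCounts =
    words.apply("CountWords", Count.perElement());

// Sum values per key, e.g. sales amounts keyed by store ID.
PCollection<KV<String, Integer>> totalsPerStore =
    salesByStore.apply("SumPerStore", Sum.integersPerKey());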
User-defined functions (UDFs) let some operations in Apache Beam run user-defined code as a way of configuring the transform. For ParDo, for example, the user-defined code specifies the operation to apply to every element.
Runners are the software that accepts a pipeline and executes it. Most runners are translators or adapters to massively parallel big-data processing systems. Other runners exist for local testing and debugging.
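For illustration, one way to point a Beam pipeline at the Dataflow runner from Java; the project, region, and bucket values below are placeholders:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Configure the pipeline to execute on the Dataflow service rather than locally.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-project-id");
options.setRegion("us-central1");
options.setTempLocation("gs://my-bucket/temp");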
Event time is the time a data event occurs, determined by the timestamp on the data element itself. This contrasts with processing time, the time the actual data element gets processed at any stage in the pipeline.
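A sketch of assigning event-time timestamps explicitly, assuming a hypothetical SalesEvent element type with a getTimestampMillis() accessor and an input collection named events:

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

// Stamp each element with the time the event occurred (taken from the element itself),
// rather than the time at which the pipeline happens to process it.
PCollection<SalesEvent> timestamped =
    events.apply("AssignEventTime",
        WithTimestamps.of((SalesEvent e) -> new Instant(e.getTimestampMillis())));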
Windowing enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements. A windowing function tells the runner how to assign elements to an initial window, and how to merge windows of grouped elements. Apache Beam lets you define different kinds of windows or use the predefined windowing functions.
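Continuing the sketch above, the timestamped elements could be grouped into fixed one-minute windows based on their event-time timestamps; the window size is arbitrary:

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Divide the unbounded collection into finite, one-minute event-time windows.
PCollection<SalesEvent> windowed =
    timestamped.apply("FixedWindows",
        Window.<SalesEvent>into(FixedWindows.of(Duration.standardMinutes(1))));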
Apache Beam tracks a watermark, which is the system's notion of when all data in a certain window can be expected to have arrived in the pipeline. Apache Beam tracks a watermark because data is not guaranteed to arrive in a pipeline in time order or at predictable intervals. In addition, there are no guarantees that data events will appear in the pipeline in the same order that they were generated.
Triggers determine when to emit aggregated results as data arrives. For bounded data, results are emitted after all of the input has been processed. For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed. Apache Beam provides several predefined triggers and lets you combine them.
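A hedged example that combines the windowing sketch above with a trigger emitting early (speculative), on-time, and late (corrected) panes; the 30-second and 10-minute values are illustrative:

import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Emit speculative results every 30 seconds while a window is open, an on-time
// result when the watermark passes the end of the window, and corrected results
// for data arriving up to 10 minutes late.
PCollection<SalesEvent> triggered =
    timestamped.apply("WindowAndTrigger",
        Window.<SalesEvent>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(30)))
                    .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());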
https://beam.apache.org/documentation/pipelines/create-your-pipeline/
- Creating a Pipeline Object
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline.
Pipeline p = Pipeline.create(options);
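As a variant (not part of the original snippet), options are often built from command-line arguments so the runner, project, and other settings can be supplied at launch time; this assumes the String[] args from main:

// Parse --runner, --project, etc. from the command line and validate them.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);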
- Reading Data Into Your Pipeline
To create your pipeline’s initial PCollection, you apply a root transform to your pipeline object. A root transform creates a PCollection from either an external data source or some local data you specify. There are two kinds of root transforms in the Beam SDKs: Read and Create. Read transforms read data from an external source, such as a text file or a database table. Create transforms create a PCollection from an in-memory java.util.Collection.
The following example code shows how to apply a TextIO.Read root transform to read data from a text file. The transform is applied to a Pipeline object p, and returns a pipeline data set in the form of a PCollection:
PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://some/inputData.txt"));
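For comparison, a Create root transform can build a small in-memory PCollection, which is handy for tests and examples; the element values here are arbitrary:

import java.util.Arrays;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// Build a bounded PCollection from in-memory data.
PCollection<String> colors =
    p.apply("CreateColors", Create.of(Arrays.asList("red", "green", "blue")));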
- Applying Transforms to Process Pipeline Data
You can manipulate your data using the various transforms provided in the Beam SDKs. To do this, you apply the transforms to your pipeline's PCollection by calling the apply method on each PCollection that you want to process and passing the desired transform object as an argument.
The input is a PCollection called words; the code passes an instance of a PTransform object called ReverseWords to apply, and saves the return value as the PCollection called reversedWords.
PCollection<String> words = ...;
PCollection<String> reversedWords = words.apply(new ReverseWords());
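ReverseWords itself is not defined here; a possible sketch of such a composite PTransform, assumed for illustration to reverse the characters of each word, might look like this:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Illustrative composite transform: reverses the characters of each input word.
static class ReverseWords extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input.apply("ReverseEachWord",
        MapElements.into(TypeDescriptors.strings())
            .via((String word) -> new StringBuilder(word).reverse().toString()));
  }
}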
- Writing or Outputting Your Final Pipeline Data
PCollection<String> filteredWords = ...;
filteredWords.apply("WriteMyFile", TextIO.write().to("gs://some/outputData.txt"));
- Running Your Pipeline
p.run(); // asynchronous: returns immediately and the pipeline runs in the background
p.run().waitUntilFinish(); // blocks until the pipeline completes
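Optionally, the PipelineResult returned by run() can be captured to inspect the pipeline's final state once it completes; a brief sketch:

import org.apache.beam.sdk.PipelineResult;

PipelineResult result = p.run();
result.waitUntilFinish(); // block until the job reaches a terminal state
System.out.println("Pipeline finished in state: " + result.getState());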
Google Examples