Stream Processing Summary
In some ways, stream processing resembles batch processing, but it is done continuously on unbounded (never-ending) streams rather than on fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem.
There are at least two types of message brokers:
- AMQP/JMS-style message brokers:
- A broker assigns individual messages to consumers, and consumers acknowledge individual messages when they have been successfully processed.
- Messages are deleted from the broker once they have been acknowledged.
- Appropriate as an asynchronous form of RPC (also called message-passing dataflow), for example in a task queue, where the exact order of message processing is not important and there is no need to go back and read old messages again after they have been processed.
- Log-based message brokers:
- The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order.
- Parallelism is achieved through partitioning, and consumers track their progress by checkpointing the offset of the last message they have processed.
- The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary.
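A minimal sketch of the log-based model, using an in-memory list as the partition log and a dict as the checkpoint store. All names here are illustrative, not any particular broker's API:

```python
# Minimal sketch of a log-based partition: an append-only log plus
# consumer-tracked offsets. Illustrative names, not a real broker API.

partition_log = []   # append-only log for one partition
checkpoints = {}     # consumer group -> offset of next message to read

def produce(message):
    """Append a message; the broker never deletes it on acknowledgement."""
    partition_log.append(message)

def consume(consumer_group, batch_size=10):
    """Read messages in log order starting from the group's checkpoint."""
    offset = checkpoints.get(consumer_group, 0)
    return offset, partition_log[offset:offset + batch_size]

def commit(consumer_group, offset):
    """Checkpoint progress; recovery resumes from here, so messages after
    the last committed offset may be redelivered (at-least-once)."""
    checkpoints[consumer_group] = offset

produce({"user": "alice", "action": "click"})
produce({"user": "bob", "action": "view"})

offset, batch = consume("analytics")
for msg in batch:
    print(msg)                       # process each message in order
commit("analytics", offset + len(batch))

# Replaying old messages is just resetting the checkpoint and reading again:
commit("analytics", 0)
```

Contrast this with the AMQP/JMS style, where acknowledging a message deletes it from the broker and replay is impossible.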
The log-based approach has similarities to the replication logs found in databases and to log-structured storage engines. It is a good fit for stream processing systems that consume input streams and generate derived state or derived output streams.
A stream can come from many sources; it can also be useful to think of the writes to a db as a stream:
- We can capture the changelog (the history of all changes) either implicitly through change data capture or explicitly through event sourcing. Log compaction (sketched below) allows the stream to retain a full copy of the contents of the db.
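Log compaction can be sketched as keeping only the latest record per key, so the compacted log ends up holding the db's current contents. This is an illustrative sketch, not any specific broker's compaction algorithm:

```python
# Illustrative log compaction: keep only the most recent record per key.
# A record with value None is a tombstone marking a deletion.

changelog = [
    ("user:1", {"name": "alice"}),
    ("user:2", {"name": "bob"}),
    ("user:1", {"name": "alice", "email": "a@example.com"}),  # update
    ("user:2", None),                                         # delete
]

def compact(log):
    latest = {}
    for key, value in log:        # later records overwrite earlier ones
        latest[key] = value
    # Drop tombstones: deleted keys vanish from the compacted log entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

print(compact(changelog))
# [('user:1', {'name': 'alice', 'email': 'a@example.com'})]
```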
Representing dbs as streams opens up opportunities.
- You can keep derived data systems such as search indexes, caches, and analytics systems updated by consuming the log of changes and applying them to the derived system.
- You can build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present.
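For example, a fresh search index can be built by replaying the changelog from offset zero. The tiny inverted index below is a hypothetical derived system, used only to illustrate the replay pattern:

```python
# Build a fresh derived view (a tiny inverted index) by replaying the
# document changelog from the beginning. All names are illustrative.

from collections import defaultdict

changelog = [
    ("doc:1", "stream processing basics"),
    ("doc:2", "batch processing"),
    ("doc:1", "stream processing and joins"),  # update supersedes doc:1
]

def rebuild_index(log):
    docs = {}
    for key, text in log:          # latest write per key wins
        docs[key] = text
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.split():
            index[word].add(key)
    return index

index = rebuild_index(changelog)
print(sorted(index["processing"]))   # ['doc:1', 'doc:2']
print(sorted(index["joins"]))        # ['doc:1']
```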
The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. Stream processing serves several purposes:
- Searching for event patterns (complex event processing)
- Computing windowed aggregations (stream analytics), sketched after this list
- Keeping derived data systems up to date (materialized views)
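As a concrete example of a windowed aggregation, the sketch below counts events per tumbling one-minute window, bucketed by each event's own timestamp. The event format and window size are assumptions for illustration, not a framework API:

```python
# Tumbling-window aggregation: count events per one-minute window,
# bucketed by the event's timestamp (in seconds). Illustrative only.

from collections import Counter

WINDOW_SECONDS = 60

events = [
    {"ts": 1000, "type": "click"},
    {"ts": 1010, "type": "click"},
    {"ts": 1075, "type": "view"},    # falls into the next window
]

def window_counts(stream):
    counts = Counter()
    for event in stream:
        # Round the timestamp down to the start of its window.
        window_start = (event["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return counts

print(window_counts(events))   # Counter({960: 2, 1020: 1})
```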
We can also consider several types of joins that may appear in stream processors:
- Stream-stream joins: the join operator searches for related events that occur within some window of time in one or more streams.
- Stream-table joins (stream + db changelog): the changelog keeps a local copy of the db up to date; for each activity event, the join operator queries the local copy and outputs an enriched activity event (a minimal sketch follows this list).
- Table-table joins (db changelog + db changelog): every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
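Here is a minimal stream-table join sketch: one consumer applies db changelog events to a local table copy, and each activity event is enriched from that copy. All names (`local_users`, `apply_changelog`, `enrich`) are hypothetical:

```python
# Stream-table join sketch: a changelog consumer keeps a local copy of the
# users table up to date, and each activity event is enriched from it.

local_users = {}   # local replica of the users table, fed by the changelog

def apply_changelog(change):
    """Apply one db change event to the local table copy."""
    key, row = change
    if row is None:
        local_users.pop(key, None)   # tombstone: row was deleted
    else:
        local_users[key] = row

def enrich(activity):
    """Join one activity event against the local table copy."""
    user = local_users.get(activity["user_id"], {})
    return {**activity, "user": user}

apply_changelog(("u1", {"name": "alice", "plan": "pro"}))
apply_changelog(("u2", {"name": "bob", "plan": "free"}))

event = {"user_id": "u1", "action": "search", "query": "stream joins"}
print(enrich(event))
# {'user_id': 'u1', 'action': 'search', 'query': 'stream joins',
#  'user': {'name': 'alice', 'plan': 'pro'}}
```

Querying a local copy instead of the remote db keeps the join fast and avoids overloading the database, at the cost of keeping the replica in sync via the changelog.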
Finally, to achieve fault tolerance and exactly-once semantics in stream processors, we need to discard the partial output of any failed tasks. However, since streams are long-running and produce output continuously, we cannot simply rerun the whole job; we need finer-grained recovery mechanisms based on microbatching, checkpointing, transactions, or idempotent writes.
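Idempotent writes are the simplest of these to sketch: key each output by a deterministic event ID, so that reprocessing after a failure overwrites rather than duplicates. The store and IDs below are assumptions for illustration:

```python
# Idempotent writes sketch: keying output by a deterministic event id makes
# retries safe, because a repeated write overwrites instead of duplicating.

output_store = {}

def write_result(event_id, result):
    """Writing the same (event_id, result) twice leaves the store unchanged,
    so a task can safely be restarted from its last checkpoint."""
    output_store[event_id] = result

def process(event):
    return event["value"] * 2

event = {"id": "evt-42", "value": 21}

write_result(event["id"], process(event))   # first attempt
write_result(event["id"], process(event))   # retry after a simulated crash

print(output_store)   # {'evt-42': 42} -- one result despite the retry
```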