Data Transfer and Processing Overview - yibinericxia/documents GitHub Wiki

Overview

Data processing is common in applications that need to exchange data in real time or in batches. Data can come in various formats, such as short messages or event logs. Message-Oriented Middleware (MOM) provides a solution via a message broker.

There are three main processing approaches: message queues, stream processing, and batch jobs.

Message queue processing usually adopts the producer/consumer pattern, with one or more producers and one or more consumers, and guarantees that every message is delivered only once. Once a message is delivered, it is gone unless a backup mechanism is implemented so that the message can be put back into the queue. Durability and persistence are important features. Typical implementations include ActiveMQ and RabbitMQ.
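The producer/consumer pattern described above can be sketched with Python's standard `queue` and `threading` modules. This is a minimal in-memory illustration, not how ActiveMQ or RabbitMQ are implemented: the function name `run_queue_demo` and the `None` sentinel are assumptions made for the example, and durability and persistence are not modeled.

```python
import queue
import threading


def run_queue_demo(messages, num_consumers=2):
    """Hand each message to exactly one of several competing consumers.

    Hypothetical helper for illustration; `None` is reserved as a stop signal,
    so it cannot be used as a message payload here.
    """
    q = queue.Queue()
    delivered = []            # (consumer_id, message) pairs
    lock = threading.Lock()   # protects `delivered` across threads

    def consumer(consumer_id):
        while True:
            msg = q.get()     # removes the message from the queue
            if msg is None:   # sentinel: no more work for this consumer
                q.task_done()
                break
            with lock:
                delivered.append((consumer_id, msg))
            q.task_done()

    workers = [
        threading.Thread(target=consumer, args=(i,))
        for i in range(num_consumers)
    ]
    for w in workers:
        w.start()

    for m in messages:        # producer side
        q.put(m)
    for _ in workers:         # one sentinel per consumer
        q.put(None)

    q.join()                  # wait until every item has been processed
    for w in workers:
        w.join()
    return delivered
```

Because `q.get()` removes the item, a message taken by one consumer is never seen by another, which mirrors the "delivered, then gone" behavior. Which consumer receives which message is not deterministic.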

Event stream processing is normally real-time processing of a continuous data flow for time-critical events. It differs from message queue processing in that it uses the pub/sub model to process data organized as log files or topics, and it outputs results in various formats. New subscribers can access the data from any point in time. Common examples are Apache Kafka and Apache Pulsar.
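The difference from a queue can be illustrated with a small append-only log in which records are retained after being read and each subscriber tracks its own position. This is a simplified in-memory sketch under stated assumptions: `TopicLog` and `Subscriber` are hypothetical names, positions are plain list offsets rather than timestamps, and none of this is the Kafka or Pulsar client API.

```python
class TopicLog:
    """Append-only record log; reading never removes records."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset=0):
        """Return all records from `offset` to the end of the log."""
        return list(self._records[offset:])


class Subscriber:
    """Tracks its own read position, independent of other subscribers."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.offset = start_offset

    def poll(self):
        """Return records not yet seen by this subscriber and advance."""
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records
```

A short usage example: a subscriber that joins late can still start from the beginning of the log or from any later offset.

```python
log = TopicLog()
for event in ["e1", "e2", "e3"]:
    log.publish(event)

from_start = Subscriber(log, start_offset=0)
from_middle = Subscriber(log, start_offset=2)
print(from_start.poll())   # all three events
print(from_middle.poll())  # only the events from offset 2 onward
```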

Batch job processing does not necessarily require real-time or near-real-time processing. Batch jobs can run on a schedule over data collected over time. A job may take a long time because of large data volumes or complex analysis. Spring Batch is a good example.
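A batch job over collected data is often structured as a read, process, write loop over fixed-size chunks. The following is a minimal sketch of that idea in Python, loosely inspired by chunk-oriented processing but not Spring Batch's API; the names `chunks` and `run_batch_job` and the default chunk size are assumptions for the example.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(records, process, write, chunk_size=100):
    """Process collected records chunk by chunk.

    `process` transforms one record; `write` receives a list of processed
    records per chunk. Returns the number of records handled.
    """
    total = 0
    for chunk in chunks(records, chunk_size):
        write([process(r) for r in chunk])
        total += len(chunk)
    return total
```

Working in chunks keeps memory use bounded when the collected data is large, and gives a natural unit for committing or retrying work. Scheduling (for example, running the job nightly) is left to an external scheduler in this sketch.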

Popular approaches

Besides the offerings from the big three cloud providers (AWS SQS, Microsoft Azure Service Bus, Google Cloud Pub/Sub), the following is a list of popular free approaches: