Data Processing - dtoinagn/flyingbird.github.io GitHub Wiki
Data is the foundation of many modern applications, from fraud detection to real-time analytics. Processing data efficiently is critical, and there are three primary approaches.
Batch Processing
- Processes data in chunks (batches) at scheduled intervals
- Efficient for large volumes of data that don't need real-time updates
- Common tools: Apache Spark, AWS Glue, Google Dataflow (batch mode)
- Best for: Data warehousing, reporting, historical analysis
- Example: A retail company runs a daily batch job to update sales reports at midnight
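The daily sales-report example above can be sketched as a minimal batch job in plain Python (the store names and amounts are hypothetical; a production job would run this logic in Spark or Glue on a schedule):

```python
from collections import defaultdict

def run_daily_batch(transactions):
    """Aggregate a full day's accumulated transactions into per-store totals.

    Mimics a scheduled batch job: all records for the period are already
    collected, then processed together in one pass.
    """
    totals = defaultdict(float)
    for tx in transactions:
        totals[tx["store"]] += tx["amount"]
    return dict(totals)

# Hypothetical day's worth of accumulated records
day = [
    {"store": "north", "amount": 120.0},
    {"store": "south", "amount": 75.5},
    {"store": "north", "amount": 30.0},
]
print(run_daily_batch(day))  # {'north': 150.0, 'south': 75.5}
```

Note the defining trait of batch: nothing is computed until the whole chunk is available, which is why results lag by up to one scheduling interval.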
Stream Processing
- Processes data continuously as it arrives
- Ideal for real-time applications that require immediate insight
- Common tools: Apache Kafka, Apache Flink, Apache Pulsar, Spark Streaming, AWS Kinesis Data Streams, AWS Kinesis Firehose
- Best for: Fraud detection, real-time dashboards, IoT monitoring
- Example: A bank detects fraudulent transactions in real time and blocks them instantly
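The fraud-detection example can be sketched as a stream processor that emits a decision per event as it arrives, rather than waiting for a batch window. The threshold rule and field names are illustrative assumptions; in production the feed would be a Kafka or Kinesis consumer:

```python
def stream_fraud_check(events, threshold=1000.0):
    """Process each transaction the moment it arrives.

    Yields a block/allow decision per event immediately, instead of
    accumulating records for a later batch pass.
    """
    for event in events:
        flagged = event["amount"] > threshold  # simplistic stand-in for a real fraud model
        yield {"id": event["id"], "blocked": flagged}

# Simulated live feed (a real system would consume from Kafka/Kinesis)
feed = iter([
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 5000.0},
])
for decision in stream_fraud_check(feed):
    print(decision)
```

Because the generator yields one result per input, latency is bounded by per-event work, not by a scheduling interval.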
Hybrid Processing
- Combines batch and streaming processing to get the best of both worlds
- Ideal for complex systems that need both historical analysis and real-time insight
- Common patterns: Lambda Architecture (separate batch and speed layers), Kappa Architecture (streaming-only, with replayable logs)
- Example: A financial service company uses a hybrid system to detect fraud in real-time and generate daily reports
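A Lambda-style serving layer can be sketched as merging two views: a batch view precomputed overnight and a speed view kept current by the streaming layer. The account key and values below are hypothetical:

```python
def serve_query(batch_view, speed_view, key):
    """Lambda Architecture serving layer: answer a query by combining
    the precomputed batch view with the incremental real-time view."""
    return batch_view.get(key, 0.0) + speed_view.get(key, 0.0)

batch_view = {"acct-7": 900.0}   # built by last night's batch job
speed_view = {"acct-7": 45.0}    # maintained by today's streaming layer
print(serve_query(batch_view, speed_view, "acct-7"))  # 945.0
```

The design trade-off is maintaining the same logic in two layers; the Kappa Architecture avoids this by recomputing everything from a replayable event log.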