Data Processing - dtoinagn/flyingbird.github.io GitHub Wiki
Data is the foundation of many modern applications, from fraud detection to real-time analytics. Processing data efficiently is critical, and there are three primary approaches.
Batch Processing
- Processes data in chunks (batches) at scheduled intervals
- Efficient for large volumes of data that don't need real-time updates
- Common tools: Apache Spark, AWS Glue, Google Dataflow (batch mode)
- Best for: Data warehousing, reporting, historical analysis
- Example: A retail company runs a daily batch job to update sales reports at midnight
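The daily sales-report example above can be sketched as a minimal batch job in plain Python (the store names and amounts are hypothetical; a production job would run this logic in Spark or Glue on a schedule):

```python
from collections import defaultdict

def run_daily_batch(transactions):
    """Aggregate a full day's accumulated transactions into per-store totals.

    Mimics a scheduled batch job: all records for the period are already
    collected, then processed together in one pass.
    """
    totals = defaultdict(float)
    for tx in transactions:
        totals[tx["store"]] += tx["amount"]
    return dict(totals)

# Hypothetical day's worth of accumulated records
day = [
    {"store": "north", "amount": 120.0},
    {"store": "south", "amount": 75.5},
    {"store": "north", "amount": 30.0},
]
print(run_daily_batch(day))  # {'north': 150.0, 'south': 75.5}
```

Note the defining trait of batch: nothing is computed until the whole chunk is available, which is why results lag by up to one scheduling interval.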
Stream Processing
- Processes data continuously as it arrives
- Ideal for real-time applications that require immediate insight
- Common tools: Apache Kafka, Apache Flink, Apache Pulsar, Spark Streaming, AWS Kinesis Data Streams, AWS Kinesis Firehose
- Best for: Fraud detection, real-time dashboards, IoT monitoring
- Example: A bank detects fraudulent transactions in real time and blocks them instantly
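The fraud-detection example can be sketched as a stream processor that emits a decision per event as it arrives, rather than waiting for a batch window. The threshold rule and field names are illustrative assumptions; in production the feed would be a Kafka or Kinesis consumer:

```python
def stream_fraud_check(events, threshold=1000.0):
    """Process each transaction the moment it arrives.

    Yields a block/allow decision per event immediately, instead of
    accumulating records for a later batch pass.
    """
    for event in events:
        flagged = event["amount"] > threshold  # simplistic stand-in for a real fraud model
        yield {"id": event["id"], "blocked": flagged}

# Simulated live feed (a real system would consume from Kafka/Kinesis)
feed = iter([
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 5000.0},
])
for decision in stream_fraud_check(feed):
    print(decision)
```

Because the generator yields one result per input, latency is bounded by per-event work, not by a scheduling interval.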
Hybrid Processing
- Combines batch and streaming processing to get the best of both worlds
- Ideal for complex systems that need both historical analysis and real-time insight
- Common patterns: Lambda Architecture (separate batch and speed layers), Kappa Architecture (streaming-only, with replayable logs)
- Example: A financial service company uses a hybrid system to detect fraud in real-time and generate daily reports
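A Lambda-style serving layer can be sketched as merging two views: a batch view precomputed overnight and a speed view kept current by the streaming layer. The account key and values below are hypothetical:

```python
def serve_query(batch_view, speed_view, key):
    """Lambda Architecture serving layer: answer a query by combining
    the precomputed batch view with the incremental real-time view."""
    return batch_view.get(key, 0.0) + speed_view.get(key, 0.0)

batch_view = {"acct-7": 900.0}   # built by last night's batch job
speed_view = {"acct-7": 45.0}    # maintained by today's streaming layer
print(serve_query(batch_view, speed_view, "acct-7"))  # 945.0
```

The design trade-off is maintaining the same logic in two layers; the Kappa Architecture avoids this by recomputing everything from a replayable event log.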