Data Processing

Data is the foundation of many modern applications, from fraud detection to real-time analytics. Processing data efficiently is critical, and there are three primary approaches.

Batch Processing

  • Processes data in chunks (batches) at scheduled intervals
  • Efficient for large volumes of data that don't need real-time updates
  • Common tools: Apache Spark, AWS Glue, Google Dataflow (batch mode)
  • Best for: Data warehousing, reporting, historical analysis
  • Example: A retail company runs a daily batch job to update sales reports at midnight (see the sketch after this list)
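
A minimal sketch of such a daily batch job, using PySpark since Apache Spark is listed above. The input path, column names, and output location are placeholder assumptions, not details from this page.

```python
# Minimal PySpark batch job sketch: aggregate one day's sales into a report.
# Paths, column names, and the schema are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read the previous day's raw sales records (assumed CSV with a header row).
sales = spark.read.csv("s3://example-bucket/sales/2024-01-01/",
                       header=True, inferSchema=True)

# Aggregate revenue and order counts per store for the reporting layer.
report = (
    sales.groupBy("store_id")
         .agg(F.sum("amount").alias("total_revenue"),
              F.count("*").alias("order_count"))
)

# Overwrite the report partition consumed by downstream dashboards.
report.write.mode("overwrite").parquet(
    "s3://example-bucket/reports/daily_sales/2024-01-01/")

spark.stop()
```

The whole dataset is processed in one scheduled run, which is what makes this batch rather than streaming: latency is traded for throughput and simplicity.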

Stream Processing

  • Processes data continuously as it arrives
  • Ideal for real-time applications that require immediate insight
  • Common tools: Apache Kafka, Apache Flink, Apache Pulsar, Spark Streaming, AWS Kinesis Data Streams, AWS Kinesis Firehose
  • Best for: Fraud detection, real-time dashboards, IoT monitoring
  • Example: A bank detects fraudulent transactions in real time and blocks them instantly (a minimal consumer sketch follows this list)
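
A minimal sketch of that kind of streaming consumer, using the kafka-python client against the Apache Kafka tool named above. The topic name, broker address, and the simple amount-threshold "fraud rule" are assumptions for illustration only.

```python
# Minimal streaming-consumer sketch with kafka-python.
# Topic, broker address, and the threshold-based fraud rule are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",                       # assumed topic name
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

FRAUD_THRESHOLD = 10_000  # placeholder rule: flag unusually large amounts

# Each record is handled as soon as it arrives, not on a schedule.
for message in consumer:
    txn = message.value
    if txn.get("amount", 0) > FRAUD_THRESHOLD:
        # In a real system this would call a blocking/alerting service.
        print(f"Possible fraud: transaction {txn.get('id')} for {txn['amount']}")
```

The key contrast with the batch example is that the loop never "finishes": results are produced continuously with per-event latency.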

Hybrid Processing

  • Combines batch and streaming processing to get the best of both worlds
  • Ideal for complex systems that need both historical analysis and real-time insight
  • Common patterns: Lambda Architecture, Kappa Architecture
  • Example: A financial services company uses a hybrid system to detect fraud in real time and generate daily reports (see the sketch below)
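
To make the Lambda-style split concrete, the sketch below merges a precomputed batch view with a live speed-layer view at query time. The in-memory dictionaries, key names, and numbers are assumptions standing in for real batch and streaming stores.

```python
# Conceptual Lambda Architecture serving-layer sketch.
# The dictionaries stand in for a real batch store (e.g. a warehouse table)
# and a real speed layer (e.g. counters kept by a streaming job); all keys
# and values are hypothetical.

# Batch layer: totals recomputed from full history by last night's batch job.
batch_view = {"account_42": {"txn_count": 1_250, "flagged": 3}}

# Speed layer: increments accumulated from the stream since that batch run.
speed_view = {"account_42": {"txn_count": 17, "flagged": 1}}

def serve(account_id: str) -> dict:
    """Answer a query by combining the batch view with recent stream updates."""
    batch = batch_view.get(account_id, {"txn_count": 0, "flagged": 0})
    speed = speed_view.get(account_id, {"txn_count": 0, "flagged": 0})
    return {k: batch[k] + speed[k] for k in batch}

print(serve("account_42"))  # {'txn_count': 1267, 'flagged': 4}
```

A Kappa Architecture reaches the same goal differently: instead of maintaining a separate batch layer, it keeps everything as a stream and reprocesses history by replaying the log.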