Apache Flink vs Apache Spark

Apache Spark has been, and still is, a leader in batch and micro-batch (near-real-time) processing workloads. It provides the DStream abstraction, which represents a stream as a sequence of micro-batches of RDD data, but this is not a true record-at-a-time representation of a stream. In Spark, streaming is a special kind of processing built on top of batch.

  • RDDs are fault-tolerant: after a failure, lost partitions can be recomputed from their lineage
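
To make the difference between the two models concrete, here is a minimal plain-Python sketch. It does not use the Spark or Flink APIs; the function names (`micro_batches`, `process_batches`, `process_per_record`) are hypothetical, and batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into small batches (micro-batch model).

    Assumption: batches are cut by count here; a real micro-batch engine
    would cut them by a time interval.
    """
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def process_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch style: a result appears only once a whole batch is collected."""
    for batch in micro_batches(events, batch_size):
        yield [event * 2 for event in batch]


def process_per_record(events: Iterable[int]) -> Iterator[int]:
    """Streaming style: each record is transformed and emitted as it arrives."""
    for event in events:
        yield event * 2


if __name__ == "__main__":
    print(list(process_batches(range(5), batch_size=2)))
    print(list(process_per_record(range(5))))
```

In the micro-batch version the first record waits until its batch fills up before any output is produced, which is where the extra latency comes from. In the per-record version each result is available immediately.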

Apache Flink implements true stream processing from the ground up. For Flink, batch is a special kind of processing built on top of streaming (it does not use micro-batching).

  • Ideal for real streaming applications (complex stream processing)
  • Has its own custom memory management - see Flink GC management using Bits & Bytes
  • Lower latency and higher throughput
  • Windowing - a more powerful set of window operations compared to Spark
  • Exactly-once processing guarantees (a configuration switch can downgrade these to at-least-once)
  • Provides fault tolerance by taking consistent snapshots of the distributed data stream and operator state.
    • Snapshots are checkpoints that Flink can fall back to in case of failure
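
The checkpoint-and-fall-back idea above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not Flink's implementation: it models a single operator reading from a replayable source, while Flink coordinates snapshots across many distributed operators. The names `CountingOperator`, `run_with_checkpoints`, `checkpoint_every` and `fail_at` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class CountingOperator:
    """A stateful operator that counts occurrences per key."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1

    def snapshot(self) -> Dict[str, int]:
        # Copy the state so later updates do not alter the checkpoint.
        return dict(self.counts)

    def restore(self, state: Dict[str, int]) -> None:
        self.counts = dict(state)


def run_with_checkpoints(
    events: List[str], checkpoint_every: int, fail_at: Optional[int] = None
) -> Dict[str, int]:
    """Process events, checkpointing (source offset, operator state) together.

    If fail_at is set, one failure is simulated just before that offset is
    processed: the operator falls back to the last checkpoint and the source
    replays from the checkpointed offset.
    """
    op = CountingOperator()
    checkpoint: Tuple[int, Dict[str, int]] = (0, op.snapshot())
    offset = 0
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Roll back both the read position and the state, consistently.
            offset, state = checkpoint
            op.restore(state)
            continue
        op.process(events[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, op.snapshot())
    return op.counts


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a"]
    print(run_with_checkpoints(stream, checkpoint_every=2))
    print(run_with_checkpoints(stream, checkpoint_every=2, fail_at=3))
```

Because the source offset and the operator state are saved together, replaying after the failure does not double-count the records processed since the last checkpoint: the final counts match the failure-free run. Restoring only the offset or only the state would break that property.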