ICP 11: Apache Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

Objectives

  • Learn the basics of Apache Spark Streaming
  • Understand why streaming is needed in Apache Spark
  • Examine the Spark Streaming architecture
  • See how streaming works in Spark
  • Explore Spark streaming sources and the various streaming operations in Spark

What is Apache Spark Streaming?

A data stream is an unbounded sequence of data arriving continuously. Streaming divides this continuously flowing input into discrete units for further processing, and stream processing is the low-latency processing and analysis of such data.

Spark Streaming was added to Apache Spark in 2013 as an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Apache Flume, Amazon Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.

Internally, it works as follows: Spark Streaming receives live input data streams and divides them into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.

Its key abstraction is the Apache Spark Discretized Stream, or Spark DStream for short, which represents a stream of data divided into small batches. DStreams are built on Spark RDDs, Spark's core data abstraction, which allows Spark Streaming to integrate seamlessly with other Apache Spark components such as Spark MLlib and Spark SQL.
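As a concrete illustration, here is a minimal PySpark skeleton of this pattern (the app name, socket source, and batch interval are assumptions for illustration, not the exercise code): create a StreamingContext with a batch interval, build a DStream, and apply RDD-style transformations to each batch.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it
sc = SparkContext("local[2]", "DStreamDemo")

# Each batch covers a 5-second window of live data
ssc = StreamingContext(sc, 5)

# One DStream element per line received on the socket
lines = ssc.socketTextStream("localhost", 9999)

# DStream transformations are applied to the RDD of every batch
upper = lines.map(lambda line: line.upper())
upper.pprint()  # print the first elements of each batch

ssc.start()             # start receiving and processing
ssc.awaitTermination()  # run until stopped
```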

In-class exercises:

1. Spark Streaming using Log File Generator

We use a log generator Python script that reads some lines from a file (lorem.txt) and writes them to a separate log file in the 'log' folder; a sketch follows the link below.

Please click on the link to reach the source code.
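The linked script itself is not reproduced on this page; a minimal sketch of such a generator (lorem.txt and the log folder come from the exercise, while the loop structure, file naming, and timing are assumptions) might look like this:

```python
import os
import random
import time

# Read the source lines once (lorem.txt is the input named in the exercise)
with open("lorem.txt") as f:
    lines = f.readlines()

os.makedirs("log", exist_ok=True)

# Every few seconds, write a handful of random lines to a new log file
for i in range(10):
    sample = random.sample(lines, k=min(5, len(lines)))
    with open("log/log_{}_{}.txt".format(i, int(time.time())), "w") as out:
        out.writelines(sample)
    time.sleep(5)
```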

Each log file is then streamed into Spark with streaming.py.

Please click on the link to reach the source code.

Finally, the word count algorithm is applied to each line of every log file; a sketch of the full streaming job follows.
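A hedged sketch of the streaming side (the directory name log and the 5-second batch interval are assumptions) uses textFileStream, which picks up files newly created in a directory, and runs word count on each batch:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FileStreamWordCount")
ssc = StreamingContext(sc, 5)  # poll for new files every 5 seconds

# textFileStream monitors the directory and streams newly created files
lines = ssc.textFileStream("log")

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print the word counts of each batch

ssc.start()
ssc.awaitTermination()
```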

2. Spark Streaming for TCP Socket

a. Spark Streaming for TCP Socket Using NetCat

Please click on the link to reach the source code.
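The linked code is not shown here; the standard pattern for this exercise (host and port are assumptions) is socketTextStream combined with the same word count:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, 5)

# Connect to a TCP source; start one first with: nc -lk 9999
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```

With the job running, start NetCat with nc -lk 9999 in another terminal and type some lines; each batch prints its own word counts.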

b. Spark Streaming for TCP Socket Using File

Please click on the link to reach the source code.
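Here, instead of typing into NetCat, a small server pushes the lines of a file over the socket while the Spark side stays the same socketTextStream job as in (a). A minimal sketch (the file name, port, and pacing are assumptions):

```python
import socket
import time

# Serve the lines of a file to whichever client connects (e.g., Spark)
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9999))
server.listen(1)

conn, addr = server.accept()
with open("lorem.txt") as f:
    for line in f:
        conn.sendall(line.encode("utf-8"))
        time.sleep(1)  # pace the stream so it spans several batches
conn.close()
server.close()
```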

3. Spark Streaming for Character Frequency using TCP Socket (Bonus)

Please click on the link to reach the source code
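The bonus differs from word count only in the flatMap step: each line is split into characters instead of words. A sketch under the same socket assumptions as above:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CharFrequency")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)

# list(line) turns "abc" into ["a", "b", "c"], so we count characters
freqs = (lines.flatMap(lambda line: list(line))
              .map(lambda ch: (ch, 1))
              .reduceByKey(lambda a, b: a + b))

freqs.pprint()

ssc.start()
ssc.awaitTermination()
```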

References:

https://spark.apache.org/docs/latest/streaming-programming-guide.html

https://data-flair.training/blogs/apache-spark-streaming-tutorial/

https://www.edureka.co/blog/spark-streaming/