ICP 11: Apache Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

Objectives

  • Learn the basics of Apache Spark Streaming
  • Understand why streaming is needed in Apache Spark
  • Examine the Spark Streaming architecture
  • See how streaming works in Spark
  • Explore Spark streaming sources and the various streaming operations in Spark

What is Apache Spark Streaming?

A data stream is an unbounded sequence of data arriving continuously. Streaming divides this continuously flowing input into discrete units for further processing, and stream processing is the low-latency processing and analysis of such data.

Spark Streaming was added to Apache Spark in 2013 as an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Apache Flume, Amazon Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.

Internally, it works as follows: Spark Streaming receives live input data streams and divides them into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.

Its key abstraction is the Apache Spark Discretized Stream, or Spark DStream for short, which represents a stream of data divided into small batches. DStreams are built on Spark RDDs, Spark's core data abstraction, which allows Spark Streaming to integrate seamlessly with other Apache Spark components such as Spark MLlib and Spark SQL.
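As a concrete illustration, here is a minimal PySpark skeleton of this pattern (the app name, socket source, and batch interval are assumptions for illustration, not the exercise code): create a StreamingContext with a batch interval, build a DStream, and apply RDD-style transformations to each batch.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it
sc = SparkContext("local[2]", "DStreamDemo")

# Each batch covers a 5-second window of live data
ssc = StreamingContext(sc, 5)

# One DStream element per line received on the socket
lines = ssc.socketTextStream("localhost", 9999)

# DStream transformations are applied to the RDD of every batch
upper = lines.map(lambda line: line.upper())
upper.pprint()  # print the first elements of each batch

ssc.start()             # start receiving and processing
ssc.awaitTermination()  # run until stopped
```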

In-class exercises:

1. Spark Streaming using Log File Generator

We use a log generator Python script that reads some lines from a file (lorem.txt) and writes them to a separate log file in the 'log' folder; a sketch follows the link below.

Please click on the link to reach the source code.
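The linked script itself is not reproduced on this page; a minimal sketch of such a generator (lorem.txt and the log folder come from the exercise, while the loop structure, file naming, and timing are assumptions) might look like this:

```python
import os
import random
import time

# Read the source lines once (lorem.txt is the input named in the exercise)
with open("lorem.txt") as f:
    lines = f.readlines()

os.makedirs("log", exist_ok=True)

# Every few seconds, write a handful of random lines to a new log file
for i in range(10):
    sample = random.sample(lines, k=min(5, len(lines)))
    with open("log/log_{}_{}.txt".format(i, int(time.time())), "w") as out:
        out.writelines(sample)
    time.sleep(5)
```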

Each log file is then streamed into Spark with streaming.py.

Please click on the link to reach the source code.

Finally, the word count algorithm is applied to each line of every log file; a sketch of the full streaming job follows.
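A hedged sketch of the streaming side (the directory name log and the 5-second batch interval are assumptions) uses textFileStream, which picks up files newly created in a directory, and runs word count on each batch:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FileStreamWordCount")
ssc = StreamingContext(sc, 5)  # poll for new files every 5 seconds

# textFileStream monitors the directory and streams newly created files
lines = ssc.textFileStream("log")

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print the word counts of each batch

ssc.start()
ssc.awaitTermination()
```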

2. Spark Streaming for TCP Socket

a. Spark Streaming for TCP Socket Using NetCat

Please click on the link to reach the source code.
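The linked code is not shown here; the standard pattern for this exercise (host and port are assumptions) is socketTextStream combined with the same word count:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, 5)

# Connect to a TCP source; start one first with: nc -lk 9999
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```

With the job running, start NetCat with nc -lk 9999 in another terminal and type some lines; each batch prints its own word counts.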

b. Spark Streaming for TCP Socket Using File

Please click on the link to reach the source code.
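Here, instead of typing into NetCat, a small server pushes the lines of a file over the socket while the Spark side stays the same socketTextStream job as in (a). A minimal sketch (the file name, port, and pacing are assumptions):

```python
import socket
import time

# Serve the lines of a file to whichever client connects (e.g., Spark)
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9999))
server.listen(1)

conn, addr = server.accept()
with open("lorem.txt") as f:
    for line in f:
        conn.sendall(line.encode("utf-8"))
        time.sleep(1)  # pace the stream so it spans several batches
conn.close()
server.close()
```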

3. Spark Streaming for Character Frequency using TCP Socket (Bonus)

Please click on the link to reach the source code
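The bonus differs from word count only in the flatMap step: each line is split into characters instead of words. A sketch under the same socket assumptions as above:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CharFrequency")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)

# list(line) turns "abc" into ["a", "b", "c"], so we count characters
freqs = (lines.flatMap(lambda line: list(line))
              .map(lambda ch: (ch, 1))
              .reduceByKey(lambda a, b: a + b))

freqs.pprint()

ssc.start()
ssc.awaitTermination()
```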

References:

https://spark.apache.org/docs/latest/streaming-programming-guide.html

https://data-flair.training/blogs/apache-spark-streaming-tutorial/

https://www.edureka.co/blog/spark-streaming/