Spark Streaming 1

Aim:

To use the Spark Streaming API for continuous data processing and to apply Spark Streaming to real-time examples.

Introduction:

A data stream is an unbounded sequence of data arriving continuously. Streaming divides the continuously flowing input data into discrete units for further processing. Stream processing is the low-latency processing and analysis of streaming data.

Spark Streaming, added to Apache Spark in 2013, is an extension of the core Spark API that provides scalable, high-throughput and fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Apache Flume, Amazon Kinesis or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, the processed data can be pushed out to filesystems, databases and live dashboards.

Internally, it works as follows: Spark Streaming receives the live input data streams and divides them into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.

Its key abstraction is the Apache Spark Discretized Stream or, in short, a Spark DStream, which represents a stream of data divided into small batches. DStreams are built on Spark RDDs, Spark’s core data abstraction. This allows Spark Streaming to integrate seamlessly with other Apache Spark components such as Spark MLlib and Spark SQL.
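As a minimal sketch of this batch-based model (the local master, the socket source on localhost:9999 and the 2-second batch interval are illustrative assumptions, not taken from this write-up):

```python
# Minimal PySpark sketch: a StreamingContext with a 2-second batch interval
# chops the live socket stream into a DStream, i.e. a sequence of RDDs,
# one RDD per 2-second micro-batch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamDemo")
ssc = StreamingContext(sc, 2)                     # batch interval = 2 seconds

lines = ssc.socketTextStream("localhost", 9999)   # assumed socket source
lines.pprint()                                    # show the first elements of each batch RDD

ssc.start()
ssc.awaitTermination()
```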

To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:

  • Streaming data is received from data sources (e.g. live logs, system telemetry data, IoT device data, etc.) into a data ingestion system like Apache Kafka, Amazon Kinesis, etc.

  • The data is then processed in parallel on a cluster.
  • Results are given to downstream systems like HBase, Cassandra, Kafka, etc.

There is a set of worker nodes, each of which runs one or more continuous operators. Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.

Data is received from the ingestion systems via source operators and handed off to downstream systems via sink operators.

Features:

a) Fast Failure and Straggler Recovery

In real time, the system must be able to recover quickly and automatically from failures and stragglers to provide results, which is challenging in traditional systems because of the static allocation of continuous operators to worker nodes.

b) Load Balancing

In a continuous operator system, uneven allocation of the processing load between the workers can cause bottlenecks. The system needs to be able to dynamically adapt the resource allocation based on the workload.

c) Unification of Streaming, Batch and Interactive Workloads

In many use cases, it is also attractive to query the streaming data interactively, or to combine it with static datasets (e.g. pre-computed models). This is hard in continuous operator systems, which are not designed to accommodate new operators for ad-hoc queries. It requires a single engine that can combine batch, streaming and interactive queries.

d) Advanced Analytics with Machine learning and SQL Queries

Complex workloads require continuously learning and updating data models, or even querying the streaming data with SQL queries. Having a common abstraction across these analytic tasks makes the developer’s job much easier.

How Spark Streaming works:

Spark Streaming divides the data stream into batches called DStreams, which internally are sequences of RDDs. The RDDs are processed using Spark APIs, and the results are returned in batches. Spark Streaming provides an API in Scala, Java, and Python. The Python API was introduced in Spark 1.2 and still lacks some features. Spark Streaming can maintain state based on the data arriving in a stream; these are called stateful computations. It also supports window operations, i.e. it allows the developer to specify a time frame and perform operations on the data that flows within that window. A window has a sliding interval, which is the interval at which the window is updated.
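The window and sliding interval can be sketched as follows; the 2-second batch interval, 10-second window and 4-second slide (both must be multiples of the batch interval), as well as the socket source, are illustrative assumptions:

```python
# Sketch of a window operation: count words over the last 10 seconds,
# recomputed every 4 seconds, on top of 2-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowDemo")
ssc = StreamingContext(sc, 2)            # batch interval = 2 s
ssc.checkpoint("checkpoint")             # required for the inverse-reduce form below

pairs = (ssc.socketTextStream("localhost", 9999)   # assumed source
            .flatMap(lambda line: line.split(" "))
            .map(lambda w: (w, 1)))

windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # reduce values entering the window
    lambda a, b: a - b,   # "inverse" reduce for values leaving the window
    10,                   # window length (seconds)
    4)                    # sliding interval (seconds)

windowed_counts.pprint()
ssc.start()
ssc.awaitTermination()
```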

Spark Streaming sources:

Every input DStream (except file streams) is associated with a Receiver object, which receives the data from a source and stores it in Spark’s memory for processing. There are two categories of built-in streaming sources:

  • Basic sources – Sources directly available in the StreamingContext API, for example file systems and socket connections.
  • Advanced sources – Sources like Kafka, Flume, Kinesis, etc., which are available through extra utility classes and require linking against extra dependencies.

There are two types of receivers, based on their reliability:

  • Reliable Receiver – A reliable receiver correctly sends an acknowledgment to the source once the data has been received and stored in Spark with replication.
  • Unreliable Receiver – An unreliable receiver does not send an acknowledgment to the source. This can be used for sources where one does not want or need the complexity of acknowledgments.
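The two basic sources can be created directly from the StreamingContext, as in the sketch below (the directory and host/port are assumptions for illustration; advanced sources such as Kafka would additionally need their extra utility classes and dependencies):

```python
# Sketch of the two basic source types available directly on StreamingContext;
# the directory and host/port below are assumptions for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SourcesDemo")
ssc = StreamingContext(sc, 5)

# Socket source: backed by a receiver that stores incoming data in Spark's memory.
socket_stream = ssc.socketTextStream("localhost", 9999)

# File source: monitors a directory for new files; note that it does not use a receiver.
file_stream = ssc.textFileStream("/tmp/streaming-input")

socket_stream.count().pprint()
file_stream.count().pprint()

ssc.start()
ssc.awaitTermination()
```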

Spark Streaming operators:

i. Transformation Operations

Similar to Spark RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are: map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), window().
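A few of these transformations chained together, as a hedged sketch (the socket source and the word-length filter are arbitrary choices for illustration):

```python
# Sketch chaining a few common DStream transformations, including the
# stateful updateStateByKey; the socket source and length filter are
# arbitrary choices for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TransformDemo")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("checkpoint")                            # needed by updateStateByKey

lines  = ssc.socketTextStream("localhost", 9999)
words  = lines.flatMap(lambda line: line.split(" "))    # flatMap
longer = words.filter(lambda w: len(w) > 3)             # filter
pairs  = longer.map(lambda w: (w, 1))                   # map
counts = pairs.reduceByKey(lambda a, b: a + b)          # reduceByKey (per batch)

# updateStateByKey: running total per word across all batches so far.
def update_count(new_values, running_total):
    return sum(new_values) + (running_total or 0)

totals = pairs.updateStateByKey(update_count)

counts.pprint()
totals.pprint()
ssc.start()
ssc.awaitTermination()
```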

ii. Output Operations

A DStream’s data is pushed out to external systems such as databases or file systems using output operations. Since the external systems consume the transformed data produced by the output operations, the output operations trigger the actual execution of all the DStream transformations. The currently defined output operations are: print(), saveAsTextFiles(prefix, [suffix]) (files are named "prefix-TIME_IN_MS[.suffix]"), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]) and foreachRDD(func). Hence, DStreams, like RDDs, are executed lazily by the output operations; specifically, the received data is processed by the RDD actions inside the DStream output operations. By default, output operations execute one at a time, in the order they are defined in the application.
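A sketch of a few output operations in PySpark, where print() corresponds to pprint(); the output prefix, suffix and port are assumptions:

```python
# Sketch of output operations: pprint (PySpark's equivalent of print()),
# saveAsTextFiles and foreachRDD; the output prefix/suffix and port are assumed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "OutputDemo")
ssc = StreamingContext(sc, 5)

counts = (ssc.socketTextStream("localhost", 9999)
             .flatMap(lambda line: line.split(" "))
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))

counts.pprint()                                   # print the first elements of each batch
counts.saveAsTextFiles("out/wordcounts", "txt")   # files named out/wordcounts-TIME_IN_MS.txt

# foreachRDD exposes the underlying RDD of each batch, e.g. to push it
# to an external store; here it just logs the batch size on the driver.
def handle_batch(time, rdd):
    print("Batch at %s contained %d records" % (time, rdd.count()))

counts.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTermination()
```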

Tools:

  • PyCharm
  • IntelliJ IDEA
  • Spark
  • Python
  • Scala
  • NetCat

Implementation:

Task 1:

Spark Streaming using a log file generator:

  • Here we use a random generator to generate log files.

  • We use netcat to stream the data through the streaming context.

  • We use socket port 8085 to connect to the Spark streaming context.

  • We use Spark Streaming to calculate the word frequency in the streamed text (a minimal sketch follows this list).
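A minimal PySpark sketch of this word-count job, assuming the generated log lines are fed to a netcat listener on port 8085 (e.g. nc -lk 8085) before the job is submitted:

```python
# Word frequency over text streamed by netcat on port 8085 (nc -lk 8085),
# assuming the log-file generator pipes its output into that netcat session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "LogWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batches (assumed interval)

lines  = ssc.socketTextStream("localhost", 8085)   # socket 8085 as in the task
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()                                    # word frequency for each batch

ssc.start()
ssc.awaitTermination()
```

Each 5-second batch then prints the word counts for the lines received in that interval.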

Limitations:

  • No file management system – Apache Spark has no file management system of its own and needs to be integrated with other platforms.
  • No true real-time data processing – streams are handled as micro-batches.
  • Expensive.
  • Small files issue.
  • Latency.
  • Fewer built-in algorithms.
  • Iterative processing.
  • Window criteria.

Conclusion:

Applied Spark Streaming in various contexts and calculated word frequency in streamed text using the streaming context.

References: