Big_Data_Programming_ICP_4_Module2

Spark Streaming

Aim :

  • Spark Streaming using Log File Generator.
  • Write a Spark Streaming word count program that counts text received from a data server listening on a TCP socket.
  • Spark Streaming for Character Frequency using TCP Socket.

Task 1 : Spark Streaming using Log File Generator.

Here, lorem.txt is used as the input, and file.py generates log files from this text file.

Input :

Output :

First run streaming.py, and then run file.py, which generates the log files and saves them in the log folder. Sketches of both scripts are shown below.
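The scripts themselves are not included on this page, so the following is a minimal sketch of how they could look. The file names (file.py, streaming.py) and the log folder come from the description above; the sleep time, batch interval, and the choice of a word count as the streaming job are assumptions, and the actual ICP code may differ.

```python
# file.py -- sketch of a log-file generator (sleep interval is an assumption)
import os
import time

SOURCE = "lorem.txt"   # input text file
LOG_DIR = "log"        # folder monitored by streaming.py

os.makedirs(LOG_DIR, exist_ok=True)

with open(SOURCE) as src:
    lines = src.readlines()

# Write a new log file every few seconds so the streaming job picks it up
for i, line in enumerate(lines):
    with open(os.path.join(LOG_DIR, f"log_{i}.txt"), "w") as out:
        out.write(line)
    time.sleep(3)
```

```python
# streaming.py -- sketch of a streaming word count over files appearing in the log folder
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "LogFileStreaming")
ssc = StreamingContext(sc, 5)   # 5-second batch interval (assumption)

# Monitor the "log" directory; each new file becomes part of a batch
# (an absolute path may be needed depending on the environment)
lines = ssc.textFileStream("log")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```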

Task 2 : Spark Streaming Word Count using a TCP Socket.

Input :

Output :

First, start a netcat server listening on port 6000 from the command prompt, and then run the wordcount.py program.

Then, any lines typed into the terminal running the netcat server will be split into words, counted, and printed as output every second. A sketch of such a program is shown below.
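For reference, here is a minimal sketch of a socket word count, following the Spark Streaming programming guide linked in the references. The host (localhost), port 6000, and the 1-second batch interval come from the description above; the actual wordcount.py may differ. The netcat server can be started with, for example, `nc -lk 6000` (or `ncat -lk 6000` from Nmap on Windows).

```python
# wordcount.py -- sketch of a word count over a TCP socket (host is an assumption)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)   # 1-second batch interval

# Connect to the netcat server listening on port 6000
lines = ssc.socketTextStream("localhost", 6000)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```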

Bonus: Spark Streaming for Character Frequency using TCP Socket.

This is similar to the previous task; the only difference is that instead of counting words, here we count the individual characters.

Input :

Output:

First, start a netcat server listening on port 6000 from the command prompt, and then run the characterfreq.py program.

Then, the characters in any lines typed into the terminal running the netcat server will be counted and printed as output every second. A sketch is shown below.
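A minimal sketch of the character-frequency variant: it differs from the word count only in the flatMap step, which splits each line into individual characters instead of words. The host, port, and batch interval are the same assumptions as in Task 2, and the actual characterfreq.py may differ.

```python
# characterfreq.py -- sketch of character frequency over a TCP socket (host is an assumption)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CharacterFrequency")
ssc = StreamingContext(sc, 1)   # 1-second batch interval

lines = ssc.socketTextStream("localhost", 6000)
# flatMap each line into its individual characters instead of words
counts = (lines.flatMap(lambda line: list(line))
               .map(lambda ch: (ch, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```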

References :

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html