# ICP_Module2_Assignment_4
Madhuri Sarode : 24
## Spark Streaming and Data Analysis
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
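As a concrete illustration of the map, reduce, and window operations mentioned above, here is a minimal sketch in Scala using the classic DStream API. The host, port, batch interval, and window/slide durations are placeholder values chosen for the example, not taken from the assignment:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second batches

    // Ingest: lines arriving on a TCP socket (host/port are placeholders)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transform: split lines into words, then count over a
    // 30-second window that slides every 10 seconds
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                                            Seconds(30), Seconds(10))

    // Output: push the results to a sink (here, simply the console)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```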
## Part 1 : Spark Streaming using Log File Generator

A log file generator writes text files into a log directory, and those files are subsequently streamed by a separate Spark Streaming process. The generator creates a new file every 5 seconds.

Input : the text files written into the monitored log directory by the generator.

Output : the word counts computed from each newly created file, printed per batch. A sketch of both programs follows.
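The page does not reproduce the assignment's actual code, so the following is a minimal sketch of both halves, assuming the generator writes into a local `log/` directory (the directory name, file names, and sample content are placeholders): one program creates a new text file every 5 seconds, and a Spark Streaming job watches the directory with `textFileStream` and counts the words in each new file.

```scala
import java.io.{File, PrintWriter}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Half 1: write a new text file into log/ every 5 seconds
object LogFileGenerator {
  def main(args: Array[String]): Unit = {
    val dir = new File("log")
    if (!dir.exists()) dir.mkdirs()
    for (i <- 1 to 10) {
      val writer = new PrintWriter(new File(dir, s"log_$i.txt"))
      writer.println("sample spark streaming input line")   // placeholder content
      writer.close()
      Thread.sleep(5000)                                    // 5-second delay between files
    }
  }
}

// Half 2: stream the directory and count words in each newly created file
object LogFileWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LogFileWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Only files added to the directory after the job starts are picked up;
    // depending on the Hadoop configuration a full file:// URI may be needed
    val lines  = ssc.textFileStream("log")
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```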
## Part 2 : Spark Streaming for TCP Socket

The code listens on a TCP socket for incoming text and implements the word-count logic.

Input : the input is typed into the console using netcat (nc) on TCP port 9999.

Output : the output shows that the words streamed from the console are counted in each batch, as in the sketch below.
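For reference, here is a minimal version of such a listener, closely following the standard NetworkWordCount example from the Spark Streaming documentation; it may differ from the assignment's actual code, which is not reproduced on this page:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))       // 1-second batches

    // Connect to the netcat server started with: nc -lk 9999
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                          // word counts per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it, start the source first with `nc -lk 9999`, then submit the streaming job; words typed into the netcat console appear counted in the next batch's output.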
## Bonus : Spark Streaming for Character Frequency using TCP Socket

The code listens on a TCP socket and counts the frequency of each character received.

Input : the input is typed into the console using netcat (nc) on TCP port 9999.

Output : the output shows that the characters streamed from the console are counted, giving a per-character frequency for each batch.
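The bonus only changes the tokenization step of Part 2: each line is split into individual characters instead of words before counting. A minimal sketch of that change follows; dropping whitespace characters is an assumption made here, not something stated on this page:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CharFrequency {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CharFrequency")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Same socket source as Part 2: nc -lk 9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Tokenize into single characters instead of words (skipping whitespace)
    val chars  = lines.flatMap(_.toCharArray.filterNot(_.isWhitespace).map(_.toString))
    val counts = chars.map(c => (c, 1)).reduceByKey(_ + _)
    counts.print()                                          // character frequencies per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```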