ICP 4 module II

Data Streaming: Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream. Streaming technologies are becoming increasingly important with the growth of the Internet.

Data Streaming Features:

- Scaling: Spark Streaming can easily scale to hundreds of nodes.
- Speed: It achieves low latency.
- Fault Tolerance: Spark can efficiently recover from failures.
- Integration: Spark integrates batch and real-time processing.
- Business Analysis: Spark Streaming is used to track customer behavior, which can be used in business analysis.

InClass Exercise:

1. Spark Streaming using Log File Generator:

Spark Streaming using the log file generator. Follow the instructions in the slides.
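A minimal sketch of one possible log generator, assuming log lines are written as plain-text files into a local `log/` directory; the directory name, log format, and 5-second interval are assumptions, not taken from the slides:

```python
import os
import random
import time
import uuid

# Assumed output directory watched by the Spark Streaming job; adjust to match the ICP setup.
LOG_DIR = "log"

LEVELS = ["INFO", "WARN", "ERROR", "DEBUG"]
MESSAGES = ["user login", "user logout", "file uploaded", "file deleted", "page viewed"]

def generate_log_file(num_lines=20):
    """Write one file with random log lines into LOG_DIR and return its path."""
    os.makedirs(LOG_DIR, exist_ok=True)
    path = os.path.join(LOG_DIR, f"log_{uuid.uuid4().hex}.txt")
    with open(path, "w") as f:
        for _ in range(num_lines):
            level = random.choice(LEVELS)
            message = random.choice(MESSAGES)
            f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {level} {message}\n")
    return path

if __name__ == "__main__":
    # Emit a new log file every few seconds so the streaming job keeps getting fresh input.
    while True:
        print("wrote", generate_log_file())
        time.sleep(5)
```

Keep the generator running in one terminal so new files continue to appear while the streaming job (shown further down) is running.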

Log generator input:

Log generator output:

Spark Streaming using the generated log files in the log directory:
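A minimal PySpark DStream word count over the generated log directory; the `local[2]` master, 10-second batch interval, and `log` path are assumptions and should be adjusted to match your environment:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Assumed settings: local mode with two threads and 10-second micro-batches.
sc = SparkContext("local[2]", "LogFileStreaming")
ssc = StreamingContext(sc, 10)

# textFileStream only picks up NEW files added to the directory after the job starts.
lines = ssc.textFileStream("log")

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Because `textFileStream` only processes files created after the job starts, launch this job first (or keep the generator running) so that new log files arrive while it is active.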

Output:

2. Spark Streaming for TCP Socket:

Write a Spark Streaming word count program that counts the words in text received from a data server listening on a TCP socket. Hint: for a Netcat utility on Windows, see https://github.com/rsanchez-wsu/jfiles/wiki/Windows-10-Telnet-&-NetCat
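A minimal sketch of the socket word count, assuming the data server is Netcat listening on localhost port 9999; the host, port, and 5-second batch interval are assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Assumed settings: local mode, 5-second micro-batches, netcat on localhost:9999.
sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, 5)

# Connect to the TCP data server; each received line becomes one record.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Start the listener first (for example `nc -lk 9999` on Linux/macOS, or the Windows NetCat from the hint above), then submit the job; words typed into the netcat session should appear as (word, count) pairs for each batch.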

Localhost streaming input:

Output:

Limitations:

Setting up the project directory in PyCharm was difficult.

Video Link: https://drive.google.com/file/d/1RnVFN8VuP_PdSR3Y2iv7DBlbEmCiOZXP/view?usp=sharing