ICP 11 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki
Lesson Plan 11: Apache Spark Streaming
1. Spark Streaming using Log File Generator:
Take a text file to generate log files.
Create a log folder in the same path as the source code, then run file.py. This program creates 30 log files in the log folder, one after another at an interval of 5 seconds, storing the content of lorem.txt in each log file.
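As a sketch of what file.py does (the function name and file naming here are assumptions, not taken from the source), the log generator could look like:

```python
import time
from pathlib import Path

def generate_logs(source, log_dir, count=30, interval=5.0):
    """Write the contents of `source` into `count` numbered log files,
    one file every `interval` seconds."""
    content = Path(source).read_text()
    out = Path(log_dir)
    out.mkdir(exist_ok=True)  # the log folder next to the source code
    for i in range(count):
        (out / f"log_{i}.txt").write_text(content)
        time.sleep(interval)

# e.g. generate_logs("lorem.txt", "log")  -> 30 files, 5 seconds apart
```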
Simultaneously, run streaming.py, which picks up the generated log files and performs a word count: each line is split on the space delimiter, each word is mapped to (word, 1), and reduceByKey sums the counts per word.
Example:
If the line is
hi this is shiva hi
Then the output at the mapper is
(hi,1) (this,1) (is,1) (shiva,1) (hi,1)
Then the output at the reducer is
(hi,2) (this,1) (is,1) (shiva,1)
Output after streaming looks like
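The mapper/reducer steps above can be sketched in plain Python (outside Spark) to show what the flatMap, map, and reduceByKey stages do with a line:

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split each line on the space delimiter into words
    words = [w for line in lines for w in line.split(" ")]
    # map: pair each word with the count 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for identical words
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["hi this is shiva hi"]))
# {'hi': 2, 'this': 1, 'is': 1, 'shiva': 1}
```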
2. Spark Streaming for TCP Socket:
Write a Spark Streaming word count program that processes text received from a data server listening on a TCP socket.
Install the netcat package from the link below; the extraction password is nc. https://joncraton.org/files/nc111nt.zip
Put all the extracted files into a folder and add that folder's path to the system PATH variable.
Now open the command prompt and run nc -l -p 8090 (the port number can be any port of our choice).
We create a streaming context with 2 threads and a batch interval of 5 seconds. A DStream is created that connects to hostname:port. Each line is split into words, a word count is performed on each batch, and the corresponding output is returned. Execute wordcount.py and, simultaneously in the command prompt, run "nc -l -p port number" and provide the input; the word count is then performed on that input and the corresponding output is displayed.
Use the same port number in wordcount.py. Inside the file, provide the path of the Spark home, and install findspark using pip install findspark (or via auto import).
The output is generated after the mapping and reducing phases.
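A plain-Python sketch (not PySpark) of the per-batch behaviour, where each inner list stands for one 5-second batch of lines received on the socket; as in the DStream program, counts are computed per batch and do not carry over:

```python
from collections import Counter

def process_batch(lines):
    # split every line on spaces and count the words of this batch only
    return Counter(w for line in lines for w in line.split(" ") if w)

# the example batch contents below are made up for illustration
batches = [["hello spark streaming"], ["hello hello netcat"]]
for lines in batches:
    print(dict(process_batch(lines)))
# {'hello': 1, 'spark': 1, 'streaming': 1}
# {'hello': 2, 'netcat': 1}
```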
3. Spark Streaming for Character Frequency using TCP Socket:
As in the previous task, we create a streaming context with 2 threads and a batch interval of 5 seconds, and a DStream that connects to hostname:port. Execute the program and, simultaneously in the command prompt, run "nc -l -p port number" and provide the input.
The input given in the command prompt is split into lines, which are in turn split into words separated by spaces. Each word is then mapped to its length, and the reducer groups the words of the same length together and gives the output.
Example:
If the input is
hi i am shiva
The output at the mapper is
(2, hi) (1, i) (2, am) (5, shiva) // here 2 is the length of the word hi
The output at the reducer is
(2, (hi, am)) (1, i) (5, shiva) // here hi and am have the same length
Output looks like
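The length-grouping above can be sketched in plain Python (a stand-in for mapping each word to its length and grouping by key in Spark):

```python
from collections import defaultdict

def group_by_length(line):
    # map: pair each word with its length -> (len(word), word)
    pairs = [(len(w), w) for w in line.split(" ") if w]
    # group by key: collect the words that share the same length
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)
    return dict(groups)

print(group_by_length("hi i am shiva"))
# {2: ['hi', 'am'], 1: ['i'], 5: ['shiva']}
```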