ICP11 - manaswinivedula/Big-Data-Programming GitHub Wiki
Task 1 Spark streaming using log file generator
- First, run file.py. It creates 8 log text files (.txt) from lorem.txt in the log directory.
- A new file is created every 5 seconds.
- Once the files are being generated, run stream.py to print the word count output on the console.
The following is the code of file.py
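
A minimal sketch of such a generator, not necessarily the author's exact file.py, could look like this. The function name `generate_logs`, the `log{i}.txt` file naming, and the 10-line random sample per file are assumptions made for illustration; only the 8 files, the 5-second interval, lorem.txt, and the log directory come from the description above.

```python
import os
import random
import time


def generate_logs(source_path="lorem.txt", log_dir="log", count=8, interval=5.0):
    """Write `count` log files built from random lines of `source_path`."""
    with open(source_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    if not lines:
        raise ValueError("source file has no non-empty lines")

    os.makedirs(log_dir, exist_ok=True)
    written = []
    for i in range(1, count + 1):
        # Assumption: files are named log1.txt, log2.txt, ...
        path = os.path.join(log_dir, f"log{i}.txt")
        # Assumption: each log file holds up to 10 randomly chosen source lines.
        sample = random.choices(lines, k=min(10, len(lines)))
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(sample) + "\n")
        written.append(path)
        if i < count:
            time.sleep(interval)  # wait before producing the next file
    return written
```

Calling `generate_logs()` with the defaults would take roughly 35 seconds (seven pauses of 5 seconds) and leave 8 files in the log directory.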
The following is the code for stream.py
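
As a hedged sketch of what a file-based streaming word count can look like in PySpark (the helper names `split_words`, `count_words`, and `main`, the application name, and the 5-second batch interval are assumptions, not taken from the original stream.py):

```python
def split_words(line):
    """Split one line of text into whitespace-separated words."""
    return line.split()


def count_words(lines):
    """Turn a DStream of lines into a DStream of (word, count) pairs per batch."""
    return (lines.flatMap(split_words)
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))


def main(log_dir="log", batch_seconds=5):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "LogFileWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    # textFileStream watches the directory and reads files that appear in it.
    count_words(ssc.textFileStream(log_dir)).pprint()
    ssc.start()
    ssc.awaitTermination()

# main()  # uncomment to start streaming; requires pyspark
```

Note that `textFileStream` picks up files that appear in the directory while the stream is running, so the generator most likely needs to still be writing files when this script starts.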
The following is the input lorem.txt file
The following is one of the input log files.
The following is the output of file.py
The following is the word count output for the log files
Task 2 Spark streaming for TCP socket
- Here, we first create a streaming context with 2 threads and a batch interval of 5 seconds.
- A Spark socket stream is created that connects to localhost on port 4000.
- The bound socket waits for input on port 4000.
- Lines are split into words, the word count is computed for each batch, and the output is printed in the terminal once input is entered on port 4000.
- First, execute the Python file wordcount.py.
- At the same time, run the command "nc -lp 4000" in the command prompt and enter the input.
- The word count is then performed on the entered input and the output is displayed on the console.
The following is the code of the word count using TCP socket
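
The steps above can be sketched roughly as follows. This is an illustrative version rather than the original wordcount.py: the helper names `tokenize`, `build_counts`, and `main` and the application name are assumptions, while the 2 local threads, the 5-second batch interval, localhost, and port 4000 follow the description.

```python
def tokenize(line):
    """Split a line received from the socket into words."""
    return line.strip().split()


def build_counts(lines):
    """Map a DStream of lines to per-batch (word, count) pairs."""
    words = lines.flatMap(tokenize)
    return words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)


def main(host="localhost", port=4000, batch_seconds=5):
    # Imported lazily so the helpers can be exercised without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "SocketWordCount")  # 2 local threads
    ssc = StreamingContext(sc, batch_seconds)         # 5-second batches
    lines = ssc.socketTextStream(host, port)          # connect to the nc listener
    build_counts(lines).pprint()                      # print each batch's counts
    ssc.start()
    ssc.awaitTermination()

# main()  # uncomment to start streaming; requires pyspark and a listener on the port
```

Two local threads are used because a socket receiver occupies one of them, leaving the other free to process the batches.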
The following is the input entered in the command prompt
The following is the resulting output displayed on the console.
Bonus Spark streaming for character frequency using TCP socket
- Here, we first create a streaming context with 2 threads and a batch interval of 5 seconds.
- A Spark socket stream is created that connects to localhost on port 7000.
- The bound socket waits for input on port 7000.
- Each line is split into words, the length of each word is calculated in each batch, and the output is printed in the terminal once input is entered on port 7000.
- First, execute the Python file frequency.py.
- At the same time, run the command "nc -lp 7000" in the command prompt and enter the input.
- The word lengths are then calculated for the entered input and the output is displayed on the console.
The following is the code of the character frequency count using TCP socket
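
One possible shape for this script, offered as a sketch under assumptions rather than the original frequency.py: each word is paired with its character count as a `(length, word)` tuple. The helper name `word_lengths`, the tuple layout, and the application name are hypothetical; the port and batch interval follow the description.

```python
def word_lengths(line):
    """Return a (length, word) pair for every word in a line."""
    return [(len(word), word) for word in line.split()]


def main(host="localhost", port=7000, batch_seconds=5):
    # Imported lazily so word_lengths can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "SocketWordLength")
    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.socketTextStream(host, port)
    # Emit one (length, word) pair per word in every batch.
    lines.flatMap(word_lengths).pprint()
    ssc.start()
    ssc.awaitTermination()

# main()  # uncomment to start streaming; requires pyspark and a listener on the port
```

For example, `word_lengths("the cat jumped")` returns `[(3, 'the'), (3, 'cat'), (6, 'jumped')]`.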
The following is the input entered in the command prompt.
The following is the resulting output displayed on the console.