MODULE 2 ICP 4

PROBLEM STATEMENT

  1. Spark Streaming using a log file generator.
  2. A Spark word count program over a Spark Streaming stream received from a data server listening on a TCP socket.

FEATURES

  • For this in-class programming, PySpark, the Python API for Apache Spark, is used. Netcat is installed, environment variables (system variables) are set, and 'nc' is typed in the command prompt to confirm that Netcat runs successfully. The PyCharm IDE is used to run the code.

CONFIGURATIONS

  1. Python (version 2.7) has been installed and its variables are set.
  2. Spark is installed and its environment variables have been configured.
  3. winutils is installed and set up.
  4. Finally, pip is installed and all the necessary packages are configured in the PyCharm IDE.

APPROACH

  • Spark Streaming using a log file generator. To stream data with a log file generator, two Python programs are written and executed: one generates log files dynamically, and the other performs the actual streaming. The following screenshot depicts the code for the dynamic log file generator as well as its output, a series of log files.

  • The approach used here is very simple. A text file (in this case, the given lorem.txt) is read, and its contents are dynamically written into 30 log files generated as the code runs; a new log file is created every five seconds. A writefile() function writes the data from lorem.txt into each dynamically generated log file, as shown in the sketch below.
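
Since the original generator exists only as a screenshot, here is a minimal sketch of what file.py might look like. The log directory name, the log file naming scheme, and the location of lorem.txt are assumptions, not taken from the original code.

```python
# file.py - a minimal sketch of the dynamic log file generator (assumed names).
import os
import time

def writefile(text, index):
    # Write the contents of lorem.txt into a freshly generated log file.
    path = os.path.join("log", "log_%d.txt" % index)
    with open(path, "w") as out:
        out.write(text)

if __name__ == "__main__":
    if not os.path.exists("log"):   # assumed output directory
        os.makedirs("log")
    with open("lorem.txt") as f:    # the given input file
        data = f.read()
    for i in range(30):             # 30 log files, one every five seconds
        writefile(data, i)
        time.sleep(5)
```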

  • Right after file.py is executed, the second Python file, streaming.py, needs to be executed. The following screenshots depict the code as well as the output produced by running streaming.py.

  • In this streaming code, the environment variable paths for Spark and winutils are specified. A StreamingContext is created with a numeric batch interval that sets how often streaming batches are processed, and the directory to stream from (in this case, log) is specified. Finally, map and reduceByKey operations are performed: the map step splits the input data into a set of words, and reduceByKey computes the word count over the log files already generated. A sketch follows.
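
As above, the original streaming.py appears only in screenshots; the following is a minimal sketch of the directory-streaming word count. The Windows install paths, app name, and five-second batch interval are placeholders.

```python
# streaming.py - a minimal sketch of the directory-streaming word count.
import os
os.environ["SPARK_HOME"] = "C:\\spark"      # assumed Spark install path
os.environ["HADOOP_HOME"] = "C:\\winutils"  # assumed winutils path

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="LogStreamWordCount")
ssc = StreamingContext(sc, 5)  # five-second batch interval

# Monitor the "log" directory; each new file becomes part of the stream.
lines = ssc.textFileStream("log")
counts = (lines.flatMap(lambda line: line.split(" "))  # split lines into words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # sum counts per word
counts.pprint()

ssc.start()
ssc.awaitTermination()
```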

  • Write a Spark word count program over a stream received from a data server listening on a TCP socket. The word count program is run in PySpark; the input text is typed into the command prompt after starting Netcat with nc -l -p 5000 in the CLI. A SparkContext is set up along with a StreamingContext, and streaming is performed with a five-second interval between batches. A data stream is created and connected to the given host and port. Finally, the data is split into words and the word count is computed using the map and reduceByKey() functions, as in the sketch below. The screenshots show snippets of the code, the use of Netcat, and the words entered via the CLI.
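
Below is a minimal sketch of the socket-based word count. The host and port match the netcat command above (localhost, 5000); the app name is an assumption.

```python
# network_wordcount.py - a minimal sketch of the socket word count.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="NetworkWordCount")
ssc = StreamingContext(sc, 5)  # five-second batch interval

# Connect to the data server started with `nc -l -p 5000`.
lines = ssc.socketTextStream("localhost", 5000)
counts = (lines.flatMap(lambda line: line.split(" "))  # split lines into words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # sum counts per word
counts.pprint()  # print the per-batch word counts

ssc.start()
ssc.awaitTermination()
```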
