Module 2: ICP #5 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki

Team: 12
Professor: Yugyung Lee

Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub

Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub

Objective

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
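As a rough illustration of what "discretized" means, the plain-Python sketch below groups timestamped records into fixed batch intervals. Conceptually, this is how a DStream exposes a continuous stream as a sequence of small batches (RDDs in Spark). The function name `discretize` and the sample events are hypothetical. This is not Spark code, and it ignores receivers, fault tolerance, and parallelism.

```python
def discretize(events, batch_seconds):
    """Group (timestamp, record) pairs into consecutive batch intervals.

    Returns a list of batches; batch i holds the records whose timestamp
    falls in [i * batch_seconds, (i + 1) * batch_seconds).
    """
    batches = {}
    for timestamp, record in events:
        index = int(timestamp // batch_seconds)
        batches.setdefault(index, []).append(record)
    if not batches:
        return []
    # Intervals with no records still appear, as empty batches.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


# Hypothetical sample: four records arriving over about eleven seconds.
sample_events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (11.0, "d")]
sample_batches = discretize(sample_events, batch_seconds=5)
```

With a 5-second interval, the first three records fall into the first batch, the second batch is empty, and the last record forms the third batch.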

Features

  1. Spark Streaming
  2. Log file generation and streaming run results
  3. Spark Streaming word count over text received from a data server listening on a TCP socket
  4. Use of the Netcat utility

Steps:

Part 1: Spark Streaming

Run files.py
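The contents of files.py are not reproduced on this page. A minimal sketch of a log file generator, assuming the script's job is to drop small text files into a directory watched by the streaming job, could look like the following. The function name `generate_log_files`, the directory names, and the sample text are all hypothetical. Each file is written in a staging directory first and then moved, because the Spark guide asks for files to be moved or renamed atomically into the monitored directory.

```python
import os
import time


def generate_log_files(watch_dir, staging_dir, count, lines, delay=0.0):
    """Create `count` text files in `watch_dir`, one every `delay` seconds.

    Each file is fully written in `staging_dir` and then moved, so the
    streaming job never sees a half-written file. Both directories are
    assumed to live on the same filesystem.
    """
    os.makedirs(watch_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)
    created = []
    for i in range(count):
        name = "log{}.txt".format(i)
        staged = os.path.join(staging_dir, name)
        with open(staged, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        final = os.path.join(watch_dir, name)
        os.replace(staged, final)  # atomic on the same filesystem
        created.append(final)
        time.sleep(delay)
    return created
```

For example, `generate_log_files("log", "log_staging", 30, ["spark streaming reads new files"], delay=5.0)` would add one file every five seconds for about two and a half minutes.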

Output:

Run streaming.py simultaneously
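The streaming.py script itself is not shown here. A sketch of a file-based word count, assuming the job watches a directory named `log` with a 5-second batch interval, might look like this. The helper names, directory, interval, and application name are assumptions. The Spark calls used (`textFileStream`, `flatMap`, `map`, `reduceByKey`, `pprint`) are part of the PySpark streaming API.

```python
import re


def split_words(line):
    """Lowercase a line and return its words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def build_word_counts(lines):
    """Turn a DStream of lines into a DStream of (word, count) pairs per batch."""
    return (lines.flatMap(split_words)
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))


def main(watch_dir="log", batch_seconds=5):
    # pyspark is imported here so the helpers above work without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "FileWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.textFileStream(watch_dir)  # picks up files added after start
    build_word_counts(lines).pprint()      # print the first counts of each batch
    ssc.start()
    ssc.awaitTermination()

# To start the job, call main() with pyspark installed.
```

Because the file stream only reacts to files that appear while the job is running, the generator has to keep producing files at the same time, which matches the "simultaneously" instruction above.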

Running Results:

Part 2: Spark Streaming for TCP Socket
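A sketch of the socket-based word count, modeled on the network word count example in the Spark Streaming programming guide, is given below. It assumes a data server on `localhost` port 9999, which the guide starts with `nc -lk 9999` (Netcat flags can differ between builds and platforms). The helper name `split_on_spaces`, the host, the port, and the 1-second batch interval are assumptions rather than the values used in this ICP.

```python
def split_on_spaces(line):
    """Split a line on single spaces and drop empty tokens."""
    return [word for word in line.split(" ") if word]


def main(host="localhost", port=9999, batch_seconds=1):
    # pyspark is imported here so split_on_spaces can be tried without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # local[2]: one thread for the socket receiver, one for processing.
    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, batch_seconds)

    lines = ssc.socketTextStream(host, port)
    counts = (lines.flatMap(split_on_spaces)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

# To start the job, launch the data server first, then call main().
```

Each line typed into the Netcat terminal is sent over the socket, and the counts for that batch are printed by the job once the batch interval elapses.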

Part 3 (Bonus): Spark Streaming for Character Frequency using TCP Socket
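For the character-frequency variant, one possible sketch replaces the word splitter with a character splitter and uses the DStream `countByValue` method to count occurrences within each batch. Ignoring whitespace and lowercasing the input are assumptions made here, as are the helper name `split_chars`, the host, the port, and the batch interval.

```python
def split_chars(line):
    """Return the non-whitespace characters of a line, lowercased."""
    return [ch for ch in line.lower() if not ch.isspace()]


def main(host="localhost", port=9999, batch_seconds=1):
    # pyspark is imported here so split_chars can be tried without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkCharFrequency")
    ssc = StreamingContext(sc, batch_seconds)

    lines = ssc.socketTextStream(host, port)
    # countByValue yields (character, count) pairs for each batch.
    frequencies = lines.flatMap(split_chars).countByValue()
    frequencies.pprint()

    ssc.start()
    ssc.awaitTermination()

# To start the job, launch the data server first, then call main().
```

The counts are per batch, not cumulative. Keeping a running total across batches would need a stateful operation such as `updateStateByKey` with a checkpoint directory.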

References:

  1. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  2. https://github.com/rsanchez-wsu/jfiles/wiki/Windows-10-Telnet-&-NetCat (Netcat utility for Windows)
  3. https://www.edureka.co/blog/spark-streaming/
  4. https://www.go4expert.com/articles/netcat-t26082/
  5. https://www.quora.com/Is-netcat-on-default-Mac-OS-X-installs
  6. https://superuser.com/questions/115553/netcat-on-mac-os-x