Module 2: ICP #5 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
Objective
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
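To make the idea of a discretized stream concrete, here is a small plain-Python sketch (no Spark involved) that groups timestamped records into fixed-width batches, which is roughly how a DStream presents a continuous stream as a sequence of small datasets. The function name `discretize` and the sample events are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def discretize(records, batch_seconds):
    """Group (timestamp, value) records into consecutive fixed-width batches.

    Returns a list of (batch_start, values) pairs ordered by batch_start.
    Intervals that received no records are omitted here for brevity.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        # Every record falls into the interval that contains its timestamp.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return sorted(batches.items())


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.5, "a"), (1.2, "b"), (5.1, "c"), (9.9, "d"), (10.0, "e")]
for start, values in discretize(events, batch_seconds=5):
    print(start, values)
```

Each printed line corresponds to one batch interval; in Spark Streaming the same per-interval processing is applied to every batch as it arrives.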
Features
- Spark Streaming
- Log file generation and running results
- Spark Streaming word count over text received from a data server listening on a TCP socket
- Use of the Netcat utility
Steps:
Part 1: Spark Streaming
Run files.py
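The source of files.py is not reproduced on this page. As a rough sketch of what a log file generator for this step might look like, the code below writes small text files and moves them into a watched directory at a fixed interval. The vocabulary, file names, directory names, and function names are all assumptions made for illustration. Each file is completed in a staging directory and then renamed, because Spark's file stream expects files to appear in the monitored directory atomically.

```python
import os
import random
import time

# Hypothetical vocabulary; the real files.py may produce different content.
WORDS = ["spark", "streaming", "dstream", "batch", "socket", "log"]


def make_lines(n_lines, words_per_line, rng):
    """Build n_lines of space-separated words drawn at random from WORDS."""
    return [" ".join(rng.choice(WORDS) for _ in range(words_per_line))
            for _ in range(n_lines)]


def write_log_file(staging_dir, target_dir, index, lines):
    """Write one log file in staging_dir, then move it into target_dir."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(target_dir, exist_ok=True)
    name = "log_{:03d}.txt".format(index)
    staged = os.path.join(staging_dir, name)
    with open(staged, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    final = os.path.join(target_dir, name)
    # The rename is atomic when both directories share a filesystem.
    os.replace(staged, final)
    return final


def generate(staging_dir, target_dir, n_files=10, interval=2.0, seed=None):
    """Drop n_files log files into target_dir, pausing interval seconds after each."""
    rng = random.Random(seed)
    paths = []
    for index in range(n_files):
        lines = make_lines(n_lines=5, words_per_line=8, rng=rng)
        paths.append(write_log_file(staging_dir, target_dir, index, lines))
        time.sleep(interval)
    return paths
```

Calling `generate("staging", "log")` would produce ten files in a `log` directory, one every two seconds, which gives a streaming job enough time to pick them up batch by batch.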
Output:
Run streaming.py simultaneously
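The source of streaming.py is likewise not shown here. A minimal sketch of a file-based streaming word count, assuming the generator writes into a directory named `log` and that a 5-second batch interval is acceptable, could look like the following. The directory name, batch interval, application name, and helper names are assumptions; `textFileStream`, `flatMap`, `map`, `reduceByKey`, and `pprint` are standard PySpark DStream calls.

```python
from operator import add


def split_words(line):
    """Split one log line into words on whitespace."""
    return line.split()


def build_word_counts(lines):
    """Turn a DStream of lines into a DStream of (word, count) pairs per batch."""
    return lines.flatMap(split_words).map(lambda word: (word, 1)).reduceByKey(add)


def run(log_dir="log", batch_seconds=5):
    """Watch log_dir for new files and print the word counts of each batch."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingLogWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    counts = build_word_counts(ssc.textFileStream(log_dir))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Calling `run()` starts the job. Only files that arrive in the directory after the job has started are picked up, which is why the generator and the streaming job are run at the same time.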
Running Results:
Part 2: Spark Streaming for TCP Socket
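The code for this part is not included on the page. The sketch below follows the network word count pattern from the Spark Streaming programming guide listed in the references, assuming the data server runs on `localhost` port `9999` and a 1-second batch interval; the host, port, interval, and helper names are assumptions. A master of `local[2]` is used because one thread is occupied by the socket receiver and at least one more is needed for processing.

```python
def to_pairs(line):
    """Map one line of text to a list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def count_words(lines):
    """Turn a DStream of lines into per-batch (word, count) pairs."""
    return lines.flatMap(to_pairs).reduceByKey(lambda a, b: a + b)


def run(host="localhost", port=9999, batch_seconds=1):
    """Connect to a TCP text server and print word counts for each batch."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    counts = count_words(ssc.socketTextStream(host, port))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

To try it, start Netcat as the data server in one terminal with `nc -lk 9999` (the form used in the Spark guide; option names can differ between Netcat builds), call `run()` in another, and type lines into the Netcat terminal.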
Part 3 (Bonus): Spark Streaming for Character Frequency using TCP Socket
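No code is shown for the bonus part either. One way to sketch a character frequency count over the same kind of TCP source is to split each line into characters instead of words. The version below assumes whitespace should be ignored and that case is preserved; those choices, along with the host, port, and helper names, are assumptions rather than details taken from the original solution.

```python
from operator import add


def split_chars(line):
    """Return the non-whitespace characters of a line, one per element."""
    return [ch for ch in line if not ch.isspace()]


def count_chars(lines):
    """Turn a DStream of lines into per-batch (character, count) pairs."""
    return lines.flatMap(split_chars).map(lambda ch: (ch, 1)).reduceByKey(add)


def run(host="localhost", port=9999, batch_seconds=1):
    """Connect to a TCP text server and print character counts for each batch."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkCharFrequency")
    ssc = StreamingContext(sc, batch_seconds)
    counts = count_chars(ssc.socketTextStream(host, port))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Counts are computed independently for each batch; keeping a running total across batches would need a stateful operation such as `updateStateByKey` together with a checkpoint directory.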
References:
- https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
- https://github.com/rsanchez-wsu/jfiles/wiki/Windows-10-Telnet-&-NetCat (Netcat utility for Windows)
- https://www.edureka.co/blog/spark-streaming/
- https://www.go4expert.com/articles/netcat-t26082/
- https://www.quora.com/Is-netcat-on-default-Mac-OS-X-installs
- https://superuser.com/questions/115553/netcat-on-mac-os-x