Project Proposal - neerajpadarthi/Big-Data-Programming GitHub Wiki

Project Title: Spark ETL and Sentiment Analysis

Team Members:

  • Neeraj Padarthi - 19
  • Hiresh Jakkala - 11
  • Hari Y - 29

Objective

In today's data-driven world, handling data well has become vital to decision-making in industries such as telecom, banking, finance, and healthcare. The main challenges are managing the sheer volume of data and extracting insights from it. Apache Spark is one framework that can process big data in near real time and support many kinds of analysis.

The main idea of our project is to build an ETL process with Spark Streaming and apply machine learning to the resulting real-time data. Our data source is Twitter: using the Twitter Streaming API, we will collect tweets in near real time for a defined set of keywords. We will then apply transformations to the streamed RDDs and load the results into Hive, following the basic extract-transform-load pattern. We will also perform exploratory data analysis (EDA) on the Twitter data to capture its context. Finally, the project will include a sentiment analysis system that assigns a sentiment to each tweet in real time and identifies the keywords that most influence whether a tweet is classified as positive or negative.
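To make the transform and sentiment steps more concrete, here is a minimal plain-Python sketch of what could happen to each micro-batch of tweets. It is an illustration under assumptions, not the project's implementation: the proposal does not name a model, so a simple word-list (lexicon) scorer stands in for the machine learning step, and the names KEYWORDS, POSITIVE, NEGATIVE, matches_keywords, score_tweet, process_batch, and top_drivers are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical tracked keywords and sentiment word lists (illustrative only).
KEYWORDS = {"spark", "hive"}
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "hate", "broken"}


def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches_keywords(text):
    """Keep only tweets that mention at least one tracked keyword."""
    return any(token in KEYWORDS for token in tokenize(text))


def score_tweet(text):
    """Transform a raw tweet into a row with a sentiment label.

    The score is (positive hits - negative hits); the matched words are
    kept so the factors behind each label can be inspected later.
    """
    tokens = tokenize(text)
    pos_words = [t for t in tokens if t in POSITIVE]
    neg_words = [t for t in tokens if t in NEGATIVE]
    score = len(pos_words) - len(neg_words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return {
        "text": text,
        "sentiment": label,
        "score": score,
        "pos_words": pos_words,
        "neg_words": neg_words,
    }


def process_batch(tweets):
    """Filter one batch by keyword, then score every remaining tweet."""
    return [score_tweet(t) for t in tweets if matches_keywords(t)]


def top_drivers(rows, sentiment, n=3):
    """Count the lexicon words that appear most often in tweets with the given label."""
    field = "pos_words" if sentiment == "positive" else "neg_words"
    counts = Counter(
        word for row in rows if row["sentiment"] == sentiment for word in row[field]
    )
    return counts.most_common(n)


if __name__ == "__main__":
    batch = [
        "Spark is great and fast",
        "I hate how slow Hive is",
        "Nice weather today",
    ]
    rows = process_batch(batch)
    for row in rows:
        print(row["sentiment"], row["score"], row["text"])
    print(top_drivers(rows, "negative"))
```

In a Spark Streaming job, the same two functions could plausibly be applied to the tweet stream with the DStream `filter` and `map` operations before the rows are written to Hive. The connection setup and the Hive write are left out here because the proposal does not specify them.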