Spark Word Count
Spark
Aim:
To apply transformations and actions to an input text file in the MapReduce paradigm, and to apply a secondary sort to input text with operations in the reduce phase.
Introduction:
Apache Spark is an open-source, general-purpose cluster computing system that provides high-level APIs in Java, Scala, Python, and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any other Hadoop data source, and it can run under the Standalone, YARN, and Mesos cluster managers. Spark itself is written in Scala but exposes rich APIs in all four languages. For in-memory workloads Spark can be up to 100 times faster than Hadoop MapReduce, and about 10 times faster when accessing data from disk. Other engines are specialized: Hadoop MapReduce can only perform batch processing, Apache Storm / S4 only stream processing, Apache Impala / Apache Tez only interactive processing, and Neo4j / Apache Giraph only graph processing. Hence there is a big demand in industry for a single powerful engine that can process data in real time (streaming) as well as in batch mode, respond in sub-second latencies, and perform in-memory processing. Apache Spark is such an engine: it provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing with high speed, ease of use, and a standard interface. This is what sets it apart in comparisons such as Hadoop vs. Spark and Spark vs. Storm.
Software components used:
- Apache Spark
- IntelliJ IDEA
- JDK
- Scala
Tasks:
Task1:
Write a Spark program with an interesting use case using text data as the input; the program should have at least two Spark transformations and two Spark actions.
Created an input file and placed it in the input directory of the wordcount project.
In the wordcount.scala file, created the SparkConf setup and assigned the resulting context to val sc. Read the input file using sc.textFile(). Split the text into individual words and stored them in the variable words.
Applied map and reduce operations and stored the result in val count. Ordered the separated words in ascending order and printed the first three using the take action.
Separated each letter, counted how often each letter occurs in the text file, and printed the result.
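A minimal sketch of what this word count and letter count might look like (the input path input/input.txt, whitespace splitting, and the exact print formatting are assumptions, not the original listing):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkConf/SparkContext setup, assigned to val sc as described above
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the input file and split each line into words (transformation: flatMap)
    val words = sc.textFile("input/input.txt")
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty)

    // Map each word to (word, 1) and sum the counts (transformations: map, reduceByKey)
    val count = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Action: print the word count by key
    count.collect().foreach(println)

    // Action: order the words in ascending order and print the first three with take
    words.sortBy(word => word).take(3).foreach(println)

    // Separate each letter and count how often it occurs in the text
    val letterCount = words
      .flatMap(word => word.toList)
      .map(letter => (letter, 1))
      .reduceByKey(_ + _)

    // Action: print the letter count by key
    letterCount.collect().foreach(println)

    sc.stop()
  }
}
```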
Outputs:
Letter count by key:
Wordcount by key:
Task2:
Secondary sort on the sort.txt file, which is in the input directory.
sort.txt text file:
Below is the code for the secondary sort.
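A minimal sketch of how such a secondary sort could be written, assuming sort.txt contains comma-separated rows with the year in the first column, the month in the second, and the value to be sorted in the fourth; the groupByKey-plus-sort approach below is a simplification of a partitioner-based secondary sort:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SecondarySort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumed layout of sort.txt: comma-separated rows with the year in
    // column 1, the month in column 2, and the value to sort in column 4
    val rows = sc.textFile("input/sort.txt").map(_.split(","))

    // Build a composite "year-month" key and keep column 4 as the value
    val keyed = rows.map(cols => (s"${cols(0)}-${cols(1)}", cols(3).trim.toInt))

    // Aggregate the column-4 values for each year-month key
    val grouped = keyed.groupByKey()

    // Primary sort on the year-month key, then a secondary sort on the
    // aggregated column-4 values, joined with a comma separator
    val sorted = grouped
      .sortByKey()
      .mapValues(values => values.toList.sorted.mkString(","))

    // Action: print each key with its sorted, comma-separated values
    sorted.collect().foreach { case (key, values) => println(s"$key\t$values") }

    sc.stop()
  }
}
```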
Outputs:
Output1:
Aggregated the year and month with a "-" separator and column 4 with a comma separator:
Aggregated and sorted the rows based on year, then applied sorting on the fourth column of the sorted data.
Limitations of Spark:
- No file management system of its own (it relies on HDFS, S3, or another storage layer).
- No true real-time data processing (Spark Streaming works on micro-batches).
- Expensive, since in-memory processing requires a large amount of RAM.
- Small files issue.
- Latency.
- Fewer built-in algorithms (e.g., in MLlib).
Conclusion: Performed various transformations and actions using Apache Spark.