ICP 08 : Apache Spark Introduction - acikgozmehmet/BigDataProgramming GitHub Wiki

Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Installation:

http://allaboutscala.com/tutorials/chapter-1-getting-familiar-intellij-ide/scala-tutorial-first-hello-world-application/

https://www.youtube.com/watch?v=FbwSbsOYqSE


Spark Programming

1. Counting Trending Hashtags

The problem is to count the hashtags in Twitter data and write the results to an output directory.

Algorithm:

The algorithm applies transformations such as flatMap and map to perform the Map function of the MapReduce paradigm, and the reduceByKey transformation to perform the Reduce function. It then performs an action to save the result to a file.
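The data flow described above can be sketched without a Spark cluster. The following is a minimal plain-Python illustration of the flatMap, map, reduceByKey, and save steps, not the code used in this ICP. The function names, the rule that a hashtag is any token starting with `#`, and the output ordering are assumptions made for the example.

```python
from collections import defaultdict


def count_hashtags(tweets):
    """Return {hashtag: count} for an iterable of tweet texts."""
    # flatMap: one tweet becomes many words, flattened into a single stream
    words = (word for tweet in tweets for word in tweet.split())
    # Assumption: a hashtag is any whitespace-separated token starting with '#'
    hashtags = (word for word in words if word.startswith("#") and len(word) > 1)
    # map: emit a (hashtag, 1) pair for every occurrence
    pairs = ((tag, 1) for tag in hashtags)
    # reduceByKey: add up the 1s that share the same hashtag
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)


def to_output_lines(counts):
    """Format counts as 'hashtag<TAB>count' lines."""
    # Assumption: most frequent first, ties broken alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{tag}\t{count}" for tag, count in ordered]


def save_counts(counts, path):
    """Action step: write the formatted result to a file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(to_output_lines(counts)) + "\n")
```

For the two sample tweets `"#Spark is fast #BigData"` and `"learning #Spark"`, `count_hashtags` returns `{"#Spark": 2, "#BigData": 1}`.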

2. Secondary Sorting

A secondary sort problem involves sorting the values associated with each key in the reduce phase. The technique is sometimes called value-to-key conversion. It lets us sort the values (in ascending or descending order) that are passed to each reducer.

Algorithm:

The algorithm reads each key-value pair, maps it into an RDD with the map transformation, and partitions the RDD by key with HashPartitioner. It then uses the mapValues transformation to collect the values for each key into a list sorted in ascending order. Finally, it writes the result to a file.
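As a rough illustration of these steps, here is a plain-Python sketch that parses pairs, assigns each key to a partition by hash, groups the values per key, and sorts each group in ascending order. It is not the Spark implementation. The `key,value` input format, the function names, the explicit grouping step, and the use of `zlib.crc32` as a stand-in for the hash used by HashPartitioner are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def parse_line(line):
    # Assumption: each input line looks like "key,value" with an integer value
    key, value = line.split(",")
    return key.strip(), int(value)


def partition_index(key, num_partitions):
    # Stand-in for hash partitioning: a deterministic hash modulo the partition count
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def secondary_sort(lines, num_partitions=2):
    """Return {key: [values sorted ascending]}."""
    # map + partition: route every (key, value) pair to the partition of its key,
    # so all values for one key end up in the same partition
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for line in lines:
        key, value = parse_line(line)
        partitions[partition_index(key, num_partitions)][key].append(value)
    # mapValues: sort the grouped values of each key, partition by partition
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = sorted(values)
    return result
```

Because every pair with the same key lands in the same partition, each key's values can be sorted locally, and the result does not depend on the number of partitions chosen.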

3. Retail Revenue

The problem is to determine the revenue for each product in the order list.

Algorithm:

The algorithm reads the product_id and sales amount from each line with the map transformation, then applies the reduceByKey and sortByKey transformations to determine the revenue for each product. Finally, it saves the result to a file listing each product_id and the revenue from that product.
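The same pipeline can be sketched in plain Python to show what each step contributes. This is an illustrative sketch only. The two-column `product_id,amount` line format and the function names are assumptions, and the real order file may carry more columns.

```python
from collections import defaultdict
from decimal import Decimal


def parse_order_line(line):
    # map: extract (product_id, amount) from a line
    # Assumption: each line looks like "product_id,amount"
    product_id, amount = line.split(",")
    return int(product_id), Decimal(amount.strip())


def revenue_per_product(lines):
    """Return [(product_id, revenue)] sorted by product_id."""
    # reduceByKey: sum the amounts that share the same product_id
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for product_id, amount in map(parse_order_line, lines):
        totals[product_id] += amount
    # sortByKey: order the result by product_id, ascending
    return sorted(totals.items())


def save_revenue(revenues, path):
    # action: write one "product_id<TAB>revenue" line per product
    with open(path, "w", encoding="utf-8") as out:
        for product_id, revenue in revenues:
            out.write(f"{product_id}\t{revenue}\n")
```

`Decimal` is used here instead of floating point so that currency amounts add up exactly. That choice belongs to this sketch and is not taken from the original solution.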

References:

https://www.oreilly.com/library/view/data-algorithms/9781491906170/ch01.html

https://stdatalabs.com/2017/02/mapreduce-vs-spark-secondary-sor/

https://www.quora.com/What-is-secondary-sort-in-Hadoop-and-how-does-it-work

https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/rangepartitioner.html