ICP 08 : Apache Spark Introduction - acikgozmehmet/BigDataProgramming GitHub Wiki
ICP 08: Apache Spark Introduction
Apache Spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Installation:
https://www.youtube.com/watch?v=FbwSbsOYqSE
Spark Programming
1. Counting Trending Hashtags
The problem is to count the hashtags in Twitter data and write the results to an output directory.
Algorithm:
The algorithm applies transformations such as flatMap and map to perform the Map function of the MapReduce paradigm, and the reduceByKey transformation for the Reduce function. It then performs an action to save the result to a file.
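The steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the original lab code: it assumes the input is plain text with one tweet per line, and the names `extract_hashtags`, `count_hashtags`, `input_path` and `output_path` are hypothetical. Lowercasing the tags so that `#Spark` and `#spark` are counted together is also an assumption.

```python
import re
from operator import add

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def extract_hashtags(line):
    """Return the lowercased hashtags found in one line of tweet text."""
    return [tag.lower() for tag in HASHTAG.findall(line)]


def count_hashtags(sc, input_path, output_path):
    """Count hashtags with flatMap -> map -> reduceByKey, then save the result.

    `sc` is an existing SparkContext; both paths are hypothetical.
    """
    counts = (
        sc.textFile(input_path)        # one tweet per line (assumed format)
        .flatMap(extract_hashtags)     # Map step: emit every hashtag in a line
        .map(lambda tag: (tag, 1))     # Map step: pair each hashtag with a count of 1
        .reduceByKey(add)              # Reduce step: sum the counts per hashtag
    )
    counts.saveAsTextFile(output_path)  # action that triggers the computation
```

With a live SparkContext `sc`, a call such as `count_hashtags(sc, "tweets.txt", "hashtag_counts")` would write (hashtag, count) pairs into the output directory. Note that saveAsTextFile fails if that directory already exists.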
2. Secondary Sorting
A secondary sort problem relates to sorting values associated with a key in the reduce phase. Sometimes, it is called value-to-key conversion. The secondary sorting technique will enable us to sort the values (in ascending or descending order) passed to each reducer.
Algorithm:
The algorithm reads the key-value pairs, maps them into an RDD using a transformation such as map, and then partitions them using HashPartitioner. It then uses the mapValues transformation to put the values for each key into a list sorted in ascending order. Finally, it writes the result to a file.
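A PySpark sketch of this approach might look like the following. Several things here are assumptions rather than the original implementation: the input is taken to be `key,value` lines with integer values, the names `parse_pair` and `secondary_sort` are hypothetical, and a groupByKey step is added to gather each key's values before mapValues sorts them. PySpark has no HashPartitioner class. Its `partitionBy` hash-partitions by key by default, which plays the same role here.

```python
def parse_pair(line):
    """Split a 'key,value' line into (key, int(value)). The format is assumed."""
    key, value = line.split(",")
    return key.strip(), int(value)


def secondary_sort(sc, input_path, output_path, num_partitions=4):
    """Group values by key and sort each key's values in ascending order.

    `sc` is an existing SparkContext; paths and partition count are hypothetical.
    """
    pairs = sc.textFile(input_path).map(parse_pair)
    sorted_values = (
        pairs.partitionBy(num_partitions)  # hash partitioning on the key
        .groupByKey()                      # assumption: collect values per key
        .mapValues(sorted)                 # sorted() returns an ascending list
    )
    sorted_values.saveAsTextFile(output_path)
```

Sorting inside mapValues holds all values for a key in memory at once, which is fine for small groups. For descending order, `mapValues(lambda vs: sorted(vs, reverse=True))` would be one option.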
3. Retail Revenue
The problem is to determine the revenue for each product in the order list.
Algorithm:
The algorithm reads product_id and sales amount from each line using the map transformation, then applies the reduceByKey and sortByKey transformations to determine the revenue for each product. Finally, it saves the result to a file listing each product_id and the revenue from that product.
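As a rough PySpark illustration of these steps, the sketch below assumes a hypothetical comma-separated layout with product_id in the first column and the sales amount in the second. The actual order file may use different columns, and the names `parse_sale` and `revenue_per_product` are illustrative only.

```python
from operator import add


def parse_sale(line):
    """Parse a 'product_id,amount' line into (int, float). The layout is assumed."""
    fields = line.split(",")
    return int(fields[0]), float(fields[1])


def revenue_per_product(sc, input_path, output_path):
    """Sum the sales amounts per product and sort the result by product_id.

    `sc` is an existing SparkContext; both paths are hypothetical.
    """
    revenue = (
        sc.textFile(input_path)
        .map(parse_sale)     # (product_id, amount) pairs
        .reduceByKey(add)    # total revenue per product
        .sortByKey()         # ascending by product_id
    )
    revenue.saveAsTextFile(output_path)
```

Parsing product_id as an integer makes sortByKey order the products numerically rather than lexicographically. If the identifiers are not numeric, they could be kept as strings instead.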
References:
https://www.oreilly.com/library/view/data-algorithms/9781491906170/ch01.html
https://stdatalabs.com/2017/02/mapreduce-vs-spark-secondary-sor/
https://www.quora.com/What-is-secondary-sort-in-Hadoop-and-how-does-it-work