Big_Data_Programming_ICP_1 (Module_2)
Getting started with Apache Spark
Spark Transformation: A transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output; each time a transformation is applied, a new RDD is created. The two most basic transformations are map() and filter().

Spark Action: Actions are RDD operations that produce non-RDD values; they materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type other than RDD[T] is an action.
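For example, a tiny PySpark snippet (assuming a local SparkContext and made-up data) that chains two transformations with an action could look like this:

```python
from pyspark import SparkContext

# Assumed local Spark context, for illustration only
sc = SparkContext("local[*]", "TransformationActionDemo")

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: lazily build new RDDs from the existing one
squares = numbers.map(lambda x: x * x)          # map() is a transformation
evens = squares.filter(lambda x: x % 2 == 0)    # filter() is a transformation

# Action: triggers execution and returns a non-RDD value to the driver
print(evens.collect())  # [4, 16]

sc.stop()
```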
Task1: Implementing a MapReduce program using Apache Spark.
Input file:
Python Code
Here we take a text file as input and perform transformations and actions on the input data in MapReduce fashion.
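A minimal PySpark sketch of this kind of job, assuming a classic word count over a hypothetical input file input.txt with results written to a hypothetical directory wordcount_output, might be:

```python
from pyspark import SparkContext

# Hypothetical file names; a local Spark context is assumed for illustration
sc = SparkContext("local[*]", "WordCountMapReduce")

lines = sc.textFile("input.txt")                        # RDD of lines from the text file

counts = (lines.flatMap(lambda line: line.split())      # map phase: split lines into words
               .map(lambda word: (word, 1))             # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))        # reduce phase: sum counts per word

counts.saveAsTextFile("wordcount_output")               # action: materialize and save results

sc.stop()
```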
Output:
Task2: Implementing a Secondary Sorting algorithm using Apache Spark
Here we take a text file as input containing comma-separated time-series values, split the data into lines, apply flatMap, and then perform transformations and actions on the input data in MapReduce fashion.
Input file:
Python Code
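As a rough sketch only, one common way to implement a secondary sort in PySpark is to group records by a primary key and then sort each group by a secondary field. The input file name (data.txt) and record layout (key,time,value per line) below are assumptions, and plain map is used instead of flatMap because of the assumed one-record-per-line format:

```python
from pyspark import SparkContext

# Assumed local context and hypothetical input: one "key,time,value" record per line
sc = SparkContext("local[*]", "SecondarySort")

lines = sc.textFile("data.txt")

# Parse each line into a (key, (time, value)) pair
pairs = (lines.map(lambda line: line.strip().split(","))
              .map(lambda f: (f[0], (int(f[1]), float(f[2])))))

# Group by the primary key, then sort each group by the secondary field (time)
grouped = pairs.groupByKey().mapValues(lambda vals: sorted(vals))

# Action: collect and print the groups ordered by primary key
for key, values in sorted(grouped.collect()):
    print(key, values)

sc.stop()
```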
Output: