Lesson Plan 8: Apache Spark
Steps for installing Spark and Scala
1. Install Spark with a suitable version of Hadoop.
2. Set SPARK_HOME in the user & system environment variables (see the example after these steps).
3. Add the bin folder of SPARK_HOME to the existing Path variable in both the user & system environment variables.
4. Download winutils for the specific Hadoop version and place it inside the bin folder of Spark.
5. Type spark-shell and check the Scala version.
6. Install IntelliJ and add the Scala plugin to it.
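For reference, on Windows the environment variables from steps 2 and 3 might look like the following (the install path is only an assumption; use wherever you extracted Spark):

```
SPARK_HOME = C:\spark\spark-2.1.0-bin-hadoop2.7
Path       = %Path%;%SPARK_HOME%\bin
```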
Creating a project in IntelliJ
1. Create a new project in IntelliJ and select Scala with sbt.
2. Click Next and give the project a name.
3. Create the Scala project as shown below.
4. Put the dependencies inside the build.sbt file:
```
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```
5. Create a new package inside the scala folder and create a Scala object inside the package.
The project folder structure looks like this:
6. Create a run configuration with the main class set to packagename.objectname, set the program arguments to input output, and select the default JDK.
1. Spark Programming: Word Count
Write a Spark program with an interesting use case, using text data as the input; the program should have at least two Spark transformations and two Spark actions.
Create an input file with some data inside the input folder; the program calculates the word count of that file. A sample is shown below.
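For example, the input file might contain a few lines of plain text (this sample is purely illustrative, not the actual ICP input):

```
spark makes big data processing simple
big data needs big tools
spark is fast and spark is simple
```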
The source folder of ICP 8 contains a file called "WordCount.scala" which has the code for word count, as shown below.
The code also uses two actions, namely count() and top().
Spark word count takes place in three phases: flatMap, map, and reduce.
The input is first split into lines and then into words with the help of flatMap; each word is then mapped to a (word, 1) pair with the map function, and the pairs finally undergo reduction by key into the (word, count) pairs of the word count.
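The repository's WordCount.scala is shown in the screenshots; as a stand-in, a minimal sketch of the pipeline described above could look like this (object and variable names are assumptions, not the repository's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local Spark context; args(0) is the input path, args(1) the output path
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile(args(0))

    // Transformation 1: flatMap splits every line into individual words
    val words = lines.flatMap(_.split("\\s+"))

    // Transformation 2: map pairs each word with the count 1
    val pairs = words.map(word => (word, 1))

    // reduceByKey sums the 1s per word, producing (word, count) pairs
    val counts = pairs.reduceByKey(_ + _)

    // Action: write the word counts to the output path
    counts.saveAsTextFile(args(1))

    sc.stop()
  }
}
```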
Output after the reduce phase:
Two Spark Transformations and Actions
1. The Spark transformations involved here are map() and flatMap().
2. The actions are count(), top(), and take(); see the snippet after this list.
3. The count() action returns the number of elements in an RDD; applied to the words it gives the total number of words in the input.
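Assuming the words and counts RDDs from the sketch above, the actions could be invoked like this:

```scala
// count() returns the total number of words in the input
val totalWords = words.count()

// top(5) returns the five largest (word, count) pairs
// under the implicit tuple ordering (alphabetical by word)
val topPairs = counts.top(5)

// take(5) returns the first five (word, count) pairs
val firstPairs = counts.take(5)

println(s"Total words: $totalWords")
```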
2. Secondary Sorting in MapReduce
Secondary sorting is used to sort the values in the reducer phase. Take any input of your interest and perform secondary sorting on it.
The source folder contains a file called "SecondarySort.scala" which has the code for secondary sorting, as shown below.
Create an input file with some data inside it, for example as shown below.
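For instance, the input could hold comma-separated key,value records (this format is an assumption used by the sketch further below, not the actual ICP input):

```
a,5
b,2
a,1
b,9
a,3
```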
The data is mapped into key-value pairs and grouped by key accordingly.
Set the number of partitions (i.e., the number of reducers) to 1 to match the given output; the default number of partitions is 2. A sketch of this approach is shown below.
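A minimal sketch of the grouping-based secondary sort described above, assuming the comma-separated input format from earlier (object and variable names are assumptions, not the repository's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SecondarySort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each line is assumed to hold "key,value" with a numeric value
    val pairs = sc.textFile(args(0)).map { line =>
      val fields = line.split(",")
      (fields(0), fields(1).trim.toInt)
    }

    // Group records by key into a single partition (one "reducer"),
    // then sort the values inside each group
    val sorted = pairs.groupByKey(1).mapValues(_.toList.sorted)

    sorted.saveAsTextFile(args(1))
    sc.stop()
  }
}
```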
Output after partitioning: