7 Run Scala in Spark - arwankhoiruddin/hadoopLab GitHub Wiki

What is Scala?

Scala ("scalable language") is a concise, general-purpose, high-level programming language that combines functional and object-oriented programming.

Why use Scala?

  • For Spark, Scala is usually much faster than Python: speedups of roughly 10x are common, and some workloads are reported to be up to 100x faster.
  • We can execute code line by line in spark-shell to prototype or debug our code

Prepare Gutenberg text data

wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt

Line-by-line example

For a walkthrough that builds the word count one command at a time in spark-shell, see:

https://www.javatpoint.com/apache-spark-word-count-example
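To see what each of those line-by-line transformations does before running them on Spark, here is a hedged sketch of the same steps on a plain Scala collection; `flatMap`, `map`, and `groupBy`/`sum` stand in for the RDD operations `flatMap`, `map`, and `reduceByKey` (the sample lines are made up for illustration):

```scala
object WordCountSteps {
  // Stand-in for sc.textFile: a few lines of sample text
  val lines = Seq("to be or not to be", "to see or not to see")

  // Step 1: split each line into words (analog of RDD flatMap)
  val words: Seq[String] = lines.flatMap(_.split(" "))

  // Step 2: pair each word with the count 1 (analog of RDD map)
  val pairs: Seq[(String, Int)] = words.map(word => (word, 1))

  // Step 3: sum the counts per word (analog of RDD reduceByKey)
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
}
```

Inspecting intermediate values like `words` and `pairs` in a REPL is exactly the debugging style the spark-shell bullet above describes.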

Running from a file

Create wordcount.scala

nano wordcount.scala

Then copy in these lines:

import org.apache.spark._

object SparkWordCount {
   def main(args: Array[String]) {
      // `sc` is the SparkContext that spark-shell provides;
      // this object is meant to be :load-ed into the shell, not compiled standalone
      val input = sc.textFile("pg20417.txt")
      val count = input.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      count.saveAsTextFile("outfile")
      println("OK")
   }
}

Run Spark Shell

spark-shell

Load wordcount.scala

:load wordcount.scala

Run the word count program

SparkWordCount.main(Array())

Quit Spark Shell

:quit

Check the output

Spark writes one part file per partition, so the output directory may contain more than one file:

cat outfile/part-*