7 Run Scala in Spark - arwankhoiruddin/hadoopLab GitHub Wiki
What is Scala?
Scala (scalable language) is a concise, high-level, general-purpose programming language that combines functional and object-oriented programming.
Why use Scala?
- Spark itself is written in Scala, so Scala code runs natively on the JVM. PySpark has to serialize data between the JVM and Python workers, so Scala is often noticeably faster, especially for RDD code with user-defined functions. Commonly cited figures like 10x (or even 100x) vary a lot by workload; with the DataFrame API the gap is usually small.
- We can execute code line by line in spark-shell to prototype and debug before committing it to a file.
Prepare Gutenberg text data
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
Line-by-line example
A line-by-line word-count walkthrough is available at:
https://www.javatpoint.com/apache-spark-word-count-example
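Before running anything on a cluster, the same flatMap → map → reduce pipeline can be tried on plain Scala collections. This sketch is ours, not from the linked tutorial: ordinary collections have no `reduceByKey`, so it is approximated here with `groupBy` plus a sum.

```scala
object WordCountSketch extends App {
  // Stand-in for an RDD of text lines
  val lines = List("to be or", "not to be")

  val counts = lines
    .flatMap(line => line.split(" "))                 // split each line into words
    .map(word => (word, 1))                           // pair each word with 1
    .groupBy(_._1)                                    // group the pairs by word
    .map { case (w, ps) => (w, ps.map(_._2).sum) }    // sum the 1s per word

  println(counts)   // "to" and "be" appear twice, "or" and "not" once
}
```

In Spark, the `groupBy` + sum pair collapses into the single `reduceByKey(_ + _)` call used below, which also avoids shuffling the full groups across the cluster.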
File-based example
Create wordcount.scala
nano wordcount.scala
Then copy these lines
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // `sc` is the SparkContext that spark-shell already provides
    val input = sc.textFile("pg20417.txt")
    val count = input.flatMap(line => line.split(" "))  // split each line into words
      .map(word => (word, 1))                           // pair each word with 1
      .reduceByKey(_ + _)                               // sum the counts per word
    count.saveAsTextFile("outfile")
    println("OK")
  }
}
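The object above only compiles inside spark-shell, because `sc` is a value the shell itself provides. To run the same job outside the shell (for example with spark-submit), the program has to create its own context. A minimal sketch, with the object name, app name, and `local[*]` master as our own assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneWordCount {
  def main(args: Array[String]): Unit = {
    // Create our own context instead of relying on spark-shell's `sc`;
    // "local[*]" runs Spark locally on all available cores.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("pg20417.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("outfile")
    sc.stop()   // release the context's resources when done
  }
}
```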
Run Spark Shell
spark-shell
Load wordcount.scala
:load wordcount.scala
Run wordcount program
SparkWordCount.main(Array())
Quit Spark Shell
:quit
Check the output. Spark writes the result as a directory of part files, one per partition:
cat outfile/part-00000
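The part files list words in whatever order the shuffle produced. To see the most frequent words instead, the pairs can be sorted by count before saving (RDDs support this via `sortBy`); the collection-level idea, with made-up counts, looks like:

```scala
object TopWords extends App {
  // Hypothetical (word, count) pairs standing in for the RDD's contents
  val counts = Map("the" -> 120, "of" -> 85, "whale" -> 40, "a" -> 97)

  // Sort descending by count and keep the top 3, mirroring
  // rdd.sortBy(-_._2).take(3) in Spark.
  val top3 = counts.toSeq.sortBy(-_._2).take(3)

  println(top3)   // (the,120), (a,97), (of,85)
}
```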