Spark Project Setup in Scala

This tutorial explains how to set up a project in Scala to work with Spark. Make sure that you have Oracle JDK 8, Maven, and IntelliJ installed and configured as explained in a previous tutorial. Also, make sure that the Scala plugin is installed in IntelliJ. A walkthrough video for this tutorial is available.

We will start by creating a new Maven project using the command line. This time, we will use a different archetype (project template) that includes Scala.

```
mvn -B archetype:generate -DarchetypeGroupId=net.alchim31.maven -DarchetypeArtifactId=scala-archetype-simple -DgroupId=edu.ucr.cs.cs226.ucrnetid -DartifactId=spark-scala
```

Import the project into IntelliJ. Edit the pom.xml file and add the following Spark dependency inside the dependencies section. Note that the suffix of the artifact ID (_2.12) is the Scala binary version and must match the Scala version configured in your pom.xml file.

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.1</version>
</dependency>
```

Right-click on the pom.xml file and choose Maven -> Reload Project.

Now, edit the App.scala file so that it looks as follows.

```scala
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author ${user.name}
 */
object App {

  def main(args: Array[String]): Unit = {
    // Run Spark in local mode inside this JVM
    val conf = new SparkConf().setAppName("scalatest").setMaster("local")
    val sc = new SparkContext(conf)

    try {
      // Load the input file given as the first program argument
      val lines = sc.textFile(args(0))
      // Compute the length of each line, then sum all the lengths
      val lineLengths = lines.map(_.length)
      val totalLength: Int = lineLengths.reduce(_ + _)
      println(s"Total length $totalLength")
    } finally {
      // Always stop the SparkContext to release its resources
      sc.stop()
    }
  }
}
```

Download the nasa.tsv sample file to use as input. If you run the program now, it will fail because the input file is not specified in the program arguments. Edit the run configuration, add the input file name to the program arguments, and run again to get the correct result.
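If you want a clearer failure when no argument is passed, you could add a small guard at the top of the main method. This is just a sketch of the idea; the usage message is illustrative:

```scala
// Hypothetical guard for the top of main: fail fast with a usage message
// instead of an ArrayIndexOutOfBoundsException when no argument is given.
if (args.isEmpty) {
  System.err.println("Usage: App <input file>")
  sys.exit(1)
}
```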

You can now further extend this program by following the documentation of Apache Spark.
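For example, the following sketch builds on the same pattern to count how many lines of the input file contain a keyword. The object name, application name, and the use of a second program argument are illustrative, not part of the original tutorial:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative extension of the example above: count the lines of the
// input file that contain a keyword given as the second argument.
object FilterCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filtercount").setMaster("local")
    val sc = new SparkContext(conf)

    try {
      val lines = sc.textFile(args(0)) // input file, e.g., nasa.tsv
      val keyword = args(1)            // keyword to search for
      // Transformations are lazy; the job runs only when count() is called
      val matches = lines.filter(_.contains(keyword)).count()
      println(s"$matches lines contain '$keyword'")
    } finally {
      sc.stop()
    }
  }
}
```

You could run it with two program arguments, e.g., nasa.tsv and a keyword that appears in the file.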
