Spark Project Setup in Java - aseldawy/bdtutorials GitHub Wiki

This tutorial shows how to create your first Spark project in Java. Please make sure that you have the development setup ready as explained in an earlier tutorial, that is, you have JDK 8, Maven, and IntelliJ installed and configured. Check this video walk-through for your convenience.

First, create a new Maven project as we did earlier.

```shell
mvn -B archetype:generate -DgroupId=edu.ucr.cs.cs226.ucrnetid -DartifactId=spark-demo -DarchetypeArtifactId=maven-archetype-quickstart
```

The next step is to add a new dependency for Spark. Search for "Maven Spark" and add the dependency to the pom.xml file.

```xml
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.1</version>
</dependency>
```

Also, set the project Java language level to 1.8 to enable lambda expressions, which we will use in this example.

```xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
```
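Lambda expressions are the reason we need the 1.8 language level. The same map/reduce style that Spark uses also works on plain Java 8 streams, so here is a minimal, Spark-free sketch of the pattern (class and method names are illustrative) that sums the lengths of a list of lines:

```java
import java.util.Arrays;
import java.util.List;

public class LambdaDemo {
    // Sum the lengths of all lines using Java 8 streams and lambdas,
    // mirroring the map/reduce we will write with Spark RDDs
    static int totalLength(List<String> lines) {
        return lines.stream()
                    .map(s -> s.length())        // map each line to its length
                    .reduce(0, (a, b) -> a + b); // sum all lengths
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello", "spark");
        System.out.println(totalLength(lines)); // prints 10
    }
}
```

The Spark version below looks almost identical; the difference is that Spark can distribute the map and reduce steps across a cluster.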

Right-click the pom.xml file and choose "Reimport Project" to ensure that the new changes are loaded.

Next, edit the main class and add the following code that initializes a Spark context.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

...

// Configure Spark to run locally with the application name "test"
SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

// Load the text file as an RDD of lines, map each line to its length,
// and sum all the lengths with a reduce
JavaRDD<String> lines = sc.textFile("nasa.tsv");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
System.out.printf("Total length is %d%n", totalLength);

// Release Spark resources when done
sc.close();
```

You can download the sample nasa.tsv file or work on any text file. Run the main class using the green arrow.

Now, let's make the code a little more generic by taking the input file name as a command-line argument. Change the code as shown to use the command-line argument.

```java
String filename = args[0];
JavaRDD<String> lines = sc.textFile(filename);
```
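Accessing `args[0]` throws an `ArrayIndexOutOfBoundsException` if you forget to set the program argument, so a small guard gives a friendlier error. A minimal, Spark-free sketch (the class name and usage message are illustrative):

```java
public class ArgsDemo {
    // Return the input file name, or null if no argument was supplied
    static String inputFile(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: spark-demo <input-file>");
            return null;
        }
        return args[0];
    }

    public static void main(String[] args) {
        String filename = inputFile(args);
        if (filename == null)
            System.exit(1); // exit with an error code instead of crashing
        System.out.println("Reading " + filename);
    }
}
```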

Now set the program argument in the run configuration.

You can continue from there and use the Spark documentation to enrich your code.
