Spark command line

This tutorial describes how to run your Spark applications from the command-line. This is the primary method of running a Spark application in production and on a real distributed cluster. As a prerequisite, you need to have a project, either in Java or Scala, that you can run from IntelliJ. This is further explained in two previous tutorials for Java and Scala. For additional help, a walkthrough video is available.

The first step is to package your project in a JAR file. To do that, from the command line and at your project directory, type mvn package. This command will compile your code and produce a JAR file under the target directory. If successful, the output will look similar to the following:

[INFO] Scanning for projects...
[INFO] 
[INFO] ---------------< edu.ucr.cs.cs226.ucrnetid:spark-scala >----------------
[INFO] Building spark-scala 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
...
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ spark-scala ---
[INFO] Building jar: .../spark-scala/target/spark-scala-1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.243 s
[INFO] Finished at: 2020-10-01T17:56:08-07:00
[INFO] ------------------------------------------------------------------------

After getting the success message, double-check that the JAR file is now under the target directory. For example, the JAR file in this case will be named spark-scala-1.0-SNAPSHOT.jar.
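
To verify from the command line, assuming a Unix-like shell, you can list the JAR files under the target directory:

ls target/*.jar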

Now, let's try to run this JAR file using the java command. Do not forget to replace ucrnetid with yours.

java -cp target/spark-scala-1.0-SNAPSHOT.jar edu.ucr.cs.cs226.ucrnetid.App nasa.tsv

It will fail because your program depends on Spark and Scala libraries that are not part of your JAR file. You might get an error like the following one:

Exception in thread "main" java.lang.NoClassDefFoundError: scala/Function2
	at edu.ucr.cs.cs226.ucrnetid.App.main(App.scala)
Caused by: java.lang.ClassNotFoundException: scala.Function2
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

To solve this problem, we need to install Spark binaries and run the JAR file inside the Spark environment.

Navigate to https://spark.apache.org and go to the Download page. Choose the latest binary package, download the compressed archive, and extract it in your Applications directory.

cd ~/Applications
curl http://mirrors.ibiblio.org/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz | tar -xvz

To properly configure Spark, we will set two environment variables, SPARK_HOME and PATH. Set SPARK_HOME to the extracted Spark directory and extend the PATH environment variable to include $SPARK_HOME/bin.

echo 'export SPARK_HOME=$HOME/Applications/spark-3.0.1-bin-hadoop3.2' >> ~/.profile
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.profile
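
These two lines take effect in new terminal sessions. To apply them to the current session and double-check the value, you can run the following, assuming a Bourne-compatible shell that reads ~/.profile:

source ~/.profile
echo $SPARK_HOME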

To make sure that Spark is correctly configured, open a new terminal window and type spark-submit --version. It should print out the version of the installed Spark. The output should look similar to the following.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
                        
Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_251
Branch HEAD
Compiled by user ubuntu on 2020-08-28T08:58:35Z
Revision 2b147c4cd50da32fe2b4167f97c8142102a0510d
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.

Now, open a new terminal window at your project directory. If you use IntelliJ, you can open a terminal window right inside IntelliJ; if you do not see this option, right-click the root of your project and choose Open in Terminal. Note: if IntelliJ was already running before you installed Spark, you will need to restart it so that it picks up the new environment variables.

In the terminal window, run your program using the following command.

spark-submit --class edu.ucr.cs.cs226.ucrnetid.App target/spark-scala-1.0-SNAPSHOT.jar nasa.tsv
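
For reference, the class being submitted is the one you created in the previous Java or Scala tutorial. Below is a minimal sketch of what such a Scala main class could look like; the application name and the line-counting logic are illustrative assumptions, and only the package, object name, and argument handling need to match your project.

package edu.ucr.cs.cs226.ucrnetid

import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    // The input path (e.g., nasa.tsv) is passed as the first program argument
    val inputPath = args(0)
    // spark-submit supplies the master URL, so it is not hard-coded here
    val conf = new SparkConf().setAppName("CS226 Demo")
    val sc = new SparkContext(conf)
    try {
      // Illustrative workload: count the lines of the input file
      val lineCount = sc.textFile(inputPath).count()
      println(s"$inputPath contains $lineCount lines")
    } finally {
      sc.stop()
    }
  }
}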

The --class option specifies the main class that you want to run. If your application contains only one main class, you can make it the default so that you do not have to type the class name every time you run your application. To do that, edit your pom.xml file and add the following section.

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>edu.ucr.cs.cs226.ucrnetid.App</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>

Make sure that you specify the fully qualified name of your class, including the package, and replace ucrnetid with yours. Next, rebuild the JAR file by running mvn clean package. Now, you can run your program using the following command.

spark-submit target/spark-scala-1.0-SNAPSHOT.jar nasa.tsv

Note: For Windows users, you might get the following error message when you run spark-submit.

java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

To resolve this problem, download the winutils.exe binary from this page. Then, place this file under Applications/hadoop/bin/. After that, add a new environment variable HADOOP_HOME that points to Applications/hadoop, as shown in the example below. Finally, start a new terminal window to apply the changes, and spark-submit should work correctly. If you run from IntelliJ, you might need to restart it to pick up the new environment variable.
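
For example, assuming winutils.exe is placed under %USERPROFILE%\Applications\hadoop\bin, you can set the variable persistently from a Command Prompt with setx; the path here is an assumption, so adjust it to the actual location:

:: Persist HADOOP_HOME for the current user (new terminal windows will see it)
setx HADOOP_HOME "%USERPROFILE%\Applications\hadoop"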
