spark_g5k

General Overview and Basic Usage

spark_g5k is a script that provides a command-line interface to manage a Spark cluster. As with the other scripts, the available options can be listed with the help option:

$ spark_g5k -h

Apache Spark can be deployed in three modes: standalone, on top of Apache Mesos, and on top of Hadoop YARN. spark_g5k supports both the standalone and the Hadoop YARN deployments. Here we present a basic usage of the YARN mode.

First, we need to reserve a set of nodes with the oarsub command.

$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

As we are going to use Spark on top of Hadoop, we first deploy a Hadoop cluster. More information about Hadoop cluster deployment can be found on the hg5k page.

$ hg5k --create $OAR_NODEFILE --version 2
$ hg5k --bootstrap /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz
$ hg5k --initialize --start

Suppose that our Hadoop cluster has been assigned the id 1. We can now create a Spark cluster linked to it, using the --hid option to refer to the Hadoop cluster. The Spark cluster will use the same nodes as the Hadoop cluster.

$ spark_g5k --create YARN --hid 1

Now we need to install Apache Spark on all the nodes of the cluster by providing a path to the binaries. In this example we use a version publicly available in all the sites of Grid'5000.

$ spark_g5k --bootstrap /home/mliroz/public/sw/spark/spark-1.5.1-bin-hadoop2.6.tgz

Once installed, we need to initialize the cluster. This action configures the nodes according to the parameters specified in the configuration files (if any) and to the characteristics of the cluster's machines. We can also start the services with the same command:

$ spark_g5k --initialize --start

Note that by default initialize performs a minimal amount of configuration. Optionally, the user can add the feeling_lucky option, which dimensions the executors (memory, cores and number) according to the resource manager's containers (YARN containers in our example).
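To give an idea of what dimensioning the executors means, these are the standard Spark properties involved. The sketch below is purely illustrative: the values are arbitrary examples, not what spark_g5k would actually compute, and it is written as the beginning of a PySpark program rather than something spark_g5k requires.

from pyspark import SparkConf

# Standard Spark properties controlling executor dimensioning. The values
# here are arbitrary examples; with feeling_lucky, spark_g5k derives suitable
# values from the YARN container configuration instead.
conf = (SparkConf()
        .set("spark.executor.instances", "4")   # number of executors
        .set("spark.executor.cores", "4")       # cores per executor
        .set("spark.executor.memory", "4g"))    # memory per executor

print(conf.toDebugString())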

Now our cluster is ready to execute jobs or to be accessed through the shell. We start a Spark shell in Python with the following command:

$ spark_g5k --shell ipython
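As a quick check that the cluster is working, we can type a small computation into the shell. This is just a sketch; sc is the SparkContext that the shell creates for us:

# Inside the pyspark/ipython shell, where sc is already defined.
data = sc.parallelize(range(1, 1001))       # distribute a small dataset over the cluster
squares = data.map(lambda x: x * x)         # transform each element in parallel
print(squares.reduce(lambda a, b: a + b))   # sum of the squares of 1..1000 -> 333833500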

Note that we specified ipython to get the additional features provided by that shell. If we prefer the Scala shell, we simply execute:

$ spark_g5k --shell scala

Alternatively, we can execute a job through Spark's spark-submit interface. With spark_g5k we use the following command:

$ spark_g5k --scala_job /home/mliroz/public/sw/spark/spark-examples-1.5.1-hadoop2.6.0.jar --main_class org.apache.spark.examples.SparkPi
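For reference, the SparkPi class used above estimates π with a Monte Carlo method: it samples random points in the unit square and counts how many fall inside the quarter circle. Roughly the same computation, expressed in PySpark, could be typed into the shell started earlier; the sample size is an arbitrary choice and sc is again the SparkContext provided by the shell:

import random

n = 100000  # number of random samples (arbitrary)

def inside(_):
    # Draw a point uniformly in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))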

When we are done, we should delete the clusters in order to remove all temporary files created during execution.

$ spark_g5k --delete
$ hg5k --delete