spark_g5k
General Overview and Basic Usage
spark_g5k is a script providing a command-line interface to manage a Spark cluster. As with the other scripts, the available options can be listed with the help command:
$ spark_g5k -h
Apache Spark can be deployed in three modes: standalone, on top of Apache Mesos, and on top of Hadoop YARN. spark_g5k supports both the standalone and the Hadoop YARN deployments. Here we present a basic usage of the YARN mode.
First, we need to reserve a set of nodes with the oarsub command.
$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2
As we are going to use Spark on top of Hadoop, we first deploy a Hadoop cluster. More information about how to deploy a Hadoop cluster can be found in the hg5k page.
$ hg5k --create $OAR_NODEFILE --version 2
$ hg5k --bootstrap /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz
$ hg5k --initialize --start
Suppose that our Hadoop cluster has been assigned the id 1. We can now create a Spark cluster linked to it, using the --hid option to refer to the Hadoop cluster. The Spark cluster will use the same nodes as the Hadoop cluster.
$ spark_g5k --create YARN --hid 1
Now we need to install Apache Spark on all the nodes of the cluster by providing a path to the binaries. In this example we use a version publicly available on all Grid'5000 sites.
$ spark_g5k --bootstrap /home/mliroz/public/sw/spark/spark-1.5.1-bin-hadoop2.6.tgz
Once installed, we need to initialize the cluster. This action configures the nodes depending both on the parameters specified through the configuration files (if any) and on the characteristics of the machines of the cluster. We can also start the services with the same command:
$ spark_g5k --initialize --start
Note that by default initialize performs only a minimal configuration. Optionally, the user can add the feeling_lucky option, which dimensions the executors (memory, cores and number) according to the containers offered by the resource manager (YARN containers in our example).
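The exact syntax is given by spark_g5k -h; assuming the option is passed as an argument to --initialize (an assumption, not verified here), the call would look like:
$ spark_g5k --initialize feeling_lucky --start
In Spark terms, this dimensioning corresponds to the standard executor properties spark.executor.instances, spark.executor.cores and spark.executor.memory.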
Now our cluster is ready to execute jobs or to be accessed through the shell. We start a Spark shell in Python with the following command:
$ spark_g5k --shell ipython
Note that we specified ipython to get the additional features provided by this shell. If we prefer the Scala shell, we simply have to execute:
$ spark_g5k --shell scala
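Once inside the shell we can run a quick sanity check. The following snippet assumes the Python shell, where the SparkContext is exposed as sc (the PySpark default); the computation is just an illustrative sum of squares distributed over the cluster:
>>> sc.parallelize(range(1000)).map(lambda x: x * x).sum()
332833500
If the job completes and returns this value, the executors have been correctly allocated by YARN.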
Alternatively, we can execute a job through Spark's spark-submit interface. In spark_g5k we can use the following command:
$ spark_g5k --scala_job /home/mliroz/public/sw/spark/spark-examples-1.5.1-hadoop2.6.0.jar --main_class org.apache.spark.examples.SparkPi
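The SparkPi example estimates π by Monte Carlo sampling. As a rough sketch of what the submitted job computes (illustrative only, not the code shipped in the jar), the same idea can be expressed in the Python shell, again assuming sc is available:
>>> from random import random
>>> n = 1000000  # number of random points in the unit square
>>> inside = sc.parallelize(range(n)).filter(lambda _: random() ** 2 + random() ** 2 <= 1).count()
>>> 4.0 * inside / n  # fraction inside the quarter circle times 4, roughly 3.14
The driver output of the submitted jar should report a similar approximation of π.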
When we are done, we should delete the clusters in order to remove all temporary files created during execution.
$ spark_g5k --delete
$ hg5k --delete