Spark

Apache Spark is a fast and general engine for large-scale data processing. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
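
The iterative pattern is easy to see in PySpark. The following is a minimal sketch, assuming the interactive pyspark shell (where a SparkContext is already bound to sc) and a hypothetical whitespace-separated data file points.txt; the cache() call keeps the parsed dataset in cluster memory, so only the first action pays the cost of reading from disk:

# A minimal sketch, typed at the pyspark prompt where sc is predefined.
# points.txt is a hypothetical whitespace-separated file of "x y" pairs.
points = sc.textFile("points.txt") \
           .map(lambda line: [float(v) for v in line.split()])
points.cache()  # keep the parsed RDD in cluster memory

# Subsequent queries hit the in-memory copy instead of re-reading the file.
n = points.count()
mean_x = points.map(lambda p: p[0]).sum() / n
mean_y = points.map(lambda p: p[1]).sum() / n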

Installation

Download Spark 1.3.1 Pre-built for Hadoop 2.4 and later:

$ cd /scratch
$ wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.4.tgz
$ tar xvfz spark-1.3.1-bin-hadoop2.4.tgz

Install Spark 1.3.1 to the Lustre file system:

# cd /pfs/sw/bigdata/
# mkdir spark-1.3.1
# cd spark-1.3.1
# cp -r /scratch/spark-1.3.1-bin-hadoop2.4/* .

Test Spark 1.3.1[1]:

$ module load spark

$ run-example SparkPi 10 2>/dev/null
Pi is roughly 3.142248
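
For the curious, SparkPi estimates π by Monte Carlo sampling: it scatters random points over the unit square and counts the fraction that land inside the quarter circle. A rough PySpark equivalent (a sketch, not the bundled example; it assumes sc from the pyspark shell) looks like this:

import random

NUM_SAMPLES = 1000000

def inside(_):
    # Draw a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))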

$ spark-shell --master local[2] 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_13)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> :quit
Stopping spark context.

$ pyspark --master local[2] 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.6.6 (r266:84292, Sep 11 2012 08:34:23)
SparkContext available as sc, HiveContext available as sqlContext.
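>>> rdd = sc.parallelize(range(100))   # a quick sanity check (sketch)
>>> rdd.filter(lambda x: x % 7 == 0).collect()   # multiples of 7 below 100
[0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98]
>>> rdd.sum()
4950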
>>> quit()

Running Spark on YARN

In the above tests, we ran Spark locally. To run on a cluster, Spark can either run by itself (in standalone mode) or run on top of several existing cluster managers[2]. Since we've deployed Hadoop on the GPU nodes, we'll run Spark on YARN (Hadoop NextGen)[3], in both client and cluster deploy modes.
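
In yarn-cluster mode the driver itself runs inside a YARN container, so an application has to construct its own SparkContext instead of relying on the shell-provided one. A minimal self-contained script might look like the following sketch (the file name my_app.py and the app name are ours, not part of the Spark distribution); it would be submitted with spark-submit just like the bundled pi.py example below:

# my_app.py -- a minimal self-contained Spark application (hypothetical),
# suitable for spark-submit on YARN.
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Don't hard-code the master here; let spark-submit --master supply it.
    conf = SparkConf().setAppName("MyApp")
    sc = SparkContext(conf=conf)

    total = sc.parallelize(range(1000)).map(lambda x: x * x).sum()
    # In yarn-cluster mode this print goes to the driver container's
    # stdout log, not to the submitting terminal (see below).
    print("sum of squares: %d" % total)

    sc.stop()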

Make sure the hadoop module is loaded, in addition to spark:

$ module load spark
$ module load hadoop

The hadoop module, among other things, defines the environment variable HADOOP_CONF_DIR, which tells Spark where to find the YARN configuration:

HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

Load the python module as well, then start the PySpark shell in yarn-client mode:

$ module load python
$ pyspark --master yarn-client 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.8 (default, Sep 23 2014 11:31:05)
SparkContext available as sc, HiveContext available as sqlContext.
>>> quit()

We can also submit the Python Pi example as an application with spark-submit. Here the driver runs on the local machine, so the result is printed straight to standard output:

$ spark-submit ${SPARK_HOME}/examples/src/main/python/pi.py 10 2>/dev/null
Pi is roughly 3.141240

Next, submit the same example in yarn-cluster mode, where the driver itself runs inside a YARN container on the cluster:

$ spark-submit \
    --master yarn-cluster \
    ${SPARK_HOME}/examples/src/main/python/pi.py 10
...
	 ApplicationMaster host: gpu-8.local
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1430159867536
	 final status: SUCCEEDED
	 tracking URL: http://gpu-1:8088/proxy/application_1429807409426_0032/A

But wait! Where is the output? The command above printed a lot of text to standard error, but nothing like "Pi is roughly 3.141240" to standard output, as it did when we ran Spark locally. The reason is that in yarn-cluster mode the driver runs inside the ApplicationMaster's container, so anything it prints goes to that container's logs rather than to our terminal.

$ yarn logs -applicationId application_1429807409426_0032
15/04/27 11:46:17 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Logs not available at /tmp/logs/dong/logs/application_1429807409426_0032
Log aggregation has not completed or is not enabled.

As it turns out, we hadn't enabled log aggregation for YARN, so the logs were retained locally on each machine, under YARN_APP_LOGS_DIR. In this case they were on gpu-8, where the ApplicationMaster (and thus the driver) ran, under /data/logs/userlogs:

[root@gpu-8 ~]# cat /data/logs/userlogs/application_1429807409426_0032/container_1429807409426_0032_01_000001/stdout 
Pi is roughly 3.138896

YARN Log Aggregation

Let's enable YARN log aggregation.

Add the following stanza to $HADOOP_HOME/etc/hadoop/yarn-site.xml[4][5]:

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
    <description>Whether to enable log aggregation</description>
  </property>

Restart YARN:

[hduser@gpu-1 ~]$ stop-yarn.sh; start-yarn.sh

Rerun the sample Spark program on YARN:

$ spark-submit \
    --master yarn-cluster \
    ${SPARK_HOME}/examples/src/main/python/pi.py 10
...
	 ApplicationMaster host: gpu-7.local
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1430161817980
	 final status: SUCCEEDED
	 tracking URL: http://gpu-1:8088/proxy/application_1430161715740_0001/A

The logs are now aggregated into HDFS, under hdfs:///tmp/logs/dong/logs (the destination is set by yarn.nodemanager.remote-app-log-dir, which defaults to /tmp/logs):

$ yarn logs -applicationId application_1430161715740_0001 | grep Pi
15/04/27 12:15:26 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Pi is roughly 3.139288

References

  1. Spark Overview
  2. Spark: Cluster Mode Overview
  3. Running Spark on YARN
  4. Simplifying user-logs management and access in YARN
  5. $HADOOP_HOME/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml