Spark - shawfdong/hyades GitHub Wiki
Apache Spark is a fast and general engine for large-scale data processing. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
Download Spark 1.3.1 Pre-built for Hadoop 2.4 and later:
$ cd /scratch
$ wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.4.tgz
$ tar xvfz spark-1.3.1-bin-hadoop2.4.tgz
Install Spark 1.3.1 to the Lustre file system:
# cd /pfs/sw/bigdata/
# mkdir spark-1.3.1
# cd spark-1.3.1
# cp -r /scratch/spark-1.3.1-bin-hadoop2.4/* .
Test Spark 1.3.1[1]:
$ module load spark
$ run-example SparkPi 10 2>/dev/null
Pi is roughly 3.142248
$ spark-shell --master local[2] 2>/dev/null
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.3.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_13)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> :quit
Stopping spark context.
$ pyspark --master local[2] 2>/dev/null
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.3.1
/_/
Using Python version 2.6.6 (r266:84292, Sep 11 2012 08:34:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> quit()
In the above tests, we ran Spark locally. To run on a cluster, Spark can either run by itself (standalone mode) or on top of several existing cluster managers[2]. Since we've deployed Hadoop on the GPU nodes, we'll run Spark on YARN (Hadoop NextGen) in cluster mode[3].
Make sure the hadoop module is loaded:
$ module load spark
$ module load hadoop
The hadoop module, among other things, defines the environment variable HADOOP_CONF_DIR:
HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
$ module load python
$ pyspark --master yarn-client 2>/dev/null
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.3.1
/_/
Using Python version 2.7.8 (default, Sep 23 2014 11:31:05)
SparkContext available as sc, HiveContext available as sqlContext.
>>> quit()
$ spark-submit ${SPARK_HOME}/examples/src/main/python/pi.py 10 2>/dev/null
Pi is roughly 3.141240
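The pi.py example estimates π by Monte Carlo sampling: it scatters random points and counts the fraction that falls inside the unit circle, with Spark parallelizing the sampling across partitions. The serial Python sketch below illustrates the same idea (the function name and sample count are ours, not from pi.py):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Estimate pi by sampling points in the unit square.

    The fraction landing inside the quarter circle x^2 + y^2 <= 1
    approaches pi/4 as the sample count grows.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print("Pi is roughly %f" % estimate_pi(1000000))
```

In the Spark version, the serial loop is replaced by a parallelized map over the sample indices followed by a reduce that sums the hit counts, which is why larger clusters can push the sample count (and the accuracy) up cheaply.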
$ spark-submit \
--master yarn-cluster \
${SPARK_HOME}/examples/src/main/python/pi.py 10
...
ApplicationMaster host: gpu-8.local
ApplicationMaster RPC port: 0
queue: default
start time: 1430159867536
final status: SUCCEEDED
tracking URL: http://gpu-1:8088/proxy/application_1429807409426_0032/A
But wait! Where is the output when running Spark on YARN? The above command printed a lot of text to standard error, but nothing (no line like "Pi is roughly 3.141240") to standard output, as it did when we ran Spark locally. In yarn-cluster mode the driver runs inside the ApplicationMaster on a cluster node, so the driver's standard output ends up in that container's logs rather than in our terminal.
$ yarn logs -applicationId application_1429807409426_0032
15/04/27 11:46:17 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Logs not available at /tmp/logs/dong/logs/application_1429807409426_0032
Log aggregation has not completed or is not enabled.
As it turned out, we hadn't enabled log aggregation for YARN, so the logs were retained locally on each machine under YARN_APP_LOGS_DIR. In the above case, they were on gpu-8 under /data/logs/userlogs:
[root@gpu-8 ~]# cat /data/logs/userlogs/application_1429807409426_0032/container_1429807409426_0032_01_000001/stdout
Pi is roughly 3.138896
Let's enable YARN log aggregation.
Add the following stanza to $HADOOP_HOME/etc/hadoop/yarn-site.xml[4][5]:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Whether to enable log aggregation</description>
</property>
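Optionally, you can also control where the aggregated logs land on HDFS and how long they are kept. The two properties below are standard YARN settings whose defaults (/tmp/logs, and -1 meaning keep forever) match what we observe later; the 7-day retention value is only illustrative:

```xml
<!-- HDFS directory where aggregated logs are stored (default: /tmp/logs) -->
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>
<!-- Delete aggregated logs after 7 days (default: -1, i.e. never) -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
```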
Restart YARN:
[hduser@gpu-1 ~]$ stop-yarn.sh; start-yarn.sh
Rerun the sample Spark program on YARN:
$ spark-submit \
--master yarn-cluster \
${SPARK_HOME}/examples/src/main/python/pi.py 10
...
ApplicationMaster host: gpu-7.local
ApplicationMaster RPC port: 0
queue: default
start time: 1430161817980
final status: SUCCEEDED
tracking URL: http://gpu-1:8088/proxy/application_1430161715740_0001/A
The logs are now aggregated in hdfs:///tmp/logs/dong/logs:
$ yarn logs -applicationId application_1430161715740_0001 | grep Pi
15/04/27 12:15:26 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Pi is roughly 3.139288