Spark - shawfdong/hyades GitHub Wiki
Apache Spark is a fast and general engine for large-scale data processing. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
Download Spark 1.3.1 Pre-built for Hadoop 2.4 and later:
```
$ cd /scratch
$ wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.4.tgz
$ tar xvfz spark-1.3.1-bin-hadoop2.4.tgz
```
Install Spark 1.3.1 to the Lustre file system:
```
# cd /pfs/sw/bigdata/
# mkdir spark-1.3.1
# cd spark-1.3.1
# cp -r /scratch/spark-1.3.1-bin-hadoop2.4/* .
```
Test Spark 1.3.1[1]:
```
$ module load spark
$ run-example SparkPi 10 2>/dev/null
Pi is roughly 3.142248
$ spark-shell --master local[2] 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_13)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> :quit
Stopping spark context.
$ pyspark --master local[2] 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.6.6 (r266:84292, Sep 11 2012 08:34:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> quit()
```
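SparkPi (and the Python `pi.py` used below) estimates π by Monte Carlo sampling: throw random points into the unit square and count the fraction that land inside the quarter circle. Stripped of the Spark plumbing, the core computation looks roughly like this (a plain-Python sketch for illustration; the function name and seed are ours, not Spark's):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Estimate pi by sampling points uniformly in the unit square
    and counting the fraction that fall inside the unit circle."""
    rng = random.Random(seed)  # seeded for reproducibility (our choice)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # quarter-circle area / square area = pi/4, so scale by 4
    return 4.0 * inside / num_samples

if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(100000))
```

In the real `pi.py`, the sampling loop is distributed with `sc.parallelize(...).map(...).reduce(...)`, which is why the estimate varies slightly from run to run.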
In the above tests, we ran Spark locally. To run on a cluster, Spark can either run by itself (standalone mode) or on top of several existing cluster managers[2]. Since we've deployed Hadoop on the GPU nodes, we'll run Spark on YARN (Hadoop NextGen), in both yarn-client and yarn-cluster modes[3].
Make sure the hadoop module is loaded:
```
$ module load spark
$ module load hadoop
```
which, among other things, defines the environment variable HADOOP_CONF_DIR:
```
HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
```
Start an interactive PySpark shell on YARN in yarn-client mode:
```
$ module load python
$ pyspark --master yarn-client 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.8 (default, Sep 23 2014 11:31:05)
SparkContext available as sc, HiveContext available as sqlContext.
>>> quit()
```
Submit the example `pi.py`, first locally, then on YARN in yarn-cluster mode:
```
$ spark-submit ${SPARK_HOME}/examples/src/main/python/pi.py 10 2>/dev/null
Pi is roughly 3.141240
$ spark-submit \
    --master yarn-cluster \
    ${SPARK_HOME}/examples/src/main/python/pi.py 10
...
         ApplicationMaster host: gpu-8.local
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1430159867536
         final status: SUCCEEDED
         tracking URL: http://gpu-1:8088/proxy/application_1429807409426_0032/A
```
But wait! Where is the output when running Spark on YARN? The above command printed a lot of text to standard error, but nothing like "Pi is roughly 3.141240" to standard output, as it did when we ran Spark locally.
```
$ yarn logs -applicationId application_1429807409426_0032
15/04/27 11:46:17 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Logs not available at /tmp/logs/dong/logs/application_1429807409426_0032
Log aggregation has not completed or is not enabled.
```
As it turned out, we hadn't enabled log aggregation for YARN, so logs were retained locally on each node under YARN_APP_LOGS_DIR. In the above case, they were on gpu-8 under /data/logs/userlogs:
```
[root@gpu-8 ~]# cat /data/logs/userlogs/application_1429807409426_0032/container_1429807409426_0032_01_000001/stdout
Pi is roughly 3.138896
```
Let's enable YARN log aggregation.
Add the following stanza to $HADOOP_HOME/etc/hadoop/yarn-site.xml[4][5]:
```xml
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
  <description>Whether to enable log aggregation</description>
</property>
```
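Two related properties control where aggregated logs land in HDFS and how long they are kept. We left both at their defaults; the values below are only an illustrative sketch (the default aggregation directory is /tmp/logs, which is why the logs end up under hdfs:///tmp/logs):

```xml
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
  <description>HDFS directory for aggregated logs (the default)</description>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
  <description>Keep aggregated logs for 7 days; -1 keeps them forever</description>
</property>
```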
Restart YARN:
```
[hduser@gpu-1 ~]$ stop-yarn.sh; start-yarn.sh
```
Rerun the sample Spark program on YARN:
```
$ spark-submit \
    --master yarn-cluster \
    ${SPARK_HOME}/examples/src/main/python/pi.py 10
...
         ApplicationMaster host: gpu-7.local
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1430161817980
         final status: SUCCEEDED
         tracking URL: http://gpu-1:8088/proxy/application_1430161715740_0001/A
```
The logs are now aggregated in hdfs:///tmp/logs/dong/logs:
```
$ yarn logs -applicationId application_1430161715740_0001 | grep Pi
15/04/27 12:15:26 INFO client.RMProxy: Connecting to ResourceManager at gpu-1/10.6.7.11:8032
Pi is roughly 3.139288
```