
#Official Website Examples#

These examples give a quick overview of the Spark API. Spark is built on the concept of distributed datasets, which contain arbitrary Java or Python objects. You create a dataset from external data, then apply parallel operations to it. The building block of the Spark API is its RDD API. In the RDD API, there are two types of operations: transformations, which define a new dataset based on previous ones, and actions, which kick off a job to execute on a cluster. On top of Spark's RDD API, high-level APIs are provided, e.g. the DataFrame API and the Machine Learning API. These high-level APIs provide a concise way to conduct certain data operations. On this page, we will show examples using the RDD API as well as examples using the high-level APIs.
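To make the transformation/action distinction concrete, here is a minimal sketch (not taken from the page itself; the object name `RddQuickExample` and the sample data are made up) that builds an RDD, applies lazy transformations, and then triggers a job with actions:

```scala
import org.apache.spark.sql.SparkSession

object RddQuickExample {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; the app name is arbitrary.
    val spark = SparkSession.builder
      .appName("rdd-quick-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from a local collection.
    val data = sc.parallelize(1 to 100)

    // Transformations are lazy: they only define new datasets.
    val squares = data.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // Actions kick off a job on the cluster and return results.
    println(s"count = ${evens.count()}")
    println(s"first five = ${evens.take(5).mkString(", ")}")

    spark.stop()
  }
}
```

Nothing runs until `count()` or `take()` is called; `map` and `filter` only record the lineage of the new datasets.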

##RDD API Examples##

###Master URL Parameters###

| Master URL | Meaning |
|---|---|
| `local` | Run Spark locally with one worker thread (the default). |
| `local[K]` | Run Spark locally with K worker threads. |
| `local[*]` | Run Spark locally with as many worker threads as the machine has CPU cores. |
| `spark://HOST:PORT` | Connect to the master of the given Spark standalone cluster. The port must be the one the master is configured to use, 7077 by default, e.g. `spark://10.10.10.10:7077`. |
| `mesos://HOST:PORT` | Connect to the given Mesos cluster; HOST is the hostname of the Mesos master. The port must be the one the master is configured to use, 5050 by default. |
| `yarn-client` | Connect to a YARN cluster in client mode; the cluster location is taken from the `HADOOP_CONF_DIR` environment variable. |
| `yarn-cluster` | Connect to a YARN cluster in cluster mode; the cluster location is likewise taken from `HADOOP_CONF_DIR`. |
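These values are passed to Spark as the master URL. As a sketch (the object name `MasterUrlDemo` and the jar name `app.jar` are hypothetical), this shows the two usual places the URL plugs in, `SparkConf.setMaster` in code or `--master` on the `spark-submit` command line:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same value could instead be supplied on the command line:
//   spark-submit --master spark://10.10.10.10:7077 --class MasterUrlDemo app.jar
object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("master-url-demo")
      // Any value from the table above works here; local[*] is handy for development.
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    println(s"Running with master = ${sc.master}")
    sc.stop()
  }
}
```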

##source##