04. Application Development Examples - aliyun/MaxCompute-Spark GitHub Wiki
## Quick navigation

- WordCount
- GraphX PageRank
- MLlib KMeans on OSS
- OSS unstructured data
- JindoFS
- MaxCompute table read/write: Java/Scala example
- MaxCompute table read/write: PySpark example
- OSS PySpark example
- Spark Streaming LogHub support
- Spark Streaming DataHub support
- Spark Streaming Kafka support
---

Refer to the downloadable template project. It provides sample code for both spark-1.x and spark-2.x; the submission steps and code interfaces are largely the same across the two. Since 2.x is the recommended mainstream version, the examples below use spark-2.x.
## WordCount

For the detailed code, see the sample project. To build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.WordCount \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
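For orientation, the core of a Spark 2.x WordCount looks roughly like the following. This is a minimal sketch, not the shipped example verbatim: the input lines are inlined here, whereas the real example wires up its own data source.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // Inlined sample data purely for illustration.
    val counts = sc.parallelize(Seq("hello spark", "hello maxcompute"))
      .flatMap(_.split(" "))   // split each line into words
      .map((_, 1))             // pair every word with a count of 1
      .reduceByKey(_ + _)      // sum the counts per word

    counts.collect().foreach(println)
    spark.stop()
  }
}
```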
## GraphX PageRank

For the detailed code, see the sample project. To build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.graphx.PageRank \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
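The GraphX PageRank API itself can be sketched as follows. This is a minimal example with a hand-built three-node cycle, not the shipped demo's actual input.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PageRank").getOrCreate()
    val sc = spark.sparkContext

    // A tiny hand-built graph: 1 -> 2, 2 -> 3, 3 -> 1
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(3L, 1L, 1.0)))
    val graph = Graph.fromEdges(edges, defaultValue = 1.0)

    // Iterate until per-vertex ranks change by less than the tolerance.
    val ranks = graph.pageRank(tol = 0.0001).vertices
    ranks.collect().foreach(println)
    spark.stop()
  }
}
```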
## MLlib KMeans on OSS

The job requires the following OSS properties:

```
spark.hadoop.fs.oss.ststoken.roleArn
spark.hadoop.fs.oss.endpoint
```

For how to fill these in, see the OSS StsToken authorization documentation.
For the detailed code, see the sample project. First edit the code:

```scala
val modelOssDir = "oss://bucket/kmeans-model" // fill in your actual OSS bucket path
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.credentials.provider", "org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider")
  .config("spark.hadoop.fs.oss.ststoken.roleArn", "acs:ram::****:role/aliyunodpsdefaultrole")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("KmeansModelSaveToOss")
  .getOrCreate()
```

Then build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
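With the SparkSession configured as above, training a KMeans model and persisting it under `modelOssDir` boils down to roughly this sketch; the 2-D feature points below are made up for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Made-up 2-D points, two obvious clusters, purely for illustration.
val points = spark.sparkContext.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// Train k = 2 clusters for up to 20 iterations.
val model = KMeans.train(points, k = 2, maxIterations = 20)

// Persist the model under the OSS directory configured above.
model.save(spark.sparkContext, modelOssDir)
```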
## OSS unstructured data

The job requires the following OSS properties:

```
spark.hadoop.fs.oss.ststoken.roleArn
spark.hadoop.fs.oss.endpoint
```

For how to fill these in, see the OSS StsToken authorization documentation.
For the detailed code, see the sample project. First edit the code:

```scala
val pathIn = "oss://bucket/inputdata/" // fill in your actual OSS bucket path
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.credentials.provider", "org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider")
  .config("spark.hadoop.fs.oss.ststoken.roleArn", "acs:ram::****:role/aliyunodpsdefaultrole")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("SparkUnstructuredDataCompute")
  .getOrCreate()
```

Then build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
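Once the OSS credentials are configured, reading the unstructured data is plain RDD access against the `oss://` path. A minimal sketch, continuing from the `pathIn` and `spark` defined above:

```scala
// pathIn is the OSS input directory configured above.
val inputData = spark.sparkContext.textFile(pathIn)

// For example, count the lines and print a small sample.
println(s"line count: ${inputData.count()}")
inputData.take(10).foreach(println)
```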
## JindoFS

The job requires the following properties:

```
spark.hadoop.fs.AbstractFileSystem.oss.impl
spark.hadoop.fs.oss.impl
spark.hadoop.fs.jfs.cache.oss.credentials.provider
spark.hadoop.aliyun.oss.provider.url
```

For how to fill these in, see the Jindo SDK integration guide.
For the detailed code, see the sample project. To build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
# and https://github.com/aliyun/MaxCompute-Spark/wiki/08.-Jindo-sdk%E6%8E%A5%E5%85%A5%E8%AF%B4%E6%98%8E
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.oss.JindoFsDemo \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar \
    ${aliyun-uid} ${role} ${oss-bucket} ${oss-path}
```
## MaxCompute table read/write: Java/Scala example

For the detailed code, see the sample project. To build and submit:

```shell
cd /path/to/MaxCompute-Spark/spark-2.x
mvn clean package

# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.sparksql.SparkSQL \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
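Reading and writing MaxCompute tables goes through the ordinary Spark SQL surface. A minimal sketch; the table name `mc_test_table` and its schema are placeholders for illustration, not taken from the shipped example:

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQL").getOrCreate()

    // `mc_test_table` is a placeholder table name.
    spark.sql("CREATE TABLE IF NOT EXISTS mc_test_table (name STRING, num BIGINT)")
    spark.sql("SELECT * FROM mc_test_table LIMIT 10").show()

    spark.stop()
  }
}
```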
## MaxCompute table read/write: PySpark example

For the detailed code, see the sample project. To submit:

```shell
# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --jars /path/to/odps-spark-datasource_2.11-3.3.8-public.jar \
    /path/to/MaxCompute-Spark/spark-2.x/src/main/python/spark_sql.py
```
## OSS PySpark example

For the detailed code, see the sample project. To submit:

```shell
# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
# For the OSS-related configuration, see: https://github.com/aliyun/MaxCompute-Spark/wiki/08.-Oss-Access%E6%96%87%E6%A1%A3%E8%AF%B4%E6%98%8E
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --jars /path/to/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar \
    /path/to/MaxCompute-Spark/spark-2.x/src/main/python/spark_oss.py
```

The spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar passed via `--jars` can be built from https://github.com/aliyun/MaxCompute-Spark/tree/master/spark-2.x; it is needed mainly for the OSS-related dependencies it bundles.
## Spark Streaming LogHub support

For the detailed code, see the sample project. To submit:

```shell
# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.streaming.loghub.LogHubStreamingDemo \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
## Spark Streaming DataHub support

For the detailed code, see the sample project. To submit:

```shell
# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.streaming.datahub.DataHubStreamingDemo \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
## Spark Streaming Kafka support

For the detailed code, see the sample project. To submit:

```shell
# Environment variables, spark-defaults.conf, etc. should already be configured per:
# https://github.com/aliyun/MaxCompute-Spark/wiki/02.-%E4%BD%BF%E7%94%A8Spark%E5%AE%A2%E6%88%B7%E7%AB%AF%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1(Yarn-Cluster%E6%A8%A1%E5%BC%8F)
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class \
    com.aliyun.odps.spark.examples.streaming.kafka.KafkaStreamingDemo \
    /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
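For reference, the stock spark-streaming-kafka-0-10 direct stream (independent of this demo's specifics) is wired up as below; the broker address, group id, and topic name are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaStreamingDemo")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Broker address, group id, and topic are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    // Print the message payloads of each batch.
    stream.map(_.value()).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```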