Submit Spark Applications as Steps to a Running AWS EMR Cluster or at Cluster Creation, Then Terminate the Cluster - isgaur/AWS-BigData-Solutions GitHub Wiki
Spark on EMR: how to submit a Spark application as an EMR step when creating a cluster, and how to add steps to a cluster that is already running.
To demonstrate, I will walk you through a few examples using the AWS CLI. Replace `<YOUR_EC2_KEY_NAME>` and `<YOUR_VPC_SUBNET_ID>` with your EC2 key pair name and VPC subnet ID. Several other `<YOUR_VALUE_HERE>`-style placeholders in these examples also need to be replaced with your own values.
## Example 1: Launch a cluster with Spark
This command creates a running cluster with Spark installed and a custom bootstrap action.
```
aws emr create-cluster --name "spark_emr" --release-label emr-5.29.0 \
--use-default-roles \
--ec2-attributes KeyName=<YOUR_EC2_KEY_NAME>,SubnetId=<YOUR_VPC_SUBNET_ID> \
--applications Name=Hive Name=Spark \
--instance-type m5.xlarge --instance-count 3 \
--bootstrap-actions Path="s3://<your-s3-bucket>/<your-bootstrap-file-name>"
```
The command returns a cluster ID of the form j-#####. Because no auto-termination is requested, the cluster starts up and then waits until you terminate it, so be sure to terminate the cluster when you are done.
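Since the cluster takes a few minutes to start, it can help to block until it is ready before adding steps. The following is a minimal sketch, assuming the cluster ID returned above is passed as the first argument; the function name `wait_until_ready` is hypothetical and not part of the AWS CLI.

```shell
# Hypothetical helper: wait for an EMR cluster to become usable, then print its state.
wait_until_ready() {
  local cluster_id="$1"
  # The cluster-running waiter polls describe-cluster until the state is
  # RUNNING or WAITING, and fails if the cluster terminates first.
  aws emr wait cluster-running --cluster-id "$cluster_id" || return 1
  # Print the current state (for example WAITING once the cluster is idle).
  aws emr describe-cluster --cluster-id "$cluster_id" \
    --query 'Cluster.Status.State' --output text
}
```

Usage would look like `wait_until_ready j-2AXXXXXXGAPLF`, with your own cluster ID substituted.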
## Example 2: Add an EMR step to the running cluster that runs the SparkPi example through command-runner.jar and the spark-example script
```
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10]
```
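`add-steps` returns immediately with the new step's ID, so the result of the Spark job has to be checked separately. One possible way to submit a step and wait for its outcome is sketched below; the function name `run_step_and_wait` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: submit one step, wait for it to finish, print its final state.
run_step_and_wait() {
  local cluster_id="$1" steps="$2" step_id
  # add-steps returns a list of step IDs; take the first (and only) one.
  step_id=$(aws emr add-steps --cluster-id "$cluster_id" --steps "$steps" \
    --query 'StepIds[0]' --output text) || return 1
  # The step-complete waiter returns non-zero if the step fails or is cancelled.
  aws emr wait step-complete --cluster-id "$cluster_id" --step-id "$step_id"
  # Report the final state, for example COMPLETED or FAILED.
  aws emr describe-step --cluster-id "$cluster_id" --step-id "$step_id" \
    --query 'Step.Status.State' --output text
}
```

For the SparkPi step above, the call would be `run_step_and_wait j-2AXXXXXXGAPLF 'Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10]'`.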
## Example 3: Terminate the cluster
```
aws emr terminate-clusters --cluster-ids <YOUR_CLUSTER_ID>
```
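Termination is asynchronous, and a cluster left running keeps incurring charges, so it may be worth confirming that the shutdown actually completed. A minimal sketch follows, assuming a single cluster ID; the function name `terminate_and_confirm` is hypothetical.

```shell
# Hypothetical helper: terminate a cluster and confirm that it has shut down.
terminate_and_confirm() {
  local cluster_id="$1"
  # Request termination; this call returns before the cluster is actually gone.
  aws emr terminate-clusters --cluster-ids "$cluster_id" || return 1
  # Block until the cluster reaches the TERMINATED state.
  aws emr wait cluster-terminated --cluster-id "$cluster_id" || return 1
  # Print the final state as a sanity check.
  aws emr describe-cluster --cluster-id "$cluster_id" \
    --query 'Cluster.Status.State' --output text
}
```

Running `aws emr list-clusters --active` afterwards is another way to check that nothing is still up.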
## Example 4: All-in-one AWS CLI command that creates the cluster, runs a Spark application as a step, and then terminates the cluster
```
# The --steps value is one shell argument: no spaces or line breaks inside Args=[...].
# waitAppCompletion=true keeps the step running until the application finishes,
# so --auto-terminate does not shut the cluster down while the job is still running.
# The executor settings are illustrative and must fit the chosen instance type.
aws emr create-cluster --name "Test cluster" --release-label emr-5.29.0 \
--use-default-roles \
--ec2-attributes KeyName=<YOUR_EC2_KEY_NAME>,SubnetId=<YOUR_VPC_SUBNET_ID> \
--applications Name=Hive Name=Spark \
--instance-type m5.xlarge --instance-count 3 \
--bootstrap-actions Path="s3://<your-s3-bucket>/<your-bootstrap-file-name>" \
--steps Type=Spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/] \
--auto-terminate
```
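The `Args=[...]` list in the shorthand `--steps` syntax is comma-separated with no whitespace, which gets hard to read as the argument list grows. One way to keep it manageable is to assemble the list from ordinary shell words, as in the sketch below. The helper `join_args` and the variable names are illustrative assumptions; the S3 paths are the example values used above.

```shell
# Hypothetical helper: join all arguments with commas (no spaces),
# matching the shorthand Args=[a,b,c] format expected by --steps.
join_args() {
  local IFS=,
  printf '%s' "$*"
}

# Build the spark-submit arguments one word at a time.
SPARK_ARGS=$(join_args --deploy-mode cluster --master yarn \
  s3://codelocation/wordcount.py s3://inputbucket/input.txt s3://outputbucket/)

# Full step definition, ready to pass as: --steps "$STEP_DEF"
STEP_DEF="Type=Spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,Args=[${SPARK_ARGS}]"
echo "$STEP_DEF"
```

The resulting string can be passed to either `aws emr create-cluster` or `aws emr add-steps` as `--steps "$STEP_DEF"`.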
For general guidance, see the AWS Big Data Blog post on submitting Spark applications with spark-submit [1] and the EMR Release Guide page on adding a Spark step [2].

[1] https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/

[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html