Submit Spark Applications as EMR Steps, at Cluster Creation or on a Running Cluster, and Then Terminate the Cluster - isgaur/AWS-BigData-Solutions GitHub Wiki

Spark on EMR: how to submit a Spark application as an EMR step, either when the cluster is created or by adding steps to a running cluster.

Examples

To demonstrate, I will walk you through a few examples using the AWS CLI. Replace <YOUR_EC2_KEY_NAME> and <YOUR_VPC_SUBNET_ID> with your EC2 key pair name and VPC subnet ID wherever they appear. The examples also contain other angle-bracket placeholders (such as <your-s3-bucket>) that you need to replace with your own values.

## Example 1: Launch a cluster with Spark

This command creates a running cluster with Spark installed and a custom bootstrap action.

```
aws emr create-cluster --name "spark_emr" --release-label emr-5.29.0 \
--use-default-roles --ec2-attributes KeyName=<YOUR_EC2_KEY_NAME> \
--applications Name=Hive Name=Spark \
--instance-type m5.xlarge --instance-count 3 \
--bootstrap-actions Path="s3://<your-s3-bucket>/<your-bootstrap-file-name>"
```

The command returns a cluster ID of the form j-#####. The cluster starts and then keeps running until you terminate it, so be sure to terminate it when you are done.
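Later commands need the cluster ID, and steps can only run once the cluster is up. The following is a minimal sketch, assuming the AWS CLI is already configured, of two small helpers that report a cluster's state and block until it is running. The function names `cluster_state` and `wait_until_running` are hypothetical and not part of the original examples.

```shell
# Hypothetical helper: print the current state of a cluster (for example STARTING or WAITING).
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}

# Hypothetical helper: block until the cluster is running, then print its state.
wait_until_running() {
  aws emr wait cluster-running --cluster-id "$1" && cluster_state "$1"
}
```

Appending `--query 'ClusterId' --output text` to the `create-cluster` command should print only the ID, which can then be stored in a variable and passed to `wait_until_running`.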

## Example 2: Add an EMR step to the running cluster that runs the SparkPi example through command-runner.jar and the spark-example script

```
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10]
```
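`add-steps` returns as soon as the step is queued, not when it finishes. As a rough sketch, the helpers below submit the same SparkPi step, capture the step ID, and wait for the step to reach a final state. The function names `submit_spark_pi` and `wait_for_step` are hypothetical; the `aws emr wait step-complete` and `aws emr describe-step` subcommands are part of the AWS CLI.

```shell
# Hypothetical helper: submit the SparkPi example as a step and print the new step ID.
submit_spark_pi() {
  aws emr add-steps --cluster-id "$1" \
    --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10] \
    --query 'StepIds[0]' --output text
}

# Hypothetical helper: block until the step finishes, then print its final state
# (for example COMPLETED or FAILED).
wait_for_step() {
  aws emr wait step-complete --cluster-id "$1" --step-id "$2"
  aws emr describe-step --cluster-id "$1" --step-id "$2" \
    --query 'Step.Status.State' --output text
}
```

A typical call sequence would be `STEP_ID=$(submit_spark_pi <YOUR_CLUSTER_ID>)` followed by `wait_for_step <YOUR_CLUSTER_ID> "$STEP_ID"`.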

## Example 3: Terminate the cluster

```
aws emr terminate-clusters --cluster-ids <YOUR_CLUSTER_ID>
```
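Termination is asynchronous, so a script that needs to know the cluster is really gone can wait for it. A minimal sketch, with `terminate_and_wait` as a hypothetical helper name:

```shell
# Hypothetical helper: request termination, then block until the cluster is terminated.
terminate_and_wait() {
  aws emr terminate-clusters --cluster-ids "$1" &&
    aws emr wait cluster-terminated --cluster-id "$1"
}
```

Note that `terminate-clusters` takes `--cluster-ids` (plural), while the `wait` subcommand takes `--cluster-id`.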

## Example 4: An all-in-one AWS CLI command that creates the cluster, runs a Spark application as a step, and then terminates the cluster

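```
aws emr create-cluster --name "Test cluster" --release-label emr-5.29.0 \
--use-default-roles --ec2-attributes KeyName=<YOUR_EC2_KEY_NAME> \
--applications Name=Hive Name=Spark \
--instance-type m3.2xlarge --instance-count 3 \
--bootstrap-actions Path="s3://<your-s3-bucket>/<your-bootstrap-file-name>" \
--steps Type=Spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/] \
--auto-terminate
```

The `--steps` value must be a single shell word, so the `Args` list is written on one line with no spaces after the commas. `spark.yarn.submit.waitAppCompletion` is left at its default of `true` so that the step stays active until the application finishes; if it were set to `false`, the step would complete right after submission and `--auto-terminate` could shut the cluster down while the job is still running. The executor settings are sized for the larger m3.2xlarge nodes, so reduce them if you pick a smaller instance type such as m5.xlarge.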
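The comma-separated `Args` list is easy to break with a stray space or line break. One possible approach, sketched below, is to build the step definition from ordinary shell arguments. The helper names `join_args` and `spark_step` are hypothetical, and the sketch assumes that no argument itself contains a comma or a space.

```shell
# Hypothetical helper: join its arguments with commas, as the Args=[...] shorthand expects.
# The body runs in a subshell (parentheses) so the IFS change does not leak to the caller.
join_args() (
  IFS=,
  printf '%s' "$*"
)

# Hypothetical helper: print a Spark step definition in AWS CLI shorthand syntax.
# $1 is the step name; the remaining arguments are passed to spark-submit.
spark_step() (
  name=$1
  shift
  printf 'Type=Spark,Name=%s,ActionOnFailure=CONTINUE,Args=[%s]' "$name" "$(join_args "$@")"
)
```

The result can be passed as one quoted word, for example `--steps "$(spark_step SparkWordCountApp --deploy-mode cluster s3://codelocation/wordcount.py)"`.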

For general guidance, see the AWS Big Data Blog post on submitting applications with spark-submit [1] and the Amazon EMR Release Guide page on adding a Spark step [2].

[1] https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/

[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
