EMR on EKS - suprakashn/aws-services-expl-rsrch Wiki

SUBMIT SPARK JOBS ON FARGATE CLUSTER

This section covers creating an EKS cluster on Fargate and submitting Spark jobs to that cluster through the EMR on EKS service. Containerization is covered in the next section.

Prerequisites:

Step 1: Create EKS Cluster

Use the sample template (create_EKS_cluster.yaml) to create the EKS cluster. The template creates the cluster with two Fargate profiles (associated with two Kubernetes namespaces). Run the following command from the CLI. Because the cluster is based on Fargate, no EC2 instances are running once it has been created.

eksctl create cluster -f create_EKS_cluster.yaml
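For readers without the repository at hand, a template of this kind might look roughly like the sketch below. This is not the contents of create_EKS_cluster.yaml: the helper name write_cluster_config, the cluster name, the region, and the profile names are all assumptions made for illustration, and only the two namespaces (default and kube-system) come from this page.

```shell
# Hypothetical helper: writes a minimal eksctl ClusterConfig to the given path.
# Cluster name, region, and profile names are illustrative assumptions,
# not the repository's actual template.
write_cluster_config() {
  cat > "$1" <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: demo-fargate-cluster
  region: us-east-2

# Two Fargate profiles, one per namespace; no node groups are declared.
fargateProfiles:
  - name: fp-default
    selectors:
      - namespace: default
  - name: fp-kube-system
    selectors:
      - namespace: kube-system
EOF
}
```

Calling write_cluster_config my_cluster.yaml and then eksctl create cluster -f my_cluster.yaml would follow the same flow as the command above.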

Step 2: Identity provider creation

Create an IAM OIDC provider for your cluster if one does not already exist. To check, run the following commands from the CLI.

aws eks describe-cluster --name "<cluster-name>" --query "cluster.identity.oidc.issuer" --output text

Output Example: https://oidc.eks.us-east-2.amazonaws.com/id/115D1C58230BE51D6DCCE7A879E94066

aws iam list-open-id-connect-providers | grep 115D1C58230BE51D6DCCE7A879E94066

If the command prints anything, the identity provider already exists. If it prints nothing, create the IAM OIDC provider with the next command.

eksctl utils associate-iam-oidc-provider --cluster "<cluster-name>" --approve
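The check above can be scripted so the ID does not have to be copied by hand. The following is a minimal sketch; the function names oidc_id_from_issuer and ensure_oidc_provider are made up for this example, and the second function assumes the aws and eksctl CLIs are installed with valid credentials.

```shell
# Print the trailing ID segment of an OIDC issuer URL.
oidc_id_from_issuer() {
  # ${1##*/} strips everything up to and including the last "/".
  printf '%s\n' "${1##*/}"
}

# Hypothetical wrapper: associates the OIDC provider only when it is missing.
ensure_oidc_provider() {
  cluster_name="$1"
  issuer=$(aws eks describe-cluster --name "$cluster_name" \
    --query "cluster.identity.oidc.issuer" --output text) || return 1
  oidc_id=$(oidc_id_from_issuer "$issuer")
  if aws iam list-open-id-connect-providers | grep -q "$oidc_id"; then
    echo "OIDC provider already exists for $cluster_name"
  else
    eksctl utils associate-iam-oidc-provider --cluster "$cluster_name" --approve
  fi
}
```

Usage would be ensure_oidc_provider "<cluster-name>".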

Step 3: Allow EMR on EKS access to the Fargate namespace

This step gives EMR on EKS access to a namespace covered by a Fargate profile. In this example the cluster has two such namespaces (default and kube-system).

eksctl create iamidentitymapping \
  --cluster <cluster-name> \
  --namespace <name-space> \
  --service-name "emr-containers"

Step 4: Establish trust between the namespace and IAM role

The IAM role used to execute jobs in EMR must have the namespace as a trusted entity. Establish that trust using the OIDC provider created earlier.

aws emr-containers update-role-trust-policy \
  --cluster-name <cluster-name> \
  --namespace <namespace> \
  --role-name <IAM-Role>

Step 5: Register EKS with EMR

Create the virtual cluster associated with the namespace. Note the virtual-cluster-id in the output; it is needed to submit jobs later.

aws emr-containers create-virtual-cluster \
  --name <virtual-cluster-name> \
  --container-provider '{
    "id": "<cluster-name>",
    "type": "EKS",
    "info": {
      "eksInfo": {
       "namespace": "<namespace>"
      }
    }
  }'
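To avoid hand-editing the JSON and to capture the virtual cluster ID directly, the call above could be wrapped as in the sketch below. The function names container_provider_json and create_virtual_cluster are hypothetical; the second one assumes a configured aws CLI and relies on the id field of the command's response.

```shell
# Build the --container-provider JSON for an EKS cluster and namespace.
container_provider_json() {
  printf '{"id":"%s","type":"EKS","info":{"eksInfo":{"namespace":"%s"}}}\n' "$1" "$2"
}

# Hypothetical wrapper: creates the virtual cluster and prints only its ID.
# Arguments: <virtual-cluster-name> <cluster-name> <namespace>
create_virtual_cluster() {
  aws emr-containers create-virtual-cluster \
    --name "$1" \
    --container-provider "$(container_provider_json "$2" "$3")" \
    --query "id" --output text
}
```

For example, virtual_cluster_id=$(create_virtual_cluster "<virtual-cluster-name>" "<cluster-name>" "<namespace>") keeps the ID in a variable for the job submission step.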

Step 6: Submit spark job

Submit the Spark job (with dynamic resource allocation) to the cluster.

aws emr-containers start-job-run \
  --virtual-cluster-id=<virtual-cluster-id> \
  --name=<job-name> \
  --execution-role-arn=<IAM-Role-ARN> \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "<script-name>",
      "sparkSubmitParameters": "<spark-conf-parameters>"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "<min executors>",
          "spark.dynamicAllocation.maxExecutors": "<max executors>",
          "spark.dynamicAllocation.initialExecutors": "<init executors>",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "<s3-log-path>"
      }
    }
  }'

Example:

aws emr-containers start-job-run \
  --virtual-cluster-id=jerdthnadi6lzionex9shyw71 \
  --name=2482-spark-on-eks-demo \
  --execution-role-arn=arn:aws:iam::076931226898:role/2482-misc-service-role \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://2482-bucket/code/sample.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "1",
          "spark.dynamicAllocation.maxExecutors": "10",
          "spark.dynamicAllocation.initialExecutors": "1",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://2482-bucket/logs/"
      }
    }
  }'
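start-job-run returns as soon as the job is accepted, so a follow-up check is usually needed to see whether it finished. A minimal polling sketch is shown here; the function names is_terminal_state and wait_for_job_run and the 30-second interval are assumptions, while describe-job-run and the jobRun.state field belong to the emr-containers CLI.

```shell
# Return 0 when the job run state is final, 1 while it is still in progress.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: polls a job run until it reaches a final state.
# Arguments: <virtual-cluster-id> <job-run-id>
# Exits with status 0 only if the job run ended as COMPLETED.
wait_for_job_run() {
  virtual_cluster_id="$1"
  job_run_id="$2"
  while :; do
    state=$(aws emr-containers describe-job-run \
      --virtual-cluster-id "$virtual_cluster_id" \
      --id "$job_run_id" \
      --query "jobRun.state" --output text) || return 1
    echo "job run $job_run_id: $state"
    if is_terminal_state "$state"; then
      break
    fi
    sleep 30
  done
  [ "$state" = "COMPLETED" ]
}
```

The job run ID is the id field printed by start-job-run.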

SUBMIT CONTAINERIZED SPARK JOBS ON FARGATE CLUSTER

This section covers containerizing Spark jobs and submitting them to the EKS cluster. Steps 1 through 5 of the previous section are still required. Creating the image, registering it with ECR, and submitting the Spark job are covered here.

Prerequisites:

Install Docker: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html

Docker can be installed on a local system or on an EC2 instance. I ran into version issues locally, so I used an EC2 instance for Docker and for the remaining steps described here.

Step 1: Pull the required base image

Building the container requires a base image. Use this link (https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-tag.html) to choose the base image for your region and EMR release. I used the us-east-2 base image for EMR version 6.3.0.

aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 711395599931.dkr.ecr.us-east-2.amazonaws.com
docker pull 711395599931.dkr.ecr.us-east-2.amazonaws.com/spark/emr-6.3.0:latest

Step 2: Build the container image

Create a Dockerfile to build the container image from. A basic Dockerfile is available in this repository for reference. The following command builds the final image from the base image and the Dockerfile in the current directory.

docker build -t <image-name> .
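As a rough idea of what such a Dockerfile might contain, the sketch below writes one that starts from the EMR base image, switches to root to add packages, and switches back to the hadoop user. It is not the repository's Dockerfile: the helper name write_dockerfile and the boto3 install line are assumptions chosen only to illustrate a customization.

```shell
# Hypothetical helper: writes a minimal Dockerfile on top of an EMR base image.
# Arguments: <base-image-uri> [output-path, defaults to ./Dockerfile]
write_dockerfile() {
  base_image="$1"
  out_file="${2:-Dockerfile}"
  cat > "$out_file" <<EOF
FROM ${base_image}
USER root
# Customizations go here; the package below is only an example.
RUN pip3 install --no-cache-dir boto3
USER hadoop:hadoop
EOF
}
```

For instance, write_dockerfile 711395599931.dkr.ecr.us-east-2.amazonaws.com/spark/emr-6.3.0:latest followed by the docker build command above would produce an image with the extra package installed.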

Step 3: Register and tag repository in ECR

aws ecr create-repository --repository-name <repo-name> --region us-east-2
docker tag <image-name> 076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>

Step 4: Log in and push the Docker image to ECR

aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 076931226898.dkr.ecr.us-east-2.amazonaws.com
docker push 076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>
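The same registry address appears in the tag, login, and push commands, so it can help to build it once. A small sketch follows; the function names ecr_image_uri and tag_and_push are hypothetical, the tag defaults to latest as an assumption, and tag_and_push requires Docker and a prior docker login.

```shell
# Build a full ECR image URI.
# Arguments: <account-id> <region> <repo-name> [tag, defaults to latest]
ecr_image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# Hypothetical wrapper: tags a local image with its ECR URI and pushes it.
# Arguments: <local-image-name> <account-id> <region> <repo-name> [tag]
tag_and_push() {
  image_uri=$(ecr_image_uri "$2" "$3" "$4" "${5:-latest}")
  docker tag "$1" "$image_uri" && docker push "$image_uri"
}
```

The printed URI is also the value to pass as spark.kubernetes.container.image when submitting the job.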

Step 5: Submit spark job within docker image

Submit the Spark job (with dynamic resource allocation) using the custom Docker image.

aws emr-containers start-job-run \
  --virtual-cluster-id=<virtual-cluster-id> \
  --name=<job-name> \
  --execution-role-arn=<IAM-Role-ARN> \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "<script-name>",
      "sparkSubmitParameters": "<spark-conf-parameters> --conf spark.kubernetes.container.image=076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "<min executors>",
          "spark.dynamicAllocation.maxExecutors": "<max executors>",
          "spark.dynamicAllocation.initialExecutors": "<init executors>",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "<s3-log-path>"
      }
    }
  }'

Example:

aws emr-containers start-job-run \
  --virtual-cluster-id=jerdthnadi6lzionex9shyw71 \
  --name=2482-spark-on-eks-demo \
  --execution-role-arn=arn:aws:iam::076931226898:role/2482-misc-service-role \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://2482-bucket/code/sample.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.kubernetes.container.image=076931226898.dkr.ecr.us-east-2.amazonaws.com/2482-spark-images"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "1",
          "spark.dynamicAllocation.maxExecutors": "10",
          "spark.dynamicAllocation.initialExecutors": "1",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://2482-bucket/logs/"
      }
    }
  }'
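The sparkSubmitParameters value must be a single-line string, which gets hard to read as more --conf entries are added. One way to assemble it, sketched here with a made-up helper named spark_conf_args, is to pass each key=value pair as a separate argument and join them.

```shell
# Join key=value pairs into a single-line "--conf k=v --conf k2=v2" string.
spark_conf_args() {
  args=""
  for kv in "$@"; do
    # Add a separating space only when args already has content.
    args="${args:+$args }--conf $kv"
  done
  printf '%s\n' "$args"
}
```

For example, spark_conf_args spark.executor.instances=1 spark.executor.memory=1G spark.kubernetes.container.image=076931226898.dkr.ecr.us-east-2.amazonaws.com/2482-spark-images produces a string that can be placed in the job driver JSON.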