EMR on EKS - suprakashn/aws-services-expl-rsrch GitHub Wiki
This section covers creating an EKS cluster on Fargate and submitting Spark jobs to it using the EMR on EKS service. Containerization is covered in the next section.
- Install the AWS CLI and configure it with your AWS keys so that your system can interact with AWS.
- Install eksctl and kubectl. See https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html and https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html for instructions.
Use the sample template (create_EKS_cluster.yaml) to create the EKS cluster. The template creates the cluster with two Fargate profiles (associated with two Kubernetes namespaces). Execute the following command from the CLI. Since the cluster runs on Fargate, no EC2 instances will be running after it is created.
eksctl create cluster -f create_EKS_cluster.yaml
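The template in this repository is the source of truth; as a rough sketch, a Fargate-based eksctl config along these lines (cluster name and region here are placeholders) would produce a cluster like the one described:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-eks-cluster        # placeholder cluster name
  region: us-east-2

fargateProfiles:
  - name: fp-default
    selectors:
      - namespace: default    # pods in this namespace run on Fargate
  - name: fp-kube-system
    selectors:
      - namespace: kube-system
```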
Create an IAM OIDC provider for your cluster if one does not already exist. To check, execute the following commands from the CLI.
aws eks describe-cluster --name "<cluster-name>" --query "cluster.identity.oidc.issuer" --output text
Output Example: oidc.eks.us-east-2.amazonaws.com/id/115D1C58230BE51D6DCCE7A879E94066
aws iam list-open-id-connect-providers | grep 115D1C58230BE51D6DCCE7A879E94066
If there is any output, the identity provider already exists. If there is no result, create the IAM OIDC provider with the next command.
eksctl utils associate-iam-oidc-provider --cluster "<cluster-name>" --approve
This step grants EMR on EKS access to the namespace created within the Fargate profile. In this example, the cluster has two namespaces (default and kube-system).
eksctl create iamidentitymapping \
  --cluster <cluster-name> \
  --namespace <name-space> \
  --service-name "emr-containers"
The IAM role used to execute the EMR jobs must have the namespace as a trusted entity. Establish that trust using the OIDC provider created earlier:
aws emr-containers update-role-trust-policy \
  --cluster-name <cluster-name> \
  --namespace <namespace> \
  --role-name <IAM-Role>
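After this update, the role's trust policy gains a statement roughly like the following (the account ID, OIDC provider ID, and namespace below are placeholders; the exact service-account pattern is managed by EMR on EKS):

```json
{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/<oidc-id>"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringLike": {
      "oidc.eks.us-east-2.amazonaws.com/id/<oidc-id>:sub": "system:serviceaccount:<namespace>:emr-containers-sa-*-*-<account-id>-*"
    }
  }
}
```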
Create the virtual cluster associated with the namespace. Keep a note of the virtual-cluster-id; it is needed to submit any jobs in the future.
aws emr-containers create-virtual-cluster \
  --name <virtual-cluster-name> \
  --container-provider '{
    "id": "<cluster-name>",
    "type": "EKS",
    "info": {
      "eksInfo": {
        "namespace": "<namespace>"
      }
    }
  }'
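The command returns a small JSON document; the `id` field is the virtual-cluster-id to record. Illustrative output, reusing the sample ID from the examples below:

```json
{
  "id": "jerdthnadi6lzionex9shyw71",
  "name": "<virtual-cluster-name>",
  "arn": "arn:aws:emr-containers:us-east-2:<account-id>:/virtualclusters/jerdthnadi6lzionex9shyw71"
}
```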
Submit the Spark job (with dynamic resource allocation) to the cluster:
aws emr-containers start-job-run \
  --virtual-cluster-id=<virtual-cluster-id> \
  --name=<job-name> \
  --execution-role-arn=<IAM-Role> \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "<script-name>",
      "sparkSubmitParameters": "<spark-conf-parameters>"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "<min executors>",
          "spark.dynamicAllocation.maxExecutors": "<max executors>",
          "spark.dynamicAllocation.initialExecutors": "<init executors>",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "<s3-log-path>"
      }
    }
  }'
Example:
aws emr-containers start-job-run \
  --virtual-cluster-id=jerdthnadi6lzionex9shyw71 \
  --name=2482-spark-on-eks-demo \
  --execution-role-arn=arn:aws:iam::076931226898:role/2482-misc-service-role \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://2482-bucket/code/sample.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "1",
          "spark.dynamicAllocation.maxExecutors": "10",
          "spark.dynamicAllocation.initialExecutors": "1",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://2482-bucket/logs/"
      }
    }
  }'
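The same submission can also be scripted. A minimal sketch with boto3's `emr-containers` client, mirroring the CLI example above (the sample IDs and ARNs are the ones from this page; the actual API call is left commented out since it requires AWS credentials):

```python
import json

# Request payload for the StartJobRun API, mirroring the CLI example above.
# Replace the sample IDs/ARNs/paths with your own values.
request = {
    "virtualClusterId": "jerdthnadi6lzionex9shyw71",
    "name": "2482-spark-on-eks-demo",
    "executionRoleArn": "arn:aws:iam::076931226898:role/2482-misc-service-role",
    "releaseLabel": "emr-6.3.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://2482-bucket/code/sample.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.instances=1 "
                "--conf spark.executor.memory=1G "
                "--conf spark.executor.cores=1 "
                "--conf spark.driver.cores=1"
            ),
        }
    },
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.dynamicAllocation.enabled": "true",
                    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
                    "spark.dynamicAllocation.minExecutors": "1",
                    "spark.dynamicAllocation.maxExecutors": "10",
                },
            }
        ],
        "monitoringConfiguration": {
            "persistentAppUI": "ENABLED",
            "s3MonitoringConfiguration": {"logUri": "s3://2482-bucket/logs/"},
        },
    },
}

# With credentials configured, the submission itself would be:
#   import boto3
#   boto3.client("emr-containers", region_name="us-east-2").start_job_run(**request)
print(json.dumps(request["jobDriver"], indent=2))
```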
This section covers containerizing the Spark jobs and submitting them to the EKS cluster. All the steps up to Step 5 of the previous section are still required. Creating the image, registering it with ECR, and submitting the Spark job are covered here.
Install Docker: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html
Installation can be done on a local system or an EC2 instance. I ran into version issues locally, so I used an EC2 instance for Docker and the remaining steps described here.
To build the container, a base image has to be used. Use this link (https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-tag.html) to choose the base image for your Region and EMR release. I used the base image for us-east-2 and EMR version 6.3.0.
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 711395599931.dkr.ecr.us-east-2.amazonaws.com
docker pull 711395599931.dkr.ecr.us-east-2.amazonaws.com/spark/emr-6.3.0:latest
Create a Dockerfile to build the container image. A basic Dockerfile is available in this repository for reference. The following command creates the final image from the base image and the Dockerfile.
docker build -t <image-name> .
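For reference, a minimal Dockerfile along these lines works with the base image pulled above (the `pip3 install` line is illustrative; add whatever dependencies your job needs):

```dockerfile
# Base image: EMR 6.3.0 Spark image for us-east-2 (pulled in the previous step)
FROM 711395599931.dkr.ecr.us-east-2.amazonaws.com/spark/emr-6.3.0:latest

USER root
# Install extra dependencies the job needs (illustrative example)
RUN pip3 install --no-cache-dir boto3

# EMR on EKS requires the image to run as the hadoop user
USER hadoop:hadoop
```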
aws ecr create-repository --repository-name <repo-name> --region us-east-2
docker tag <image-name> 076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 076931226898.dkr.ecr.us-east-2.amazonaws.com
docker push 076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>
Submit the Spark job (with dynamic resource allocation) using the Docker image:
aws emr-containers start-job-run \
  --virtual-cluster-id=<virtual-cluster-id> \
  --name=<job-name> \
  --execution-role-arn=<IAM-Role> \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "<script-name>",
      "sparkSubmitParameters": "<spark-conf-parameters> --conf spark.kubernetes.container.image=076931226898.dkr.ecr.us-east-2.amazonaws.com/<repo-name>"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "<min executors>",
          "spark.dynamicAllocation.maxExecutors": "<max executors>",
          "spark.dynamicAllocation.initialExecutors": "<init executors>",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "<s3-log-path>"
      }
    }
  }'
Example:
aws emr-containers start-job-run \
  --virtual-cluster-id=jerdthnadi6lzionex9shyw71 \
  --name=2482-spark-on-eks-demo \
  --execution-role-arn=arn:aws:iam::076931226898:role/2482-misc-service-role \
  --release-label=emr-6.3.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://2482-bucket/code/sample.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.kubernetes.container.image=076931226898.dkr.ecr.us-east-2.amazonaws.com/2482-spark-images"
    }
  }' \
  --configuration-overrides='{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "1",
          "spark.dynamicAllocation.maxExecutors": "10",
          "spark.dynamicAllocation.initialExecutors": "1",
          "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
          "spark.dynamicAllocation.executorIdleTimeout": "5s"
        }
      }
    ],
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://2482-bucket/logs/"
      }
    }
  }'