AWS EMR

Cluster - a collection of EC2 nodes

  • Master node
  • Core nodes (host HDFS + run tasks)
  • Task nodes (compute nodes, don't host HDFS, only run tasks)

Submitting work to a cluster -

  • Transient cluster - the entire job is defined as steps when the cluster is created; the cluster terminates once the steps complete
  • Long-running cluster
    • EMR console, EMR API, AWS CLI
    • SSH into the master node (and other cluster nodes as required) and submit jobs using the installed software
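
A step can be submitted to an existing long-running cluster with the AWS SDK. A minimal boto3 sketch; the cluster ID, bucket, and script path are placeholders:

```python
# Submit a Spark step to an existing cluster via boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # ID of a long-running cluster (placeholder)
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/scripts/job.py",  # hypothetical job script
                ],
            },
        }
    ],
)
print(response["StepIds"])
```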

Cluster lifecycle -

  1. EMR provisions EC2 instances per the specification, using the default AMI or a custom AMI if provided
  2. EMR runs the bootstrap actions specified on each instance; these can be used to install custom applications
  3. EMR installs the native EMR application stack
  4. The cluster is now in the RUNNING state. It executes the steps specified at cluster creation; users can connect to the cluster and submit additional steps.
  5. The cluster is now in the WAITING state. If configured for auto-termination, the cluster terminates after the last step completes.
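
A minimal boto3 sketch of this lifecycle: create a cluster with a bootstrap action (step 2) and steps (step 4) that auto-terminates after the last step completes (step 5). Names, paths, and the release label are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-transient-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False => auto-terminate once the last step finishes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    BootstrapActions=[
        {
            "Name": "install-custom-apps",
            # Hypothetical script that installs custom applications
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install.sh"},
        }
    ],
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```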

Storage -

  1. HDFS on core nodes - ephemeral storage that is lost when the cluster terminates; suited for intermediate data
  2. EMRFS on S3 - permanent storage for input and output data
  3. Local filesystem on disks attached to cluster nodes (instance store) - ephemeral storage
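
A minimal PySpark sketch of this storage split: input and output live on EMRFS (S3), intermediate data goes to ephemeral HDFS. Bucket names and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-example").getOrCreate()

df = spark.read.json("s3://my-input-bucket/events/")       # EMRFS: permanent input
df.write.mode("overwrite").parquet("hdfs:///tmp/staged/")  # HDFS: intermediate data

staged = spark.read.parquet("hdfs:///tmp/staged/")
staged.write.mode("overwrite").parquet("s3://my-output-bucket/results/")  # EMRFS: output
```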

Cluster resource management -

  • YARN is used for cluster resource management
  • EMR also installs an agent on nodes to manage YARN components
  • The master and core nodes should not run on Spot instances; EMR uses YARN node labels so that application master processes are placed only on core nodes rather than on (potentially Spot) task nodes

Amazon EMR Studio

  • Fully managed, Jupyter-based notebook environment that runs against an EMR cluster
  • SSO to EMR Studio with an IdP configured in AWS
  • EMR Studio can access resources only in the specified VPC and subnets
  • Launch clusters and submit jobs using Python, PySpark, Spark Scala, or Spark SQL
  • Track and debug jobs using the Spark UI/History Server, Tez UI, and YARN Timeline Server
  • Persist and share notebooks
  • Scheduled execution of parameterized notebooks using Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA)

Amazon EMR Notebook

  • Jupyter Notebook and JupyterLab interface to Amazon EMR
  • Create a Notebook from within EMR console

Plan and Configure cluster

  • Launch the EMR cluster in the same region as the S3 data it processes; this avoids cross-region data transfer charges.
  • The SSH key pair used to connect to the cluster should be created in the same region as the cluster itself

Getting data into EMR cluster

  • S3 - EMR can read input data from S3 and write logs and output data back to it.
  • EMR can be configured for multipart upload to S3
  • When multipart upload is enabled, also enable deletion of failed multipart uploads after a configured interval
  • Distributed Cache - a Hadoop feature that caches files on cluster nodes; can be configured at cluster creation time
  • DynamoDB - EMR has built-in support for importing data from and exporting data to DynamoDB
  • Direct Connect - connect on-premises infrastructure directly to AWS for secure, faster data transfer
  • Import/Export - use Snowball or Snowmobile to transfer large amounts of data into AWS
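
A common pattern for getting S3 data into cluster HDFS is the S3DistCp utility that ships with EMR, submitted as a step. A minimal boto3 sketch; the cluster ID and bucket are placeholders:

```python
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "copy-input-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-input-bucket/raw/",
                    "--dest", "hdfs:///input/",
                ],
            },
        }
    ],
)
```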

Security Configuration

  1. Data encryption
  2. Kerberos authentication
  3. S3 authorization for EMRFS

Once a security configuration has been created, it can be associated with any number of EMR clusters.
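
A minimal boto3 sketch of creating such a configuration, covering the three areas above. The KMS key ARNs and certificate archive location are placeholders; see the EMR docs for the full JSON schema.

```python
import boto3, json

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",              # S3 data encryption
                "AwsKmsKey": "arn:aws:kms:...:key/...",   # placeholder ARN
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",    # local data encryption
                "AwsKmsKey": "arn:aws:kms:...:key/...",   # placeholder ARN
                "EnableEbsEncryption": True,              # also encrypt the EBS root device
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-bucket/certs/my-certs.zip",  # placeholder
            }
        },
    }
}

emr.create_security_configuration(
    Name="example-security-config",
    SecurityConfiguration=json.dumps(security_config),
)
```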

S3 Data Encryption

  1. SSE-S3
  2. SSE-KMS
  3. CSE-KMS
  4. CSE-Custom (Specify S3 bucket that holds custom key provider jar file)

Local Data Encryption

  1. AWS KMS - with this option, encryption of the EBS root device and storage volumes can be enabled.
  2. Custom (Specify S3 bucket that holds custom key provider jar file)

In transit encryption

  1. PEM (Specify the S3 bucket that holds an archive containing the PEM certificate and certificate chain)
  2. Custom (Specify S3 bucket that holds custom key provider jar file)

Instance Store data encryption

  • For instance types that use NVMe-SSD based instance store, NVMe encryption is used
  • For other instance types, LUKS encryption is used

EBS volume encryption

  • EBS encryption - this option encrypts the EBS root device and attached EBS storage volumes. This is the recommended option.
  • LUKS encryption - this option encrypts only the attached EBS storage volumes, not the EBS root device

Access Control

  • Permissions for EMR actions on clusters and notebooks can be fine-tuned using tag-based access control with identity-based IAM policies
  • In EMR, condition keys usable in a policy's Condition element apply only to those EMR APIs where ClusterID or NotebookID is a mandatory attribute
  • Use the elasticmapreduce:ResourceTag/TagKeyString condition context key to allow or deny user actions on clusters or notebooks with tags that have the TagKeyString that you specify. If an action passes both ClusterID and NotebookID, the condition applies to both the cluster and the notebook.
  • Use the elasticmapreduce:RequestTag/TagKeyString condition context key to require a specific tag with actions/API calls. For example, you can use this condition context key along with the CreateEditor action to require that a key with TagKeyString is applied to a notebook when it is created.
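
As an illustration, a hypothetical identity-based policy (expressed here as a Python dict for readability; attach it to IAM as JSON) that allows actions only on clusters tagged department=analytics:

```python
# Tag-based access control sketch: the tag key/value and action list are
# illustrative choices, not a prescribed policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:TerminateJobFlows",
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    # Only clusters carrying this tag match
                    "elasticmapreduce:ResourceTag/department": "analytics"
                }
            },
        }
    ],
}
```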

Improving Spark Performance With Amazon S3

Use S3 Select with Spark

  • S3 Select allows applications to retrieve only a subset of data from an object. For Amazon EMR, the computational work of filtering large data sets for processing is "pushed down" from the cluster to Amazon S3, which can improve performance in some applications and reduce the amount of data transferred between Amazon EMR and Amazon S3.
  • S3 Select is supported with CSV and JSON files using s3selectCSV and s3selectJSON values to specify the data format
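
A minimal PySpark sketch, assuming a placeholder bucket, path, and schema; `s3selectJSON` works the same way for JSON input:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3select-example").getOrCreate()

# Read CSV through S3 Select; filtering is pushed down to S3.
df = (
    spark.read.format("s3selectCSV")
    .schema("id INT, name STRING, amount DOUBLE")  # supplying a schema is recommended
    .load("s3://my-bucket/data/*.csv")
)
df.filter(df.amount > 100).show()
```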

Use EMRFS S3-optimized Committer

  • The EMRFS S3-optimized committer is an alternative OutputCommitter implementation that is optimized for writing files to Amazon S3 when using EMRFS.
  • The committer is used when
    • Spark SQL, DataFrames, or Datasets API is used to write Parquet files
    • Multipart upload is enabled in EMR
  • The committer is not used under the following circumstances:
    • When writing to HDFS
    • When using the S3A file system
    • When using an output format other than Parquet, such as ORC or text
    • When using MapReduce or Spark's RDD API
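
The committer is controlled through a Spark setting; on recent EMR releases it is on by default. A sketch of enabling it explicitly and performing a write it applies to (bucket is a placeholder):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("committer-example")
    # Explicitly enable the EMRFS S3-optimized committer
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate()
)

# Parquet written to S3 via EMRFS, so the optimized committer is used here.
spark.range(1000).write.mode("overwrite").parquet("s3://my-bucket/out/")
```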

AWS Glue Pricing

  • Crawlers and ETL jobs - hourly rate, billed by the second
  • AWS Glue Data Catalog - monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free.
  • Development endpoint - hourly rate, billed per second.
  • AWS Glue DataBrew - interactive sessions are billed per session and DataBrew jobs are billed per minute
  • AWS Glue Schema Registry is offered at no additional charge

Instance Storage/EBS Volumes

  • Instance storage/EBS volumes provide storage for HDFS; temporary storage for buffers, caches, and scratch areas; and the spill-to-disk feature used by some applications
  • EBS volumes attached to EMR cluster nodes are ephemeral
  • Additional EBS volumes can be added as storage volumes but not as boot volumes
  • It is not possible to snapshot and restore an EBS volume attached to an EMR cluster node
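
A minimal sketch of requesting additional EBS storage volumes for core nodes at cluster creation time; the volume type, size, and counts are placeholders:

```python
# Instance group definition with extra EBS volumes; pass this dict in
# Instances["InstanceGroups"] of run_job_flow.
core_group = {
    "InstanceRole": "CORE",
    "InstanceType": "m5.xlarge",
    "InstanceCount": 2,
    "EbsConfiguration": {
        "EbsOptimized": True,
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 256},
                "VolumesPerInstance": 2,  # two extra volumes per core node
            }
        ],
    },
}
```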

Instance configuration

  • The configuration of cluster instances - master, core, and task nodes - can be specified as either an instance fleet or uniform instance groups

  • EMR Console, API and AWS CLI can be used to create a cluster with one of these two configuration options

  • Instance fleet (see the sketch at the end of this page)

    • Offers the widest variety of provisioning options for EC2 instances
    • Each node type (master/core/task) has a single fleet; the task node fleet is optional
    • For each instance fleet, up to five EC2 instance types can be specified, which can be provisioned as On-Demand and Spot instances
    • For the core and task instance fleets, a target capacity is specified for both On-Demand and Spot instances
    • For the master node type, a single instance is provisioned from the specified list of instance types
  • Uniform Instance Group

    • Offers a simplified interface
    • A maximum of 50 instance groups can be associated with a cluster
    • One master instance group that contains one EC2 instance
    • One core instance group that contains one or more EC2 instances
    • Up to 48 task instance groups, each configurable with any number of EC2 instances
  • Integration with AWS Lake Formation

    • Service role for EMR cluster EC2 instances - EMR_EC2_DefaultRole, also known as the instance profile
    • For integration with AWS Lake Formation, a different custom role/instance profile should be used; it must be specified at cluster creation time
    • The following applications are currently supported
      • EMR notebook
      • Apache Zeppelin
      • Apache Spark through EMR notebook
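
Referring back to the instance fleet bullets above, a minimal sketch of an InstanceFleets configuration for run_job_flow; instance types, capacities, and names are placeholders:

```python
# Instance fleet sketch: one master fleet and one core fleet mixing
# On-Demand and Spot capacity across multiple instance types.
instance_fleets = [
    {
        "Name": "master-fleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        # One of these types is provisioned for the single master instance
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge"},
            {"InstanceType": "m5a.xlarge"},
        ],
    },
    {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [  # up to five types per fleet
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    },
]
# Pass as Instances={"InstanceFleets": instance_fleets} to run_job_flow.
```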