EMR 003 EMR Two Step Bootstrap Actions

From time to time, we need to run a script on all nodes in an EMR cluster. This is commonly achieved by using a bootstrap action. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Amazon EMR installs specified applications and the node begins processing data. If you add nodes to a running cluster, bootstrap actions run on those nodes also.

However, because bootstrap actions run before the applications are installed, it is difficult to deploy application-specific configuration files to all nodes. For example, if you want to add a Presto connector to your EMR cluster, the connector needs to reside in the Presto configuration folder (/etc/presto/conf/catalog). This folder does not exist until Presto is installed, while the bootstrap action script runs before EMR installs Presto. If you force the bootstrap script to create the destination folder /etc/presto/conf/catalog before Presto is installed by EMR, you will find that Presto fails to start.

As such, we want to be able to run a script on all nodes only AFTER a certain criterion is met, for example when the Presto configuration folder (/etc/presto/conf/catalog) becomes available. This can be achieved with a two-step approach.

For example, you can have the following two PostgreSQL connectors, each pointing to a different PostgreSQL database:

postgre_01.properties

connector.name=postgresql
connection-url=jdbc:postgresql://server_01:5432/database_01
connection-user=username
connection-password=secret

postgre_02.properties

connector.name=postgresql
connection-url=jdbc:postgresql://server_02:5432/database_02
connection-user=username
connection-password=secret

Let's assume that you have these two properties files in your S3 bucket as s3://your-bucket-name/postgre_01.properties and s3://your-bucket-name/postgre_02.properties. We use a two-step approach to run the bootstrap script, so that the properties files are installed only after Presto has been installed.

presto_pre.sh

#!/bin/bash
# Download the second-step script from S3 and run it in the background,
# so that this bootstrap action can return immediately.
aws s3 cp s3://your-bucket-name/presto_bs.sh .
chmod +x presto_bs.sh
nohup ./presto_bs.sh &>/dev/null &

presto_bs.sh

#!/bin/bash
# Wait until EMR has installed Presto and the catalog folder exists.
while [ ! -d /etc/presto/conf/catalog ]
do
  sleep 1
done
# Copy the connector properties files into the Presto catalog folder.
cd /etc/presto/conf/catalog
sudo aws s3 cp s3://your-bucket-name/postgre_01.properties .
sudo aws s3 cp s3://your-bucket-name/postgre_02.properties .

Launch an EMR cluster with s3://your-bucket-name/presto_pre.sh as the bootstrap action script. The bash script presto_pre.sh downloads presto_bs.sh and executes it in the background. After Presto is properly installed (the directory /etc/presto/conf/catalog becomes available), presto_bs.sh downloads the two connectors from your S3 bucket.
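
For reference, launching such a cluster from the AWS CLI could look roughly like the sketch below. The release label, instance type and count, and key pair name are illustrative placeholders rather than values prescribed by this tutorial.

aws emr create-cluster \
  --name presto-two-step-bootstrap \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Presto \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair \
  --bootstrap-actions Path=s3://your-bucket-name/presto_pre.sh,Name=presto_pre

The only part that matters for the two-step approach is the --bootstrap-actions argument, which points EMR at presto_pre.sh in your S3 bucket.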

The following scripts demonstrate how to set up different Presto run-time parameters on the coordinator and worker nodes. Our goal is to modify /etc/presto/conf/config.properties after it becomes available and is no longer being modified by EMR. As such, we want to run the second script only after the presto-server process is already running.

(1) The bootstrap action script (presto_pre.sh) launches another script in the background. The bootstrap action script then exits normally, allowing the EMR cluster to continue starting up. For example:

#!/bin/bash
# Download the second-step script, launch it in the background with root
# privileges, then exit so that the bootstrap action completes normally.
aws s3 cp s3://your-bucket-name/presto_bs.sh /tmp/presto_bs.sh
chmod +x /tmp/presto_bs.sh
sudo /tmp/presto_bs.sh &
exit 0

(2) The second script (presto_bs.sh), which is downloaded and started by the bootstrap action script, runs in the background. It waits for the Presto service to be running (so that EMR will no longer modify the configuration file), then performs the corresponding actions. For example, the following script uses different query.max-memory and query.max-memory-per-node values on the coordinator node and the worker nodes:

#!/bin/bash
# Wait until the presto-server service reports "start/running".
status presto-server | grep "start/running"
while [ $? != 0 ]
do
  sleep 10
  status presto-server | grep "start/running"
done

# The coordinator node has coordinator=true in its configuration file.
if grep -Fxq "coordinator=true" /etc/presto/conf/config.properties
then
  # coordinator configuration:
  # set every query.max-memory* parameter to 10GB
  sed -i '/query.max-memory/s/=.*/=10GB/' /etc/presto/conf/config.properties
else
  # worker configuration:
  # set every query.max-memory* parameter to 30GB
  sed -i '/query.max-memory/s/=.*/=30GB/' /etc/presto/conf/config.properties
fi

# Restart presto-server so that the new settings take effect.
stop presto-server
start presto-server

In the ideal case, we may want to wait for the EMR cluster to enter the WAITING state before implementing further customizations. This allows the EMR cluster to finish installing and configuring Hadoop and other applications as expected. On the EMR nodes, there is a configuration file /mnt/var/lib/info/job-flow.json that contains the EMR cluster id. We can use the following steps (sketched in the script after this list) to achieve what we want:

  • Get the EMR cluster id from /mnt/var/lib/info/job-flow.json

  • Get the status of the EMR cluster with the AWS CLI

  • If the status is not WAITING, sleep for 2 minutes and try again. The reason is that it usually takes more than 10 minutes for the EMR cluster to reach the WAITING state, so there is no need to poll with a smaller interval.

  • When the status becomes WAITING, perform the customization, for example using chmod to modify the permissions.
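
The steps above can be sketched as the following bash script. This is a minimal sketch only: it assumes the node's EC2 instance profile is allowed to call elasticmapreduce:DescribeCluster, that the region placeholder is replaced with your own region, and that /mnt/var/lib/info/job-flow.json is pretty-printed with jobFlowId on its own line (jq would give a more robust parse).

#!/bin/bash
#
# Sketch: wait for the EMR cluster to reach the WAITING state, then customize.
#
REGION=us-east-1   # placeholder - replace with the region your cluster runs in

# Step 1: get the EMR cluster id from the local job flow description.
CLUSTER_ID=$(grep jobFlowId /mnt/var/lib/info/job-flow.json | cut -d'"' -f4)

# Steps 2-3: poll the cluster state every 2 minutes until it becomes WAITING.
STATE=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region "$REGION" \
          --query 'Cluster.Status.State' --output text)
while [ "$STATE" != "WAITING" ]
do
  sleep 120
  STATE=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region "$REGION" \
            --query 'Cluster.Status.State' --output text)
done

# Step 4: the cluster is WAITING; apply the customization, for example chmod.
exit 0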

A related approach that does not require calling the AWS CLI at all is to wait on the node itself until provisioning is reported as SUCCESSFUL in the local state file /emr/instance-controller/lib/info/job-flow-state.txt. This can be implemented in the following bash script (in script B):

#!/bin/bash
#
# Wait for EMR provisioning to become successful.
#
while [ "$(sed '/localInstance {/{:1; /}/!{N; b1}; /nodeProvision/p}; d' /emr/instance-controller/lib/info/job-flow-state.txt | sed '/nodeProvisionCheckinRecord {/{:1; /}/!{N; b1}; /status/p}; d' | awk '/SUCCESSFUL/' | xargs)" != "status: SUCCESSFUL" ];
do
  sleep 1
done
#
# Now the EMR cluster is ready. Do your work here.
#

exit 0