Install External Libs on Running AWS EMR cluster - isgaur/AWS-BigData-Solutions GitHub Wiki

Create an instance profile for Systems Manager

  1. Open the IAM console.

  2. In the navigation pane, choose Roles, and then choose the role that's associated with the EC2 instances in your EMR cluster. By default, this role is named EMR_EC2_DefaultRole.

  3. On the Permissions tab, choose Attach policy.

  4. On the Attach policy page, select the check box next to AmazonEC2RoleforSSM, and then choose Attach policy.
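The console steps above can also be scripted. Below is a minimal boto3 sketch; the live `attach_role_policy` call needs AWS credentials with `iam:AttachRolePolicy` permission, so it is shown commented out and the snippet only builds the identifiers:

```python
# Role and managed policy from the steps above:
# EMR_EC2_DefaultRole is the default EC2 role for EMR clusters,
# and AmazonEC2RoleforSSM grants the instances access to Systems Manager.
role_name = "EMR_EC2_DefaultRole"
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2RoleforSSM"

# Uncomment to perform the attachment (requires AWS credentials):
# import boto3
# boto3.client("iam").attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

print(policy_arn)
```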

Install libraries on the master node

  1. Connect to the master node using SSH.

  2. Use a bash script saved to Amazon Simple Storage Service (Amazon S3) to install libraries on the master node. The script in the following example uses easy_install-3.4 to install pip. Then the script uses pip to install paramiko, nltk, scipy, scikit-learn, and pandas for the Python 3 kernel:

```bash
#!/bin/bash
sudo easy_install-3.4 pip
sudo /usr/local/bin/pip3 install paramiko nltk scipy scikit-learn pandas
```

Install libraries on the core and task nodes

  1. Create a Python script to install libraries on the core and task nodes. For an example script, see Example Installing Libraries on Core Nodes of a Running Cluster in Using Libraries and Installing Additional Libraries.

  2. Save the script to your local machine or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

  3. Run a command similar to the following to execute the script. The script takes two arguments: your cluster ID and the S3 location of the bash script that you created earlier.

```
python sample.py j-1K48XXXXXXHCB s3://mybucket/script-ssm.sh
```

Note: The IAM user or role that executes the script must have appropriate permissions for Systems Manager. For more information, see Configure User Access for Systems Manager.
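As a rough illustration of the permissions the note refers to, the IAM user or role running the script needs at least the actions used by the script below (this is a sketch, not an official policy; scope `Resource` down for production use):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListInstances",
        "ec2:CreateTags",
        "ssm:SendCommand",
        "ssm:ListCommands"
      ],
      "Resource": "*"
    }
  ]
}
```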

Ref. Bash script:

```bash
#!/bin/bash
sudo easy_install-3.4 pip
sudo /usr/local/bin/pip3 install paramiko nltk scipy scikit-learn pandas
```

Python Script

Install Python libraries on the core nodes of a running cluster:

```python
#!/usr/bin/env python
# Install Python libraries on running EMR cluster core nodes via Systems Manager.
import sys
import time

from boto3 import client

try:
    clusterId = sys.argv[1]
    script = sys.argv[2]
except IndexError:
    print("Syntax: librariesSsm.py [ClusterId] [S3_Script_Path]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes
instances = emrclient.list_instances(ClusterId=clusterId,
                                     InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Attach a tag to the core nodes so Systems Manager can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list,
                      Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

ssmclient = client('ssm')

# Download the shell script from S3 onto each tagged node
command = "aws s3 cp " + script + " /home/hadoop"
try:
    first_command = ssmclient.send_command(
        Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
        DocumentName='AWS-RunShellScript',
        Parameters={"commands": [command]},
        TimeoutSeconds=3600)['Command']['CommandId']

    # Wait for the download command to execute
    time.sleep(15)

    first_command_status = ssmclient.list_commands(
        CommandId=first_command,
        Filters=[{'key': 'Status', 'value': 'SUCCESS'}],
    )['Commands'][0]['Status']

    second_command = ""
    second_command_status = ""

    # Only run the install script if the download succeeded
    if first_command_status == 'Success':
        # Run the shell script to install the libraries.
        # Note: "aws s3 cp" keeps the original file name, so the script in S3
        # must be named install_libraries.sh (or adjust the path below).
        second_command = ssmclient.send_command(
            Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
            DocumentName='AWS-RunShellScript',
            Parameters={"commands": ["bash /home/hadoop/install_libraries.sh"]},
            TimeoutSeconds=3600)['Command']['CommandId']

        # Wait for the install command to execute before checking its status
        time.sleep(30)

        second_command_status = ssmclient.list_commands(
            CommandId=second_command,
            Filters=[{'key': 'Status', 'value': 'SUCCESS'}],
        )['Commands'][0]['Status']

    print("First command, " + first_command + ": " + first_command_status)
    print("Second command: " + second_command + ": " + second_command_status)

except Exception as e:
    print(e)
```
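The fixed `time.sleep` waits in the script above are a simplification: a long-running install can easily outlast them. A more robust pattern is to poll until the command reaches a terminal SSM status. Below is a hypothetical helper sketch; the iterator stub stands in for repeated `ssmclient.list_commands(...)['Commands'][0]['Status']` calls:

```python
import time

# Terminal statuses an SSM command can end in
TERMINAL = frozenset({"Success", "Failed", "Cancelled", "TimedOut"})

def poll_status(get_status, interval=0.01, max_attempts=50):
    """Poll get_status() until it returns a terminal status or attempts run out."""
    status = None
    for _ in range(max_attempts):
        status = get_status()
        if status in TERMINAL:
            break
        time.sleep(interval)
    return status

# Stub standing in for the SSM list_commands status lookup
responses = iter(["Pending", "InProgress", "Success"])
print(poll_status(lambda: next(responses)))  # prints "Success"
```

In the real script, `get_status` would be a small function wrapping the `list_commands` call for a given command ID, with a longer `interval` (seconds, not milliseconds).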