Install External Libs on Running AWS EMR cluster - isgaur/AWS-BigData-Solutions GitHub Wiki
Create an instance profile for Systems Manager
- Open the IAM console.
- In the navigation pane, choose Roles, and then choose the role that's associated with the EC2 instances in your EMR cluster. By default, this role is named EMR_EC2_DefaultRole.
- On the Permissions tab, choose Attach policy.
- On the Attach policy page, select the check box next to AmazonEC2RoleforSSM, and then choose Attach policy.
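The console steps above can also be scripted with boto3. A minimal sketch, assuming the default EMR_EC2_DefaultRole role name (adjust if your cluster uses a custom role):

```python
def attach_ssm_policy(iam, role_name="EMR_EC2_DefaultRole"):
    """Attach the AmazonEC2RoleforSSM managed policy to the cluster's EC2 role."""
    policy_arn = "arn:aws:iam::aws:policy/AmazonEC2RoleforSSM"
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return policy_arn

# Usage (requires AWS credentials with iam:AttachRolePolicy):
#   import boto3
#   attach_ssm_policy(boto3.client("iam"))
```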
Install libraries on the master node
- Connect to the master node using SSH.
- Use a bash script saved to Amazon Simple Storage Service (Amazon S3) to install libraries on the master node. The script in the following example uses easy_install-3.4 to install pip. Then the script uses pip to install paramiko, nltk, scipy, scikit-learn, and pandas for the Python 3 kernel:
```bash
#!/bin/bash
sudo easy_install-3.4 pip
sudo /usr/local/bin/pip3 install paramiko nltk scipy scikit-learn pandas
```

Install libraries on the core and task nodes
- Create a Python script to install libraries on the core and task nodes. For an example script, see Example Installing Libraries on Core Nodes of a Running Cluster in Using Libraries and Installing Additional Libraries.
- Save the script to your local machine or an Amazon Elastic Compute Cloud (Amazon EC2) instance.
- Run a command similar to the following to execute the script. The script takes two arguments: your cluster ID and the S3 location of the bash script that you created earlier.
```shell
python sample.py j-1K48XXXXXXHCB s3://mybucket/script-ssm.sh
```
Note: The IAM user or role that executes the script must have appropriate permissions for Systems Manager. For more information, see Configure User Access for Systems Manager.
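As an illustrative sketch only, an identity policy for the user or role running the script needs at least the API calls the script makes (ssm:SendCommand, ssm:ListCommands, elasticmapreduce:ListInstances, ec2:CreateTags). Scope the Resource down for production use; see Configure User Access for Systems Manager for the full requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ssm:SendCommand",
      "ssm:ListCommands",
      "elasticmapreduce:ListInstances",
      "ec2:CreateTags"
    ],
    "Resource": "*"
  }]
}
```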
Reference bash script (the script uploaded to S3):

```bash
#!/bin/bash
sudo easy_install-3.4 pip
sudo /usr/local/bin/pip3 install paramiko nltk scipy scikit-learn pandas
```
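After the bash script runs on a node, one way to confirm the libraries are importable from the Python 3 kernel is a small check. Note the names passed in are import names, not package names (scikit-learn imports as sklearn); this helper is illustrative and not part of the original scripts:

```python
import importlib.util

def missing_modules(names):
    """Return the module names that are not importable in this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# On a cluster node you might run:
#   missing_modules(["paramiko", "nltk", "scipy", "sklearn", "pandas"])
# An empty list means every library installed correctly.
```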
Python script:

```python
# Install Python libraries on running cluster nodes
from boto3 import client
from sys import argv
import sys
import time

try:
    clusterId = argv[1]
    script = argv[2]
except IndexError:
    print("Syntax: librariesSsm.py [ClusterId] [S3_Script_Path]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes (add 'TASK' to InstanceGroupTypes to target task nodes too)
instances = emrclient.list_instances(ClusterId=clusterId,
                                     InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Attach a tag to the core nodes so Systems Manager can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list,
                      Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

ssmclient = client('ssm')

# Download the shell script from S3
command = "aws s3 cp " + script + " /home/hadoop"
try:
    first_command = ssmclient.send_command(
        Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
        DocumentName='AWS-RunShellScript',
        Parameters={"commands": [command]},
        TimeoutSeconds=3600)['Command']['CommandId']

    # Wait for the command to execute
    time.sleep(15)
    first_command_status = ssmclient.list_commands(
        CommandId=first_command)['Commands'][0]['Status']

    second_command = ""
    second_command_status = ""

    # Only execute the second command if the first command succeeded
    if first_command_status == 'Success':
        # Run the downloaded shell script to install the libraries
        # (the file name is derived from the S3 path passed as the second argument)
        second_command = ssmclient.send_command(
            Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
            DocumentName='AWS-RunShellScript',
            Parameters={"commands": ["bash /home/hadoop/" + script.split('/')[-1]]},
            TimeoutSeconds=3600)['Command']['CommandId']

        # Wait, then check the status of the second command
        time.sleep(30)
        second_command_status = ssmclient.list_commands(
            CommandId=second_command)['Commands'][0]['Status']

    print("First command, " + first_command + ": " + first_command_status)
    print("Second command, " + second_command + ": " + second_command_status)

except Exception as e:
    print(e)
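The fixed time.sleep() calls in the script above are a simplification: an SSM command can easily take longer than 15 or 30 seconds. As an alternative sketch, you can poll list_commands until the command reaches a terminal state. The helper below takes the SSM client as an argument so it is easy to reuse; the name wait_for_command is an assumption, not part of the original script:

```python
import time

# Terminal states for an SSM command, per the ListCommands status values
TERMINAL_STATES = {"Success", "Cancelled", "Failed", "TimedOut"}

def wait_for_command(ssm, command_id, poll_seconds=10, max_polls=60):
    """Poll until the SSM command reaches a terminal state; return that status."""
    for _ in range(max_polls):
        status = ssm.list_commands(CommandId=command_id)["Commands"][0]["Status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    return "TimedOut"  # local polling limit reached, not SSM's own timeout
```

This replaces both sleep calls: after each send_command, call wait_for_command and compare the returned status against 'Success'.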