Analyzing Big Data with Amazon EMR - nzsaurabh/hadoop_training GitHub Wiki
Getting Started
Create an Amazon EMR cluster using Quick Create options in the AWS Management Console.
Submit a Hive script as a step to process data stored in Amazon Simple Storage Service (Amazon S3).
Remember that charges accrue for cluster instances at the per-second Amazon EMR rate, and for the query output files that you store in Amazon S3.
The setup
Specify an Amazon S3 bucket and folder to store the output data from a Hive query.
Folder names can't end in a number and can't contain uppercase letters.
Bucket names must be unique across all AWS accounts.
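The two naming rules above can be checked before anything is created. This is a minimal sketch that tests only those two rules; the function name `is_valid_emr_name` is hypothetical, and Amazon S3 applies further naming rules that are not covered here.

```python
def is_valid_emr_name(name: str) -> bool:
    """Check a bucket or folder name against the two rules in these notes.

    Hypothetical helper: it does not cover the full set of S3 naming rules.
    """
    if not name:
        return False
    # Rule 1: no uppercase letters anywhere in the name.
    if any(ch.isupper() for ch in name):
        return False
    # Rule 2: the name must not end in a digit.
    return not name[-1].isdigit()
```

For example, `is_valid_emr_name("hive-output")` passes, while `"HiveOutput"` and `"output2"` are rejected.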
Create an Amazon EC2 key pair; how you use it to connect depends on your operating system.
To log in to your instance, you must create a key pair, specify the name of the key pair when you launch the instance, and provide the private key when you connect to the instance.
With Windows instances, you use the private key to obtain the administrator password and then log in using RDP (Remote Desktop Protocol).
Amazon EC2 stores the public key only, and you store the private key. Anyone who possesses your private key can decrypt your login information, so it's important that you store your private keys in a secure place.
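Since only you hold the private key, it helps to write it to disk with restrictive permissions as soon as it is created. The sketch below assumes a boto3 EC2 client is passed in (its `create_key_pair` call returns the private key as `KeyMaterial`); the helper name `save_key_pair` and the `.pem` file layout are illustrative choices, not part of the original notes.

```python
import os
from pathlib import Path


def save_key_pair(ec2_client, key_name: str, directory: str) -> Path:
    """Create a key pair and store the private key in <directory>/<key_name>.pem.

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    response = ec2_client.create_key_pair(KeyName=key_name)
    pem_path = Path(directory) / f"{key_name}.pem"
    # O_EXCL refuses to overwrite an existing key file; mode 0o600 requests
    # owner-only read/write permissions (on POSIX systems).
    fd = os.open(pem_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(response["KeyMaterial"])
    return pem_path
```

With boto3 installed and credentials configured, a call might look like `save_key_pair(boto3.client("ec2"), "emr-tutorial-key", ".")`, where the key name is a placeholder.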
The cluster
Choose Create cluster in the Amazon EMR console.
Choose the EC2 key pair.
Add inbound rules that allow SSH traffic from trusted clients.
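The console choices for creating the cluster can also be expressed as a request for the boto3 `run_job_flow` call. This is a rough sketch, not the Quick Create defaults: the helper name `build_cluster_request`, the release label, the instance type and count, and the default role names are all assumptions to adjust for your account.

```python
def build_cluster_request(name, key_pair_name, log_uri,
                          release_label="emr-5.36.0",
                          instance_type="m5.xlarge",
                          instance_count=3):
    """Build keyword arguments for an EMR run_job_flow call (assumed values)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        # Hive must be installed for the Hive step used later.
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # The EC2 key pair chosen during setup, needed for SSH access.
            "Ec2KeyName": key_pair_name,
            # Keep the cluster running so steps can be added afterwards.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 available, the request would be submitted as `boto3.client("emr").run_job_flow(**request)`. Keeping the cluster alive means charges continue until it is terminated.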
SSH
Under Security and access choose the Security groups for Master link.
Choose ElasticMapReduce-master from the list.
Choose Inbound, Edit to add a new inbound rule.
Scroll down through the listing of default rules and choose Add Rule at the bottom of the list.
For Type, select SSH. For Source, select My IP. This automatically adds the IP address of your client computer as the source address. If your IP address is allocated dynamically, you may need to periodically edit the security group rules to update the IP address of trusted clients.
Choose ElasticMapReduce-slave from the list and repeat the steps above to allow SSH client access to core and task nodes from trusted clients.
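The inbound rule from the steps above can be described as data and applied to both security groups. This is a minimal sketch: `build_ssh_ingress_rule` is a hypothetical helper, it handles IPv4 only, and the resulting dictionary follows the `IpPermissions` shape accepted by the boto3 EC2 call `authorize_security_group_ingress`.

```python
import ipaddress


def build_ssh_ingress_rule(client_ip: str,
                           description: str = "SSH from trusted client") -> dict:
    """Return an IpPermissions entry allowing SSH (TCP 22) from one IPv4 address."""
    address = ipaddress.ip_address(client_ip)  # raises ValueError if malformed
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 restricts the rule to exactly one address, like "My IP" in the console.
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
    }
```

A call such as `ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[rule])` would then add the rule, once for the master group and once for the core and task node group. Re-running it with the new address is one way to handle a dynamically allocated IP.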
Process Data By Running The Hive Script as a Step
In Amazon EMR, a step is a unit of work that contains one or more jobs.
You can also specify steps when you create a cluster.
Alternatively, you can connect to the master node, create the script in the local file system, and run it from the command line, for example hive -f Hive_CloudFront.q.
Go to the cluster and choose Add step. For Step type, choose Hive program, then fill in Script S3 location, Input S3 location, and Output S3 location, and choose Add.
After the script has run, the results are written to the Output S3 location.
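The same step can be described programmatically. The sketch below builds a step definition for the boto3 EMR call `add_job_flow_steps`, on the assumption that a Hive step is run through `command-runner.jar` with `hive-script` arguments and that the script reads its locations from variables named `INPUT` and `OUTPUT`. The helper name `build_hive_step` is hypothetical.

```python
def build_hive_step(script_uri: str, input_uri: str, output_uri: str,
                    name: str = "Hive program") -> dict:
    """Build one entry for the Steps list of add_job_flow_steps (assumed layout)."""
    for uri in (script_uri, input_uri, output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// location, got {uri!r}")
    return {
        "Name": name,
        # Leave the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_uri,
                # -d defines variables the script can reference
                # as ${INPUT} and ${OUTPUT}.
                "-d", f"INPUT={input_uri}",
                "-d", f"OUTPUT={output_uri}",
            ],
        },
    }
```

With a boto3 EMR client, the step would be submitted as `emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of the running cluster.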