Amazon EC2 for beginners - Rdatatable/data.table GitHub Wiki

This explains in a minimal way how to start and use a spot instance on Amazon EC2. There are several good blog articles but I found they didn't cover some aspects in great detail. There are so many services and options provided by Amazon that I found it quite daunting at first when all I wanted was a large memory machine for a few hours every now and again. This is a wiki page so you can easily update and improve it as time passes - press Edit in the top right.

r3.xlarge: 30GB RAM, 4 cores, $0.03/hour
r3.8xlarge: 244GB RAM, 32 cores, $0.25/hour

Spot instances are cheap because, should you be outbid by someone else or spare capacity be reallocated, they can be killed at any moment by Amazon with no notice. However, in my experience (so far) that rarely happens. A spot instance is ideal for large data benchmarking and research jobs; i.e. tasks that can simply be restarted should they be killed.

What are you waiting for?

  1. Create an account (including your credit card details): http://aws.amazon.com/ec2/

  2. Get to the EC2 Management Console and bookmark it in your browser.

  3. Click Spot Requests in the left hand menu.

  4. Resist the temptation to click the blue "Request Spot Instances" button; click the grey Pricing History button at the top instead.

  5. Change the instance type in the drop down at the top to the one you want; e.g. r3.8xlarge. You have to use another source to find out how much RAM and how many cores each instance name corresponds to; e.g. http://www.ec2instances.info/. Observe the price history and the current price. If this isn't acceptable, close the price history and change the region in the drop down box in the black area at the top right of the Management Console. Then click Pricing History again. Keep changing region and type until you find a combination where the price is acceptable. Each region/type combination is priced separately.

  6. Check the instance limit for the instance type you want to request by clicking on Limits in the left sidebar. If you request an instance type whose limit is 0, you will get an error like "spot instance limit exceeded". In the Limits section you can also request a limit increase for any instance type. ATTENTION: Make sure you state in the comments that the limit increase is for 'spot instances', not for 'on demand instances'. I didn't, and it was difficult to get what I wanted (although overall support was OK).

  7. Now click the blue "Request Spot Instances" button. Note that this isn't the same as the "Launch instance" button in the Instances view (although that is where we'll view the spot instance in a moment).

Step 1: (Choose an Amazon Machine Image) The Quick Start machine images are selected by default. Choose the Ubuntu one, Ubuntu Server 16.04 LTS (HVM), SSD Volume Type, 64bit. This is a brand new, blank and factory fresh Linux server. Simple. No dependencies. No software or libraries pre-installed that might be out of date.

Step 2: (Choose an Instance Type) Choose r3.8xlarge (244GB RAM and 32 cores).

Step 3: (Configure Instance Details) The maximum bid price is the only field you need to complete. This is the maximum you're prepared to pay per hour. Start with the current spot price from point 5 above and, with knowledge of the history, add some margin; e.g., if the spot price is $0.25 then I tend to bid $0.50. Should you be outbid you have no opportunity to increase your bid: your instance will just be killed instantly.

Step 4: (Add Storage) If you are going to generate data into csv files you need to add extra disk space, because by default only an 8GB disk partition is created. Just change the default size of 8GB to the size you require.

Step 5: (Tag Instance) Next

Step 6: (Configure Security Group) SSH (port 22) is already open by default. It's important to add HTTP (80) and HTTPS (443) otherwise R can't download packages. Optional: In the security group name field, change "launch-wizard-1" to "R Server", then next time you can just choose "Existing security group" instead. [UPDATE 12/2017] In my case, it seems that ssh was not open by default and also apt-key was hard to get to work. In the end I added an outbound rule for "all traffic" to destination "Anywhere" and an inbound rule for "SSH" from source "anywhere".
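
The bid margin from Step 3 above is simple arithmetic, and it can help to see what a session could cost in the worst case before you launch. The following is only a sketch: `suggest_bid` and `worst_case_cost` are hypothetical helper names (not AWS tools), and the default multiplier of 2 is an assumption taken from the $0.25 to $0.50 example.

```shell
# Sketch: derive a maximum bid from the current spot price.
# suggest_bid is a hypothetical helper, not part of any AWS tooling.
suggest_bid() {
  # $1 = current spot price in USD per hour
  # $2 = safety multiplier (assumed default: 2)
  awk -v price="$1" -v mult="${2:-2}" 'BEGIN { printf "%.2f\n", price * mult }'
}

# Upper bound on the cost of a session, i.e. what you would pay
# if the spot price sat at your bid for every hour.
# $1 = bid in USD per hour, $2 = number of hours
worst_case_cost() {
  awk -v bid="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", bid * hours }'
}

suggest_bid 0.25          # prints 0.50
worst_case_cost 0.50 6    # prints 3.00
```

In practice you normally pay the current spot price rather than your bid, so the second figure is a ceiling, not an estimate.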

  1. Click Review and launch

  2. Click Launch

  3. (Select an existing key pair or create a new key pair) Select "Create a new key pair". The "Key Pair name" field is just the name of the file that will be created on your local machine. It seems a different file is needed for each Amazon region, so I have "~/mdowle.pem" for N.California, "~/mdowleOregon.pem" for Oregon, etc. Enter the file name (without the .pem extension) into the field, click the "Download Key Pair" button and save the file somewhere within easy reach (I save them in my home directory ~). Next time you can just "choose an existing key pair" and it will find the appropriate .pem file for that Amazon region.

  4. Check the tick box and click "Request Spot Instance" blue button.

Your request will now appear as a new line in the "Spot Requests" view. After at most a minute the status will change from yellow to a green state "active" and status "fulfilled". However, the view does not refresh automatically so you need to click the refresh button in the top right every 10 seconds or so. You can now change to the "Instances" view (INSTANCES=>Instances on the left menu) and you have a new line there as well. Your instance is now running and you are being charged per hour whether it is idle or not. Ensure you don't forget to kill any running instances when you're finished otherwise you'll get a surprise when the monthly bill arrives in your inbox. There are no time limits or warnings about running instances you may have forgotten to terminate. [UPDATE 12/2017] There is an item at the bottom of the page: "Request valid until" where you can enter a timeout date. I haven't tested that it really kills the instance, though.
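
Since nothing warns you about forgotten instances, it can be worth checking from the command line as well as in the console. This is a sketch that assumes the AWS CLI is installed and configured with your credentials, which this page does not otherwise cover; `list_running` is a hypothetical function name and the default region `us-west-1` (N.California) is just an example.

```shell
# Sketch: list running instances in one region with the AWS CLI.
# Assumes `aws` is installed and configured; list_running is an
# illustrative name and us-west-1 an example default region.
list_running() {
  region="${1:-us-west-1}"
  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' \
    --output text
}

# Usage (repeat for every region you have launched instances in):
#   list_running us-west-2
```

Instances are listed per region, so an instance left running in Oregon will not show up when you query N.California.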

  1. Select the instance (if not already selected) by ensuring the blue check box is filled in so that the grey Connect button at the top is active and click it. This doesn't really connect, it just displays a window showing you how to connect.

  2. Copy the example line from this window, for example :

ssh -i mdowle.pem ubuntu@54.67.82.235

NB: The .pem filename and the IP address will be different for you.
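
If ssh refuses the key with an "UNPROTECTED PRIVATE KEY FILE" warning, the downloaded .pem file is readable by other users and its permissions need tightening first. A minimal sketch follows; `restrict_key` is a hypothetical helper name and `~/mdowle.pem` is the example key path used on this page.

```shell
# Sketch: make a private key readable by its owner only,
# which ssh requires before it will use the key.
restrict_key() {
  # $1 = path to the .pem file
  chmod 400 "$1"
}

# Usage, assuming the key was saved in the home directory:
#   restrict_key ~/mdowle.pem
#   ls -l ~/mdowle.pem
```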

  1. Paste this into a shell (I paste it into my editor's shell). Either do this in the directory where you saved the .pem to or include the path to the .pem file. That's why I put the .pem files in ~ to make this easy since the shell opens in the home directory. Enter "yes" to "Are you sure you want to continue connecting (yes/no)?"

  2. You now have a prompt to a factory fresh large-memory machine. Type free -h. Type lscpu. Smile.

  3. I have the following startup script in my editor which I run by pressing F5.

# Register the CRAN signing key and repository.
# "xenial" matches the Ubuntu 16.04 image chosen in Step 1; use "trusty" on 14.04.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo add-apt-repository 'deb http://cran.stat.ucla.edu/bin/linux/ubuntu xenial/'
sudo apt-get update
sudo apt-get -y install r-base-core
sudo apt-get -y install libcurl4-openssl-dev    # for RCurl which devtools depends on
sudo apt-get -y install htop                    # to monitor RAM and CPU
R
# The remaining lines are typed at the R prompt, not the shell.
options(repos = "http://cran.stat.ucla.edu")
install.packages("devtools")
require(devtools)
# Use R as normal ...

Once you're used to it, you can get to this point in under 5 minutes.

  1. Start another shell, paste in the same ssh to connect and type htop. Leave this running to monitor RAM and CPU usage on the remote instance.

  2. Type df -h and observe that the disk is not large. However, you have 244GB of RAM. Use it as a RAM disk by writing to and reading from /dev/shm, which also gives very fast disk access. Even if you use 100GB of RAM disk, you'll still have about 144GB of RAM. Anything under /dev/shm disappears with the instance, so transfer any results you want to keep from the server to your local machine.

  3. To transfer files to and from the server :

# To copy to EC2 (final colon needed):
scp -i ~/mdowle.pem localFile.csv ubuntu@54.183.161.72:

# To copy from EC2 (final space then dot is needed):
scp -i ~/mdowle.pem ubuntu@54.183.161.72:~/remoteFile.csv .
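
The RAM disk advice in step 2 above can be sketched as follows, run on the instance itself. The directory and file names are hypothetical; the only thing taken from this page is that /dev/shm is held in memory.

```shell
# Sketch: use /dev/shm (memory-backed storage) as scratch space.
# SCRATCH and example.csv are illustrative names.
SCRATCH="${SCRATCH:-/dev/shm/scratch}"
mkdir -p "$SCRATCH"

# Write a small CSV to the RAM disk and count its lines to confirm it is there.
printf 'id,value\n1,10\n2,20\n' > "$SCRATCH/example.csv"
wc -l < "$SCRATCH/example.csv"

# Show how much of the RAM disk is in use.
df -h /dev/shm
```

Files written here count against the machine's RAM, so remove them once they have been copied off the instance.
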

  1. Terminate your spot instance

When you're finished, be sure to terminate your spot instance in the correct way using the Management Console. Use the Instances view (the same view where you clicked the Connect button), select the instance, click the grey Actions button at the top, select the Instance State submenu and then Terminate.
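
For completeness, the same termination can be done from the command line. This is a sketch under assumptions not covered on this page: the AWS CLI must be installed and configured, `terminate_instance` is a hypothetical function name, and the instance ID and region shown are placeholders.

```shell
# Sketch: terminate one instance with the AWS CLI.
# Assumes `aws` is installed and configured with credentials.
terminate_instance() {
  # $1 = instance ID, as shown in the Instances view
  # $2 = region (example default: us-west-1)
  aws ec2 terminate-instances --region "${2:-us-west-1}" --instance-ids "$1"
}

# Usage (placeholder ID):
#   terminate_instance i-0123456789abcdef0 us-west-1
```

Whichever route you use, check the Instances view afterwards and confirm the state has changed to "terminated".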