AWS Elastic Compute Cloud (EC2) Instances - Green-Biome-Institute/AWS GitHub Wiki


Elastic Compute Cloud (EC2) Instances

Alright! We’ve made it to the good stuff. Elastic Compute Cloud, or EC2, is the cerebral cortex of our bioinformatics pipeline. EC2 instances are effectively private virtual computers for which you can customize the CPU, memory, storage, networking, and security settings. You can set up an instance from scratch yourself or launch one from an existing template, called an Amazon Machine Image, or AMI. An AMI is simply a saved image that has already been set up with an operating system (like Linux) and any software needed to host a specific service. For our purposes, I will make these pre-configured AMIs with the software that we aim to use for genome assembly. This means all you will have to do is log in, start an instance of the correct size from a pre-made GBI AMI, upload data (and know how to access it), and run the assembly software!

Here, I’ll briefly go over creating EC2 instances, assigning the correct permissions (or checking that a pre-made instance already has them), connecting to the instance, uploading software to the instance (if needed), running software on an instance, interfacing with data stored in S3, and downloading the results to your local computer/data storage location afterwards.

Creating new EC2 instances

Sign into the AWS website using your IAM role and search for EC2 in the search bar at the top of the page. An option for EC2 with the description “Virtual Servers in the Cloud” will pop up. Click on it and it will take you to the EC2 dashboard. This is where you can find all the information about instances that are running on the GBI AWS account. Even though there are pre-made AMIs for you, there are still steps you have to take to build an instance from one of them, which is why I am giving the full description of how to start a new one.

  1. On the left sidebar, under the dropdown menu Instances click on Instances.
  2. At the top right, click on the orange button labelled Launch instances.
  3. Select Ubuntu Server 20.04 LTS (HVM), SSD Volume Type, using 64-bit (x86) which should be the first option.
  4. Choose the instance type that reflects the needs of your computation (this will be talked about more in depth further down on this page, but for first-try purposes, the t2.micro size is free to practice with!) and then select Next: Configure Instance Details.
  5. EBS volumes are your hard drive. You will need enough storage for the operating system and any required software (usually under several GB), the sequencing data you want to assemble, and any working space required by the assembler software itself. This will be optimized as we gain more experience, but for now we will assume that a repetitive plant genome ~1 Gb (gigabase) long will require 2 TB of storage for the assembler to run. 1 TB of EBS storage costs $80/month (and therefore ~$2.6 per TB/day or $5.20 per 2TB/day). Once you stop using the instance, its storage will continue to accrue costs even if you are not doing anything with it, unless you terminate the instance and release the storage. Remember to terminate an instance after you are completely done with it, so that the storage allocated to it no longer gets paid for! Select “Next: Add Tags.”
  6. Here you add several tags (key : value pairs) to associate the instance with yourself and your project, and to track its costs. We are still adjusting these so they work best for us, but for the time being we will use the following five keys:
  • User - example User : your name
  • Instance purpose (description of what you will be doing on the instance) - example instance-purpose : “assembly of Arabidopsis thaliana genome with Illumina short-reads using SOAPdenovo2 assembler”
  • Instance Operating System (name of the instance operating system or AMI title you are using) - example instance-OS : "Ubuntu Server 20.04 LTS (HVM), SSD Volume Type (x86)"
  • Instance size - example instance-size : "t2.micro" or instance-size : "r5.4xlarge"
  • Instance cost (the hourly cost of the instance when stopped and when running. The mounted EBS volume costs $0.08 per GB per month, which works out to roughly $0.0026 per GB per day, and is paid for whether the instance is running or stopped. The running cost is the On-Demand hourly rate for your instance type plus the cost of the mounted EBS volume) - example instance-cost : "STOPPED: 1000GB EBS = $0.12/hr; RUNNING: r5.24xlarge = $6.048/hr + $0.12/hr = $6.168/hr"
  7. For the security group, we currently allow access from anywhere, so simply click Review and Launch. In the future we will set up a security protocol that only allows traffic from university-related resources/people.
  8. Now click Launch to launch the instance.
  9. This will open up a window to select a key pair. A key pair consists of a public key and a private key. The public key is stored by AWS and the private key is stored by the user. Key pairs are used as credentials to log into your instance. There are two ways to go about doing this, and we haven't yet decided on the best path. Either there will be a series of key pairs created for the purpose of genome assembly that all students will have access to (so that people don't get locked out), or each student will create a new key pair for an instance when creating it. For right now, we will go with the second option, so before starting the instance, give the key pair a unique name, download it, and store it in a folder you will remember.
  10. Your instance will now start up and you can go back to the EC2 Instances dashboard to check its status. It may take a few minutes to load.
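The arithmetic behind the instance-cost tag can be sketched in a few lines of shell. This is only an illustration: the function names are made up for this page, the rate is the $0.08 per GB per month quoted above, and a 31-day month is assumed. With those assumptions 1000 GB comes out near $0.11/hr, a little under the rounded-up $0.12/hr used in the example tag.

```shell
# Hypothetical helpers (not part of any AWS tool).
# Assumptions: $0.08 per GB per month (from this page) and a 31-day month.

# Hourly cost in USD of an EBS volume; $1 = volume size in GB.
ebs_hourly_cost() {
    awk -v gb="$1" 'BEGIN { printf "%.3f\n", gb * 0.08 / (31 * 24) }'
}

# Text for the instance-cost tag.
# $1 = volume size in GB, $2 = instance type, $3 = instance hourly rate in USD
instance_cost_tag() {
    awk -v gb="$1" -v type="$2" -v rate="$3" 'BEGIN {
        ebs = gb * 0.08 / (31 * 24)
        printf "STOPPED: %dGB EBS = $%.3f/hr; RUNNING: %s = $%.3f/hr + $%.3f/hr = $%.3f/hr\n",
               gb, ebs, type, rate, ebs, rate + ebs
    }'
}

# Example: 1000 GB of EBS attached to an r5.24xlarge
instance_cost_tag 1000 r5.24xlarge 6.048
```

The hourly rate for the instance type has to be looked up and passed in by hand, since prices change over time.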

EC2 Instance Families and Generations

Instance types are pre-made combinations of these customizations, and each falls into one of five categories: General Purpose, Compute Optimized, Memory Optimized, Accelerated Computing, or Storage Optimized. General Purpose has a balanced number of vCPUs (virtual CPU cores), amount of RAM, hard drive storage, and network performance, whereas Memory Optimized has more memory relative to the others.

Within these categories, there are then different families of instances. One general-purpose instance type is called “t2.micro”. In this name, “t” stands for the family of the instance, which in this case is a general-purpose family. The “2” corresponds to the generation of that instance; a higher number means a newer generation (which can also mean increased cost if it is “better” than the older generations). Finally, “micro” stands for the size of the instance. The first link below is the documentation for these instance types and the second is a more concise explanation of all the different groups available:

https://www.amazonaws.cn/en/ec2/instance-types/

https://www.logicata.com/blog/aws-ec2-everything-you-need-to-know-about-ec2-instances/
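As a small illustration of the naming scheme described above, the sketch below splits an instance type name into family, generation, and size. The function name is hypothetical, and any extra letters after the generation number (as in some newer type names) are simply ignored.

```shell
# Hypothetical helper: split an instance type such as "t2.micro" into its parts.
parse_instance_type() {
    prefix=${1%%.*}   # text before the first dot, e.g. "t2"
    size=${1#*.}      # text after the first dot, e.g. "micro"
    # Family = leading letters; generation = the digits that follow them.
    family=$(printf '%s' "$prefix" | sed 's/[0-9].*$//')
    generation=$(printf '%s' "$prefix" | sed 's/^[a-z]*\([0-9]*\).*$/\1/')
    printf 'family=%s generation=%s size=%s\n' "$family" "$generation" "$size"
}

parse_instance_type t2.micro
parse_instance_type r5.4xlarge
```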

After this decision comes the decision of what size of the instance to use.

I have personally chosen and been using the r5 family for assembly computations because it is a Memory Optimized family, meaning it has more memory allocated to each instance than other families. This matters because the assemblers typically require a substantial amount of RAM. For assemblers that are able to run threads on multiple cores, looking at the number of vCPUs is also important. On most instance types each vCPU is one thread of a physical processor core, so 2 vCPUs correspond to 1 physical core (which runs 2 threads).

The following shows the different sizes (and costs) for the family “r5” as of January 20, 2021:

| Instance name | On-Demand hourly rate | vCPU | Memory | Storage | Network performance |
|---|---|---|---|---|---|
| r5.large | $0.126 | 2 | 16 GiB | EBS Only | Up to 10 Gigabit |
| r5.xlarge | $0.252 | 4 | 32 GiB | EBS Only | Up to 10 Gigabit |
| r5.2xlarge | $0.504 | 8 | 64 GiB | EBS Only | Up to 10 Gigabit |
| r5.4xlarge | $1.008 | 16 | 128 GiB | EBS Only | Up to 10 Gigabit |
| r5.8xlarge | $2.016 | 32 | 256 GiB | EBS Only | 10 Gigabit |
| r5.12xlarge | $3.024 | 48 | 384 GiB | EBS Only | 10 Gigabit |
| r5.16xlarge | $4.032 | 64 | 512 GiB | EBS Only | 20 Gigabit |
| r5.24xlarge | $6.048 | 96 | 768 GiB | EBS Only | 25 Gigabit |
| r5.metal | $6.048 | 96 | 768 GiB | EBS Only | 25 Gigabit |
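One way to use these numbers is to pick the smallest r5 size whose memory covers what the assembler is expected to need. The sketch below hard-codes the January 2021 prices listed above, so treat the rates as a snapshot rather than current pricing. The function name is made up for this example.

```shell
# Hypothetical helper: smallest r5 size with at least $1 GiB of memory.
# Columns in the embedded data: name, hourly rate (USD), memory (GiB).
# Prices are the January 20, 2021 values from this page and will drift.
smallest_r5_for() {
    awk -v need="$1" '
        $3 >= need { print $1, "$" $2 "/hr"; found = 1; exit }
        END { if (!found) print "none" }
    ' <<'EOF'
r5.large 0.126 16
r5.xlarge 0.252 32
r5.2xlarge 0.504 64
r5.4xlarge 1.008 128
r5.8xlarge 2.016 256
r5.12xlarge 3.024 384
r5.16xlarge 4.032 512
r5.24xlarge 6.048 768
EOF
}

# Example: an assembly expected to need about 100 GiB of RAM
smallest_r5_for 100
```

If nothing in the family is large enough, the function prints "none", which is a hint to look at a different family or a different assembler.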

For cost savings, one could use older generations of the r family (r3, r4) if they are still available, because the older generations are slightly less expensive. It does seem that Amazon is starting to phase these out (if they haven't already), as they have also introduced r6. Another note is that the P family (GPU-based Accelerated Computing instances) may be of use for future assemblies or analyses if the software you are using can parallelize a large amount of its processing. Right now, however, it seems like memory is the main point of failure for the software we have been using.

The last note regarding EC2 instance choice is the idea of "spot" instances. These are basically Amazon's way of dealing with unused hardware: Amazon offers "spot" instances, which are EC2 instances at a large discount with one main drawback: they can be turned off with a 2 minute notice at any time. This has the potential to create very large savings for long computational problems, but it would require a substantial effort toward making the assembly or analysis software tolerate being stopped at any time.
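To give a feel for what "tolerating being stopped" could mean in practice, here is a minimal sketch of a resumable pipeline wrapper. It is not how any particular assembler works: the step names and the checkpoint file are hypothetical, and a real pipeline would run an actual command where the placeholder comment sits. The idea is only that each finished step is recorded, so a restart skips work that is already done.

```shell
# Hypothetical wrapper: run named steps in order, recording each finished
# step in a checkpoint file so that a restarted run can skip it.
# $1 = checkpoint file, remaining arguments = step names
run_resumable() {
    checkpoint=$1
    shift
    touch "$checkpoint"
    for step in "$@"; do
        # -x matches the whole line, -F treats the step name as plain text
        if grep -qxF "$step" "$checkpoint"; then
            echo "skip $step"
            continue
        fi
        echo "run $step"
        # ... the real command for this step would go here ...
        echo "$step" >> "$checkpoint"
    done
}

# Example (the checkpoint file would need to live on storage that
# survives the interruption):
#   run_resumable ~/assembly.checkpoint trim assemble scaffold
```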

To create an EC2 Instance from an existing AMI

On step 3 above for creating a new instance, simply select “My AMIs” on the left sidebar. Then choose the correct pre-made AMI with the assembly or QC software you need. From here, follow the instructions as above. Now when you launch the instance, it will have all the software already installed. To find the specific AMI ID for any of our pre-made AMIs, go to the software's respective page on this wiki.
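If you prefer the command line to the web console, the same launch can in principle be done with the AWS CLI's `aws ec2 run-instances` command. The sketch below only prints the command so you can inspect it first. The function name is hypothetical, the AMI ID in the example is a placeholder (the real IDs are on each software's wiki page), and it assumes the AWS CLI is installed and configured, which this page does not cover.

```shell
# Hypothetical helper: print (not run) an AWS CLI launch command.
# $1 = AMI ID, $2 = instance type, $3 = key pair name, $4 = value for the User tag
print_launch_command() {
    printf 'aws ec2 run-instances --image-id %s --instance-type %s --key-name %s --count 1 --tag-specifications %s\n' \
        "$1" "$2" "$3" "'ResourceType=instance,Tags=[{Key=User,Value=$4}]'"
}

# Example with a placeholder AMI ID:
print_launch_command ami-0123456789abcdef0 r5.4xlarge flints-keypair flint
```

Only the User tag is shown here; the other tags from the list above would be added to the same Tags list.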

Connecting to an instance

Let’s SSH connect to your EC2 instance to do computing!

  1. Go to the EC2 dashboard and then into the Instances dashboard by selecting Instances on the left-hand sidebar. Check your instance status and make sure it’s running.
  2. Get the public DNS name of the instance. It will look like this: ec2-54-67-42-117.us-west-1.compute.amazonaws.com. It changes every time you stop and start the instance, so if you are logging in for a second time (or later) and it is not working, check that you are still using the correct name.
  3. Find the user name for your instance from the list of usernames below ("ec2-user," "ubuntu," "centos," etc.).
  4. Enable inbound SSH traffic from your IP address to your instance. This will most likely already be done (this is the security group created when we created the instance), but if you are not allowed to SSH in, this is something to check. The security group must allow SSH access from your computer, which is why we currently have it set to allow access from anywhere (as long as you have the key pair).
  5. Locate the private key. I keep mine in a folder on my desktop. In a terminal, navigate into the folder containing your key pairs using cd /path/to/keypair-folder and run pwd. This gives you the path to the current folder. Next, append the key pair file name to this path. For example:
  • PATH = /Users/flintmitchell/Desktop/AWS/instance-test/
  • name of keypair = flints-keypair.pem
  • therefore full path with keypair = /Users/flintmitchell/Desktop/AWS/instance-test/flints-keypair.pem
  6. Set the permissions of your private key. This only needs to be done once for each key pair when it is saved on your computer. If you use an SSH client on a macOS or Linux computer to connect to your Linux instance, use the following command to set the permissions of your private key file so that only you can read it: chmod 400 my-key-pair.pem, where my-key-pair.pem is your unique key pair filename. For example, one of my key pairs is named 'flints-keypair.pem', so for me it would be chmod 400 flints-keypair.pem. ‘chmod’ is a command that sets the read, write, and execute permissions on a file. If you do not set these permissions, you won't be able to connect to your instance using this key pair, because ssh refuses to use a private key file that other users can read.

  7. Now, wait for your instance to start running. This may take a couple of minutes and can be checked by confirming that Instance State says Running and Status Check says 2/2 checks passed. Once this is complete, in a terminal window, use the ssh command to connect to the instance:

    ssh -i /path/my-key-pair.pem ec2-username@my-instance-public-dns-name

For example, an Ubuntu test instance for me (using the example public DNS name from step 2) might look like: ssh -i /Users/flintmitchell/Desktop/AWS/keypairs/flints-keypair.pem ubuntu@ec2-54-67-42-117.us-west-1.compute.amazonaws.com
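If the connection is refused because of the key file, a quick check of the permissions set with chmod 400 can help. The helper below is a hypothetical sketch that succeeds only when the file is readable by its owner and by nobody else, which is exactly what chmod 400 produces.

```shell
# Hypothetical helper: succeed only if the key file has mode 400
# (read-only for the owner, no access for anyone else).
# $1 = path to the .pem file
key_is_private() {
    # The first 10 characters of "ls -l" are the file type and permission bits.
    perms=$(ls -l "$1" | cut -c1-10)
    [ "$perms" = "-r--------" ]
}

# Example:
#   key_is_private /path/my-key-pair.pem || chmod 400 /path/my-key-pair.pem
```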

Make sure that the ssh -i is all in lowercase. Depending on the operating system, the ec2-username will be replaced with one of the following:

For Amazon Linux 2 or the Amazon Linux AMI, the user name is ec2-user.

For a CentOS AMI, the user name is centos.

For a Debian AMI, the user name is admin.

For a Fedora AMI, the user name is ec2-user or fedora.

For an Ubuntu AMI, the user name is ubuntu.

from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesConnecting.html#TroubleshootingInstancesConnectingMindTerm
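The user name list above can be folded into a small lookup so that the ssh command is assembled the same way every time. This is a sketch with made-up function names and OS keywords. For Fedora it picks fedora, although ec2-user may be the right choice on some images.

```shell
# Hypothetical helper: default login name for a few common AMI families.
# $1 = OS keyword (chosen for this example, not an AWS identifier)
default_user_for() {
    case "$1" in
        amazon-linux) echo ec2-user ;;
        centos)       echo centos ;;
        debian)       echo admin ;;
        fedora)       echo fedora ;;   # some Fedora AMIs use ec2-user instead
        ubuntu)       echo ubuntu ;;
        *)            echo "unknown OS: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the ssh command for an instance.
# $1 = path to private key, $2 = OS keyword, $3 = public DNS name
ssh_command() {
    user=$(default_user_for "$2") || return 1
    printf 'ssh -i %s %s@%s\n' "$1" "$user" "$3"
}

ssh_command /path/my-key-pair.pem ubuntu my-instance-public-dns-name
```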

SCP

For copying data from your computer to the EC2 instance or vice versa, you can use the scp command.

scp -i /path/my-key-pair.pem /local_path/file.filename ec2-username@my-instance-public-dns-name:ec2_path/destination

For example, copying the text file sequencing_data_XY.txt from the desktop of my local computer to the home directory of an EC2 instance running Ubuntu, using the key pair flints-keypair.pem, would look like this for me:

scp -i /Users/flintmitchell/Desktop/AWS/keypairs/flints-keypair.pem /Users/flintmitchell/Desktop/sequencing_data_XY.txt ubuntu@ec2-54-67-42-117.us-west-1.compute.amazonaws.com:~/
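Copying in the other direction (from the instance back to your computer) just swaps the source and destination arguments. The sketch below prints the scp command for either direction. The function name and the example file names are hypothetical.

```shell
# Hypothetical helper: print the scp command for an upload or a download.
# $1 = direction ("upload" or "download"), $2 = path to private key,
# $3 = user@public-dns-name, $4 = local path, $5 = path on the instance
scp_command() {
    case "$1" in
        upload)   printf 'scp -i %s %s %s:%s\n' "$2" "$4" "$3" "$5" ;;
        download) printf 'scp -i %s %s:%s %s\n' "$2" "$3" "$5" "$4" ;;
        *)        echo "direction must be upload or download" >&2; return 1 ;;
    esac
}

# Example: fetch a (hypothetical) result file from the instance's home directory
scp_command download /path/my-key-pair.pem ubuntu@my-instance-public-dns-name ./results '~/assembly.fasta'
```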

EBS storage help

If you didn't add enough EBS storage to an instance, you can modify it by following these instructions: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requesting-ebs-volume-modifications.html

Then, to extend the newly modified storage to the instance, use lsblk or df -H to find the name of the partition that needs to be extended to include the newly allocated storage. You will be able to tell because the main "root" device (usually "nvme0n1" for me) will show the size of the newly allocated storage, while its partition (usually "nvme0n1p1" for me) will still show the old EBS size. Since we want the partition to reflect the new storage, we use the command:

sudo growpart /dev/volume-name 1

ex. sudo growpart /dev/nvme0n1 1

Now, to my understanding, the 'file system' is the structured collection of files that exists within a given partition (meaning the software and data we have on the instance), and it now needs to be extended to fill the partition we just grew. For the ext4 file system that the Ubuntu AMIs typically use, the command is:

sudo resize2fs /dev/partition-name

ex. sudo resize2fs /dev/nvme0n1p1

Now, using df -H, you can see that the file system has been extended to the entirety of the newly allocated storage.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
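Spotting the partition that still has the old size can be scripted. The sketch below reads lsblk output (name, size in bytes, type, parent device) and prints the growpart and resize2fs commands for any partition that is much smaller than its disk. The function name and the 90% threshold are arbitrary choices for illustration, and the printed commands should be read over before running them.

```shell
# Hypothetical helper: read "NAME SIZE TYPE PKNAME" rows on stdin and print
# the resize commands for partitions smaller than 90% of their parent disk.
# The 90% cut-off is an arbitrary heuristic chosen for this example.
suggest_resize_commands() {
    awk '
        $3 == "disk" { disk_size[$1] = $2 }
        $3 == "part" && ($4 in disk_size) && $2 < 0.9 * disk_size[$4] {
            match($1, /[0-9]+$/)                 # trailing digits = partition number
            print "sudo growpart /dev/" $4, substr($1, RSTART)
            print "sudo resize2fs /dev/" $1      # resize2fs assumes an ext file system
        }
    '
}

# On the instance, feed it real data with:
#   lsblk -b -n -l -o NAME,SIZE,TYPE,PKNAME | suggest_resize_commands
```

Nothing is printed when every partition already fills its disk.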


More information:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connection-prereqs.html

https://dearsikandarkhan.medium.com/files-copying-between-aws-ec2-and-local-d07ed205eefa
