Introduction to Amazon Web Services (AWS)
AWS is a pay-as-you-go cloud-computing platform hosted by Amazon. It offers a large variety of services that can be separated broadly into storage, computing, and networking. With data centers around the world, it has effectively unlimited on-demand compute and memory capacity. Common examples of AWS applications are hosting websites, securely storing and managing large files or databases, and running large or complex computations.
To use AWS, you combine a series of services, which can be thought of as blocks. Each block has a dedicated utility. For example, S3 is AWS's standard storage service, but there are other storage services for different purposes, such as AWS Glacier and AWS Elastic Block Store (EBS). AWS Glacier is used mostly for storing data that doesn't need immediate access. This makes it cheaper than S3, but the latency (the delay between requesting data and being able to download it) is longer. AWS EBS is storage that is mounted directly to the instance (the virtual computer we use), and it has lower latency and higher input/output operations per second (IOPS) than S3. Depending on how it is used, it may be slightly more expensive per second of usage, but it may also reduce the amount of time it needs to be used. The user (you and I) gets to choose whichever storage "block" (S3, Glacier, or EBS) best fits our purpose and cost. This is just one small example; each category (storage, computing, networking) has many "blocks" for dozens of different niche services.
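To make this concrete, here is a minimal sketch using boto3 (AWS's Python library). The bucket and file names are hypothetical, and it uses S3's Glacier storage class as a stand-in for the standalone Glacier service - the cost/latency trade-off is the same idea:

```python
import boto3

s3 = boto3.client("s3")

# Standard S3: immediately accessible, higher cost per GB stored.
s3.upload_file("reads.fastq.gz", "gbi-genomics-data", "raw/reads.fastq.gz")

# Glacier storage class: much cheaper per GB, but retrieval takes
# minutes to hours instead of being instant.
s3.upload_file(
    "reads.fastq.gz",
    "gbi-genomics-data",
    "archive/reads.fastq.gz",
    ExtraArgs={"StorageClass": "GLACIER"},
)
```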
What we need from AWS
To create our bioinformatics pipeline, all that is required is the correct set of blocks put together so that:
1. There are security and administrative protocols for giving users permission to do genome assembly while retaining a secure virtual environment.
2. There is a location where the data can be analyzed (this is like a virtual computer with CPUs, RAM, and hard-drive storage).
3. Data can be uploaded and stored in this location.
4. There is a service to check on the status of the analysis and to record information on how the server is running in case of errors.
5. There is a location for the returned data to be sent back and downloaded for us to look at (a short code sketch after this list illustrates steps 3 and 5).
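Here is a minimal sketch of steps 3 and 5 with boto3, assuming a hypothetical S3 bucket named gbi-assembly and placeholder file names:

```python
import boto3

s3 = boto3.client("s3")
bucket = "gbi-assembly"  # hypothetical bucket name

# Step 3: upload the sequencing data for analysis in the cloud.
s3.upload_file("sample_reads.fastq.gz", bucket, "input/sample_reads.fastq.gz")

# Step 5: once the assembly finishes, pull the results back down.
s3.download_file(bucket, "output/assembly.fasta", "assembly.fasta")
```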
One large benefit of cloud computing is the ability to dedicate a greater share of compute/storage power so the analysis runs more efficiently. If the analysis software itself allows multiple operations to run in parallel, a batch analysis can be hosted. This uses the same "amount" of services (and therefore costs about the same), but dramatically reduces the time required to do the computation. This block would come in between (3) and (4) above.
AWS Services we can use
The AWS services we may use for the purposes mentioned above (1-5) are:
- Identity and Access Management (AWS IAM) - This is how we control who can access what within the AWS account and the type of security protocols. Certain users will have administrative access, while others will have more limited access. Most likely Melis will be in charge of the account, with students having limited access for uploading data and using the pre-assembled virtual services to do genome assembly/analysis. Since this is a pay-as-you-go service, it is necessary to be careful about which tasks are created and submitted to the cloud - every second of use must be paid for. (A minimal policy sketch appears after this list.)
- Amazon Elastic Compute Cloud (AWS EC2) - This can be thought of as a virtual computer. EC2 servers are called "instances," and they have dedicated computation and memory abilities. When creating EC2 instances you can let AWS decide the size and type of these abilities, or you can specify them yourself. Configurations include the type and number of processors, the amount and type of memory, the physical location (you can assign your instance to a geographic region; we will choose one on the west coast to get the best latency), and bandwidth. We will install the computational software we want to use on these instances, and they will interact with our storage (AWS EBS or AWS S3) to access our data!
- An important note with EC2 is that there are discounts for certain usage patterns. One of these is what are called "EC2 Spot Instances." Spot instances are available when the AWS cloud has spare server capacity that is not being used. Since that capacity would otherwise sit idle, AWS offers large discounts for jobs submitted to it. However, AWS reserves the right to "pull the plug" on these discounted instances at any time with a 2-minute warning. This means any analysis work that is very quick, or that can be interrupted and continued later, is best run this way. This will most likely all be handled on the back end and not by students, but it is valuable to understand. (A launch sketch after this list shows how a spot request differs from an on-demand one.)
- Storage (AWS EBS and AWS S3), covering (3) and (5) above - Since we will most likely be storing our data locally, the only time we need it in the cloud is while it is being analyzed/assembled. The best storage for this will be EBS, as it has the lowest latency and is mounted directly to the EC2 instance (a volume-attachment sketch follows this list). If in the future there is a need to simply store this data in the cloud, we would then choose between S3 and Glacier. A description of storing data in S3 is given just in case it needs to be used.
- AWS CloudWatch, covering (4) above - This is an optional service that allows us to pinpoint possible errors with the EC2 instances. It gives real-time updates on whether instances are running or not (a metric-query sketch follows this list).
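As referenced in the IAM bullet above, here is a minimal sketch (with boto3, and a hypothetical student user, policy name, and bucket) of granting a user nothing more than read/write access to the project bucket:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy: students may read and write objects in the
# project bucket, and nothing else in the account.
student_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::gbi-assembly/*",
    }],
}

iam.put_user_policy(
    UserName="gbi-student",
    PolicyName="gbi-student-s3-access",
    PolicyDocument=json.dumps(student_policy),
)
```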
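For the EC2 and spot-instance bullets, a minimal launch sketch with boto3. The AMI ID is a placeholder, the memory-heavy instance type is only a guess at what assembly work might need, and in practice you would pick one of the two calls, not both:

```python
import boto3

# Pin the region to the US west coast, as discussed above.
ec2 = boto3.client("ec2", region_name="us-west-1")

launch_args = dict(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="r5.2xlarge",        # memory-heavy type (an assumption)
    MinCount=1,
    MaxCount=1,
)

# On-demand instance: full price, runs until we stop it.
ec2.run_instances(**launch_args)

# The same launch as a spot instance: discounted, but AWS may
# reclaim it at any time with a 2-minute warning.
ec2.run_instances(
    **launch_args,
    InstanceMarketOptions={"MarketType": "spot"},
)
```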
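For the storage bullet, a sketch of creating an EBS volume and attaching it to a hypothetical running instance; once attached, it can be formatted and mounted like a local disk:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")

# Create a 500 GiB volume in the same availability zone as the instance.
volume = ec2.create_volume(
    AvailabilityZone="us-west-1a",
    Size=500,
    VolumeType="gp3",
)

# Wait until the volume is ready, then attach it to the instance,
# where it appears as a block device (e.g. /dev/sdf).
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```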
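For the CloudWatch bullet, a sketch that pulls an instance's average CPU utilization over the last hour - a quick way to check whether an assembly is still making progress:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,             # one data point per 5 minutes
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```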
Other possible services that aren’t priorities but might be good to know about are:
- AWS Batch. As previously mentioned, this would allow the genome assembly/analysis software to run parallel "batch" jobs (computations), making them more efficient. However, this relies on the assembly software being able to run that way (a job-submission sketch appears after these bullets).
- AWS Lambda. This is a service for running code on the AWS Cloud without managing a server. You can submit code that is triggered by events within the cloud. For example, if there is a common trend in the alignment/assembly results that we want to do something with (record it, edit it, analyze it, etc.), we can write code to do that (in a variety of languages, including Python) and have that code be triggered by the genome assembly finishing (a minimal handler sketch appears below).
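A sketch of submitting several assembly jobs to AWS Batch with boto3; the queue name, job definition, sample names, and assemble.sh command are all hypothetical:

```python
import boto3

batch = boto3.client("batch", region_name="us-west-1")

# Submit one assembly job per sample; AWS Batch schedules them to
# run in parallel on whatever compute it has provisioned.
for sample in ["sample_A", "sample_B", "sample_C"]:
    batch.submit_job(
        jobName=f"assembly-{sample}",
        jobQueue="gbi-assembly-queue",    # hypothetical job queue
        jobDefinition="gbi-assembler:1",  # hypothetical job definition
        containerOverrides={"command": ["assemble.sh", sample]},
    )
```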
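And a minimal Lambda handler sketch, assuming the function is wired to S3 event notifications so it fires whenever a new result file lands in the bucket:

```python
import json

def lambda_handler(event, context):
    # S3 event notifications list the bucket and key of each new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New assembly result: s3://{bucket}/{key}")
        # ...record it, edit it, or kick off further analysis here...
    return {"statusCode": 200, "body": json.dumps("done")}
```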
Further understanding
If you're seriously interested in getting a better understanding of AWS and how to use it, there are a multitude of resources online. One is the following one-day course hosted by Amazon itself, which is free and runs virtually most days: https://www.aws.training/SessionSearch?pageNumber=1&courseId=10012
There may also be a way to enroll in this program for free as a student: https://aws.amazon.com/education/awseducate/educators/ It looks like an interactive environment and course for learning the ins and outs of cloud infrastructure/computing with AWS. Neither of these is necessary to look into unless you find a strong curiosity or reason!