Section 4: Amazon Machine Images (AMI), the GBI AMI, and the GBI AWS Github as a Resource - Green-Biome-Institute/AWS GitHub Wiki
Go back to Section 3: Simple Storage Service (S3) and the AWS CLI
Learning Points for Section 4: Amazon Machine Images (AMI), the GBI AMI, and the GBI AWS Github as a Resource
- An Amazon Machine Image is a snapshot of an EC2 instance that contains an operating system and pre-installed software/data.
- You can launch an EC2 instance from an Amazon Machine Image by following the normal protocol to create an EC2 instance, but choosing the correct AMI in the first step of the process.
- The GBI AMI contains dozens of bioinformatics software packages, ready for you to use!
- The GBI AWS Github has tons of information regarding all the things you have learned in these tutorials in further detail (and more!).
Amazon Machine Images (AMI)
The concept of an Amazon Machine Image, or AMI, just builds upon our experience with EC2 instances. The reason I mention this at the end is because I think it leads into the conclusion of this module quite well, which is the platform that has been created for you as students to do your own work and research on AWS!
What is an AMI?
First and foremost, what is it?
An Amazon Machine Image is just like a photograph. Except instead of the photo containing a view from a mountain, a sunset, or your dog, it's of an EC2 instance! (just as exciting, right??) This “snapshot” of an EC2 instance contains lots of information describing that instance at the time you took it.
For example, it contains the operating system you were running, the same or more (but not less!) storage capacity, the software you have installed, the files you’ve downloaded or worked on, and the preferences you’ve set up.
Starting an EC2 instance from the GBI AMI
Start creating an EC2 instance like you normally would, by going to the EC2 dashboard (type EC2 into the search bar at the top and select EC2 : Virtual Servers in the Cloud). Then select the large orange button in the top right, Launch Instances. Now, we passed over this the first time, but you’ll see that Step 1 is actually called “Step 1: Choose an Amazon Machine Image (AMI)”! This is because the options we are choosing from here are themselves Amazon Machine Images, just ones with only the operating system pre-installed (we are using Ubuntu, a Linux distribution). To choose from our own AMIs, go to the list of options on the left side of the screen and select “My AMIs”. You’ll now see the AMIs that I have created for the GBI, ready to be started. Press “Select” next to the first one. Now you just have to follow the regular instructions for building an EC2 instance from Module 2 of this tutorial - that’s it! There is nothing more to it. When you start up this instance from an AMI, it will have all the preinstalled software and files that it was saved with. We’ll look at one specific example, the GBI AMI, at the end of this module.
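If you would rather script this than click through the console, the same steps can be sketched in Python with boto3, the AWS SDK for Python. This is a minimal sketch under stated assumptions, not the GBI's official method: it assumes boto3 is installed and your AWS credentials are already configured, and the name filter, instance type, and key pair name are hypothetical values you would supply yourself. `Owners=["self"]` asks for AMIs owned by your own account, which roughly corresponds to the “My AMIs” list in the console.

```python
"""Sketch: launch an EC2 instance from one of your account's own AMIs."""


def pick_image(images, name_contains):
    """Return the newest image whose name contains the given text, or None."""
    matches = [
        img for img in images
        if name_contains.lower() in img.get("Name", "").lower()
    ]
    # CreationDate is an ISO 8601 string, so string order is date order.
    return max(matches, key=lambda img: img["CreationDate"], default=None)


def build_launch_request(image_id, instance_type, key_name):
    """Build the keyword arguments for launching exactly one instance."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch_from_ami(name_contains, instance_type, key_name):
    """Find a matching AMI owned by this account and launch one instance."""
    import boto3  # imported here so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    # List only the AMIs owned by the account whose credentials are in use.
    images = ec2.describe_images(Owners=["self"])["Images"]
    image = pick_image(images, name_contains)
    if image is None:
        raise LookupError(f"no AMI with {name_contains!r} in its name")
    request = build_launch_request(image["ImageId"], instance_type, key_name)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]
```

For example, `launch_from_ami("gbi", "t2.micro", "my-key")` would launch one instance from the newest AMI with "gbi" in its name and return the new instance ID. The instance type and key pair name there are placeholders, and launching an instance incurs charges on the account, so check the cost first as described in the billing section.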
The last two things you need to be aware of before we move on to a full walkthrough are:
- The GBI Github
- The GBI AWS AMI
The GBI Github
First the GBI Github.
You’re clearly already aware that it exists because the tutorials we’ve been walking through are on the Github. However there is substantially more information on here than just these tutorials.
First, let’s navigate to the homepage of the GitHub repository.
https://github.com/Green-Biome-Institute/AWS
Here you can see a brief README file. A README, as you can now appreciate after the command line tutorial, is a text file, which talks about the software or GitHub repository it is in. Sometimes it is brief, like this one. Sometimes, the entirety of the manual page for a software is in the README. That said, when you look at a new software package or GitHub repository, it is helpful to look at the README and see if there is anything valuable in it for you to learn.
Next, we can see that above the README, there is a list of files. These are files that have been uploaded to the AWS GBI Github as they are relevant to the work being done here. When you clone this GitHub repository, all of these files (including the README), will be copied into the directory you are cloning the repo into. Some of the files here deal with genome assemblers that we will be using.
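Cloning is normally a one-line `git clone` in the terminal. Since later steps are easy to automate in Python, here is a small sketch that runs the same command from a script. It assumes `git` is installed on your machine or EC2 instance, and the destination folder name `AWS` is just an illustrative choice.

```python
import subprocess

# URL of the repository home page given above.
REPO_URL = "https://github.com/Green-Biome-Institute/AWS"


def clone_command(url, destination):
    """Return the git command that copies the repository into destination."""
    return ["git", "clone", url, destination]


def clone_repo(url=REPO_URL, destination="AWS"):
    """Run git clone; raises CalledProcessError if git reports a failure."""
    subprocess.run(clone_command(url, destination), check=True)
```

Calling `clone_repo()` creates a folder named `AWS` in your current directory containing the README and the other files listed on the repository page.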
Let’s navigate back to the wiki for this Github repo.
https://github.com/Green-Biome-Institute/AWS/wiki
Below the tutorials, we see there are a couple dozen links to pages that deal with topics we have already discussed (like AWS services) as well as pages dedicated to software we have not yet mentioned. Let’s take a look at these.
First, click on the EC2 instance page.
https://github.com/Green-Biome-Institute/AWS/wiki/AWS-Elastic-Cloud-Computing-(EC2)-Instances
Scrolling around here you can see that much of the information from the tutorial is present. We can only cover the basics in this tutorial, so there is some more detailed information regarding the EC2 service here in its dedicated page.
Next, let’s navigate to the TMUX / screen page.
We only introduced tmux in our tutorial, but there is another program, called “screen”, that can also be used to run programs behind the scenes. On this page there is more information regarding tmux and tmux commands, as well as a description of screen and how to use it.
Now, let’s move on below to the genome assembly software. There are a bunch of programs here, just in case you’d like to play around with different ones. Let’s click on one of the main assemblers that you might already be familiar with: ABySS2.
https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-ABySS-on-EC2
Within the ABySS2 page, you’ll find it is broken up into a couple of different sections. First, we have a brief description of ABySS and information relevant to it. Then, we have a full set of steps that show you how to build an EC2 instance, install the dependencies of ABySS onto it, install ABySS onto it, and then use ABySS! This includes the basic commands and some quirks of the assembler that I may have found. Then, it will tell you what the results files will look like and how to download those results off of the EC2 instance.
This is actually one of the most important things you can take away from this tutorial module: the Github is a resource for YOU! When you have questions regarding an AWS service, a CLI command, or a relevant bioinformatics program, come check out the Github and see if there is information regarding it! There will likely be something that can point you in the right direction or information that will help you in your search for further answers.
The GBI AMI
Next, the GBI AWS AMI.
The last thing to know about is that within the GBI account, there is an EC2 AMI that is preconfigured with tons of software for you to use! It includes ALL of the following software in working condition, with dependencies preinstalled. All you have to do is build an EC2 instance, just like we’ve now seen twice in this tutorial, starting with this AMI, and when you log in, you will have your own virtual computer set up to do your computations and analysis. This is the list of software:
- SOAPdenovo2
- MaSuRCA
- QIIME2
- VEGAN
- SPAdes
- ANGSD
- GATK
- Dada2
- ITSxpress
- CD-HIT
- BLAST
- MEGAN
- MOTHUR
- ABySS
- Canu
- Flye
- NECAT
- Shasta
- SparseAssembler
- bcftools
- Fast toolkit
- WENGAN
- BUSCO
- QUAST
- Trinity
- Bowtie2
- Picard
- samtools
- vcftools
- bamtools
- bedtools
- Braker2
- BWA
- Cap3
- Cutadapt
- Fastqc
- Hisat2
- htseq
- Plink
- Genometools
- ngstools
- Trimmomatic
- Trim_galore
- Trinotate
- Momi2
- Dadi
- ChloroExtractor
- NOVOPlasty
- CAP
- Fast-Plast
- GetOrganelle
- IOGA
- Org.Asm
- Unicycler
- Jellyfish
There is also a series of software packages relevant to transposable elements that will be uploaded and documented on this AMI as well. These will be available by Dec 1, 2021. (If it is past that date and I haven’t updated this page yet, please remind me to do so!)
Each of these packages will be documented here on the Github and on the AMI itself with a brief description of the software, links to find more information about it, the command you can use to find further information about the software, and an example command showing how to run it!
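Once you have logged into an instance started from the GBI AMI, a quick way to confirm that a tool is ready to use is to check whether its command can be found on your PATH. Here is a minimal sketch of that check. The command names below are the usual executable names for a handful of the listed tools, but how the AMI actually exposes each tool (directly on PATH, through an environment you need to activate, and so on) is an assumption, so treat a "missing" result as a prompt to check the tool's wiki page rather than as proof that it is not installed.

```python
import shutil

# Assumed executable names for a few of the listed tools (illustrative only).
EXPECTED_COMMANDS = ["samtools", "bcftools", "bwa", "bowtie2", "fastqc"]


def find_tools(commands):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


def missing_tools(commands):
    """Return the command names that could not be found on PATH."""
    return [name for name, path in find_tools(commands).items() if path is None]
```

Running `missing_tools(EXPECTED_COMMANDS)` on the instance returns a list of the commands that were not found, which is empty if all of them are available.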
Review Questions
What is an Amazon Machine Image?
- A snapshot of an EC2 instance that contains an operating system and pre-installed software/data.
How do you build EC2 instances from an AMI?
- During the first step of the normal "Launch EC2 Instance" process, instead of choosing a plain operating system image, you select the AMI you want from the "My AMIs" list.
What is on the GBI AMI?
- Dozens of bioinformatics software packages for managing and analyzing sequencing data!
What is on the GBI Github wiki?
- Information regarding the GBI AMI, documentation for the software most relevant to our work, and documentation for AWS itself!
Conclusion
Okay! Let’s look at a high-level review of what you now know how to do (or at least are aware of how to review and implement):
You can:
- Open up and navigate the CLI
- Execute commands and third-party software on the CLI
- Get help regarding commands and software, both from the internet and from man pages within the CLI itself
- Log into and navigate around AWS
- Create private virtual computers called EC2 instances on AWS
- Store data and files as S3 objects within S3 buckets on AWS
- Understand permissions requirements of AWS and how they apply to your uses
- Understand the billing aspects of AWS and how to find out how much money will be charged to the GBI AWS account for the services you want to use before you use them
I’d say that’s pretty fantastic! In the next tutorial module we’re going to put this ALL together in one big walkthrough of genome assembly and QC, starting from just having genetic sequencing results and nothing else, to having a fully assembled and annotated genome! You are going to do it all by typing each command yourself, proving that you are entirely capable of becoming a bioinformatics master.
Onwards!
Move on to Section 5 : Setting up EC2 instance and Uploading Data