Home - Green-Biome-Institute/AWS GitHub Wiki
Welcome to the AWS wiki!
Here you will find information on the following:
- Using Amazon Web Services for the purpose of genome assembly,
- Plant de novo genome assembly using short read, long read, and hybrid short/long read genome assemblers,
- Genome assembly QC software and metrics,
- Comparisons of a variety of genome assemblers,
- Tutorials for the command line and implementing genome assembly,
- Literature related to plant genetics, de novo genome assembly, assembly in the cloud, and assembly QC!
Tutorials / Trainings:
- Command Line Tutorials!
- Amazon Web Services Tutorials!
- Assembly Tutorial (Using a Chloroplast Genome)!
- Trimming, Filtering, Genome Size Estimation!
Informational Content:
- Introduction to Amazon Web Services (AWS)
- Identity and Access Management (IAM) users and logging into the Green Biome Institute AWS account
- Organizing users
- Budgeting resources
- Navigating the AWS Website
- Tagging resources
- Metadata of AWS resources
- Cost allocation / monitoring
- Billing, cost monitoring, and budgets
- AWS Elastic Compute Cloud (EC2) Instances
- Creating a new instance, reviewing the existing instances
- Connecting to an instance
- EC2 Instance Connect
- SSH / CLI (Command Line Interface)
- Uploading software to an AWS Instance
- Running software in an AWS Instance
- Connecting to the S3 bucket with the relevant data
- Where to find/download the results
- Allocating more EBS storage to an existing EC2 instance
- Data Storage with AWS S3
- Creating S3 buckets
- Uploading data
- CLI commands
- How do we store data at the GBI?
- Recording time, CPU, and memory usage while running a program
- Filetypes and File Compression
- Having processes continue to run after leaving the command line interface / terminal
- Using the SRA toolkit for downloading SRA files
- SNS messages for alerts
- Walkthrough of assembly pipeline
- Estimating what k-mer parameter to use for a de Bruijn graph based assembly (ABySS, SOAPdenovo, Velvet)
- Checking what old AWS EBS snapshots contain (and deleting them)
- Chloroplast and Mitochondrial Assemblies from reads
Genome Assembly!
Per Assemblathon 2 (1), "An assembler may produce an excellent assembly when judged by one approach, but a much poorer assembly when judged by another... Even when an assembler performs well across a range of metrics in one species, it is no guarantee that this assembler will work as well with a different genome... Comparisons between the performance of the same assembler in different species are confounded by the different nature of the input sequence data that was provided for each species." It is therefore important that we gain experience with several different assemblers and learn where each works well and where they complement one another. This is especially true in plant genomics, where there is substantially less prior information to draw on than in other fields of genomics. Many more assemblers are installed on the GBI AMI and documented on this GitHub, but to keep things simple, the pages below cover the de novo genome assemblers that appear to me to be the ones we are most likely to use in AWS:
- Short read:
- Long read:
- Mixed short/long read:
- Assembly Polishing
- Racon
- Medaka
- Assembly QC tools
- Annotation tools
- RepeatMasker
- Augustus
- MAKER
- BLAST
- Final Checks
- Stopping instances and unmounting storage
- Deleting S3 buckets
- Making sure nothing is running/being paid for
Relevant software, papers, and resources
(1) Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., Chitsaz, H., Chou, W. C., Corbeil, J., Fabbro, C. Del, Docking, R. R., Durbin, R., Earl, D., Emrich, S., Fedotov, P., … Korf, I. F. (2013). Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2(1), 1–31. https://doi.org/10.1186/2047-217X-2-10