Assembling with wtdbg2 on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page will help you run wtdbg2 to assemble a genome in AWS.
Wtdbg2 on Ubuntu 20.04
Note: Because we are using a Linux-based OS, we need to use an ‘x86’ architecture. And remember, if you ever have questions that aren't answered here, the software's homepage or GitHub repository will usually have documentation on its use. Wtdbg2 specifically has some great information on its GitHub. There is a link in the resources at the bottom of this page.
If you are using an instance that is already set up to run wtdbg2, start at step 9. (04/20/21) The current custom EC2 wtdbg2 AMI for GBI has the ID ami-06d73133d1e76ae21 and name GBI_ wtdbg2Assembler_Ubuntu_x86_r5.xlarge. To create an instance from this, follow the instructions on the EC2 page.
If you are starting a brand new instance that does not have wtdbg2 and its dependencies installed yet, go to the EC2 page to build and launch the new instance, then follow all of these steps:
- Start Ubuntu Server 20.04 Instance with a 64-bit (x86) processor.
- Log in through terminal:
ssh -i /path/to/keypairs/keypair.pem [email protected]
- Ex.:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
- If you are using S3, install awscli to gain access to S3 storage buckets:
sudo apt install awscli
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
- example:
mkdir my_genome_assembly
- Copy data from local or S3 to your data folder:
- S3:
aws s3 cp s3://[bucket-name]/[desired-file] [path/to/instance/location]
- SCP:
scp -i /path/my-key-pair.pem /local_path/file.filename [email protected]:ec2_path/destination
- Before downloading wtdbg2, update apt and install the tools wtdbg2 needs to build on the instance:
- Update the apt package lists:
sudo apt update
- Upgrade the installed packages:
sudo apt upgrade
- Install gcc:
sudo apt install gcc
- Install make:
sudo apt install make
- Install zlib (the -dev package includes the headers needed to compile against it):
sudo apt install zlib1g-dev
- Install git:
sudo apt install git
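Before moving on, it can help to confirm that the build tools from this step are actually on the PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name, not part of wtdbg2 or apt.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the tools installed in the step above.
missing="$(missing_tools gcc make git)"
if [ -n "$missing" ]; then
  echo "Still missing: $missing"
else
  echo "All build tools found"
fi
```

If anything is reported as missing, re-run the matching sudo apt install command before building wtdbg2.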
- Download wtdbg2 using git and set it up
git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make
- Make sure you are still in the wtdbg2 folder (the cd wtdbg2 in the previous step puts you there) and then add it to your PATH using:
PATH=$PATH:$(pwd)
- Now you will be able to use the commands wtdbg2 and wtpoa-cns to do assembly and consensus.
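A PATH change made on the command line only lasts for the current login session, so it needs to be repeated after reconnecting. The sketch below adds the folder only if it is not already listed and then checks that both commands resolve. It assumes wtdbg2 was cloned into your home directory (~/wtdbg2); add_to_path is a hypothetical helper name.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already listed.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

# Assumption: wtdbg2 was cloned and built in ~/wtdbg2. Adjust if it lives elsewhere.
add_to_path "$HOME/wtdbg2"
export PATH

# Report whether each command can now be found.
for cmd in wtdbg2 wtpoa-cns; do
  command -v "$cmd" >/dev/null 2>&1 || echo "$cmd not found; check the folder and that make succeeded"
done
```

To keep the setting across sessions, the same PATH line can be added to ~/.bashrc.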
- Wtdbg2 will save its output files in your current working directory, so make a folder for the results and enter it:
mkdir [result-foldername] && cd [result-foldername]
- According to the documentation, "wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns. Executable wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". Executable wtpoa-cns takes this file as input and produces the final consensus in FASTA."
- So to do an assembly you will first invoke the wtdbg2 command. There are several command flags that we need to set for the assembly. If you want a full list of them, simply type in wtdbg2 with nothing after it.
- -x ont selects a set of pre-set parameters. We will use ont for Oxford Nanopore data, but there are other presets for PacBio reads.
- -g [genome length] gives the estimated genome size (use suffixes k, m, g for Kb, Mb, and Gb, respectively).
- -t [# threads] gives the number of threads available for the assembly or consensus (entering 0 will allocate all available threads, otherwise give the maximum allowed in your EC2 instance).
- -i reads.FASTA gives wtdbg2 the location of your data file (this can also take gzipped files).
- -fo [prefix] gives the prefix for all the result filenames (if you put ecoli-assembly, all your result files will start with ecoli-assembly).
- So putting it all together you get:
wtdbg2 -x [pre-set parameters] -g [genome-size] -t [#threads] -i [path/to/filename.filetype] -fo [prefix-for-results]
- Ex. for ONT sequencing data for an E. coli assembly (approximately 4.6 Mb long) using an EC2 instance with 32 threads:
wtdbg2 -x ont -g 4.6m -t 32 -i ~/data/my-ont-ecoli-data/ecoli-sequencing-data.FASTA -fo my-ecoli-assembly
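A mistyped genome size is an easy mistake to make, so here is a small sketch that checks the value before building the command. The accepted pattern (a number with an optional k, m, or g suffix) is an assumption based on the description above, valid_genome_size is a hypothetical helper, and reads.fa and my-assembly are placeholder names. The command is only printed, not run.

```shell
# Hypothetical helper: succeed if the argument looks like 4600000, 500k, 4.6m, or 1g.
valid_genome_size() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?[kmg]?$'
}

size="4.6m"   # estimated genome size for the -g flag
if valid_genome_size "$size"; then
  # Print the command so it can be reviewed before running it.
  echo "wtdbg2 -x ont -g $size -t 0 -i reads.fa -fo my-assembly"
else
  echo "Unexpected genome size: $size" >&2
fi
```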
- Next we will invoke the wtpoa-cns command to generate the final consensus FASTA. Using the same -t, -i, and -fo flags, this is what it looks like:
wtpoa-cns -t [#threads] -i [path/to/resultsfolder/prefix-for-results.ctg.lay.gz] -fo prefix-for-results.ctg.fa
- Example for the same E. coli assembly above:
wtpoa-cns -t 32 -i ~/ecoli-assembly-results/my-ecoli-assembly.ctg.lay.gz -fo my-ecoli-assembly.ctg.fa
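Since the consensus step needs the .ctg.lay.gz file produced by the assembly step, the two commands can be chained so that the second only runs if the first succeeds. This is a minimal sketch, not an official wtdbg2 script: run and assemble_and_consensus are hypothetical helper names, reads.fa is a placeholder file, and DRY_RUN defaults to 1 so the commands are only printed.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = reads file, $2 = genome size, $3 = threads, $4 = output prefix
assemble_and_consensus() {
  run wtdbg2 -x ont -g "$2" -t "$3" -i "$1" -fo "$4" &&
  run wtpoa-cns -t "$3" -i "$4.ctg.lay.gz" -fo "$4.ctg.fa"
}

assemble_and_consensus reads.fa 4.6m 32 my-ecoli-assembly
```

Setting DRY_RUN=0 before calling the function would run the real commands, which assumes wtdbg2 and wtpoa-cns are on your PATH.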
- Downloading your results.
- As mentioned above, results from wtdbg2 are saved in the directory your CLI was in when you ran the commands. We can once again use the scp command from step 5 (with a slight change) to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. Two flags can be combined, so -r and -i become -ri (note, not -ir: order matters, because the keypair path has to come directly after -i). The order of the arguments is scp, -ri, keypair, results on the EC2 instance, local destination:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
- Ex.
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/ecoli-data/ecoli-ont /Users/flintmitchell/Desktop/GBI/Results
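Before copying the results down, a quick sanity check on the consensus file can save a wasted transfer. Each sequence in a FASTA file starts with a header line beginning with >, so counting those lines gives the number of contigs. This is a minimal sketch; count_contigs is a hypothetical helper and the filename is the one from the E. coli example above.

```shell
# Hypothetical helper: count FASTA header lines (one per contig).
count_contigs() {
  grep -c '^>' "$1"
}

# Assumption: run from the results folder that holds the consensus file.
fasta="my-ecoli-assembly.ctg.fa"
if [ -f "$fasta" ]; then
  echo "$fasta contains $(count_contigs "$fasta") contigs"
else
  echo "$fasta not found in $(pwd)"
fi
```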
Just like with the other assemblers, I will be updating this page with more information on how wtdbg2 actually works and more about the parameters that we can change to optimize our assemblies.
Resources for the above steps that may help you: