Assembling with wtdbg2 on EC2 - Green-Biome-Institute/AWS GitHub Wiki

This page will help you if you would like to run wtdbg2 to assemble a genome in AWS.

Wtdbg2 on Ubuntu 20.04

Note: We are using a Linux-based OS, and the instance needs an ‘x86’ architecture. If you have questions that aren't answered here, the software's homepage or GitHub repository will usually have documentation on its use. Wtdbg2 in particular has good information on its GitHub page, which is linked in the resources at the bottom of this page.

If you are using an instance that is already set up to run wtdbg2, start at step 9 (making a folder for the results). As of 04/20/21, the custom EC2 wtdbg2 AMI for GBI has the ID ami-06d73133d1e76ae21 and the name GBI_ wtdbg2Assembler_Ubuntu_x86_r5.xlarge. To create an instance from this AMI, follow the instructions on the EC2 page.

If you are starting a brand new instance without wtdbg2 and its dependencies already installed, go to the EC2 page to build and launch the new instance, then follow all of these steps:

Start an Ubuntu Server 20.04 instance with a 64-bit (x86) processor.
Log in through terminal: ssh -i /path/to/keypairs/keypair.pem [email protected]

Ex.: ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
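If ssh refuses the key with a warning about an unprotected private key file, the key's permissions are usually too open. A minimal sketch of the fix follows. It uses a throwaway temporary file in place of a real key so that it can be run safely; in practice KEY would be the path to your own .pem file.

```shell
# Stand-in for a real key file (assumption for illustration);
# in practice this would be e.g. /path/to/keypairs/keypair.pem
KEY="$(mktemp)"

# Make the key readable by its owner only; ssh rejects keys that others can read.
chmod 400 "$KEY"

# Show the resulting permissions; the mode string should start with -r--------
ls -l "$KEY"
```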

If you are using S3, install awscli to gain access to S3 storage buckets: sudo apt install awscli
Make a folder to organize your data and the results that will come from the assembly: mkdir data_folder_name

example: mkdir my_genome_assembly

Copy data from local or S3 to your data folder:

S3: aws s3 cp s3://[bucket-name]/[desired-file] [path/to/instance/location]
SCP: scp -i /path/my-key-pair.pem /local_path/file.filename [email protected]:ec2_path/destination
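After copying, it can be worth checking that the reads file arrived intact before starting a long assembly. The sketch below counts the sequences in a FASTA file by counting its header lines. The function name count_fasta_reads and the tiny demo file are made up for illustration; a FASTQ file would need a different count (its records span four lines each).

```shell
# Count sequences in a FASTA file (optionally gzipped) by counting '>' header lines.
count_fasta_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}

# Tiny made-up FASTA file so the sketch runs on its own.
DEMO="$(mktemp)"
printf '>read1\nACGT\n>read2\nGGCC\n' > "$DEMO"

count_fasta_reads "$DEMO"   # prints 2
```

On the instance you would call count_fasta_reads on the file you copied and compare the number with the count from the original file.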

Before downloading wtdbg2, update apt and install the tools needed to download and compile wtdbg2 on the instance:

Update apt: sudo apt update
Upgrade installed packages: sudo apt upgrade
Install gcc: sudo apt install gcc
Install make: sudo apt install make
Install the zlib development files (needed to compile wtdbg2): sudo apt install zlib1g-dev
Install git: sudo apt install git
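Before moving on, a quick check that the command-line tools were actually installed can save a failed build. This is a small sketch; missing_tools is a hypothetical helper name, and it only checks for commands on the PATH, so it cannot confirm that the zlib library is present.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools installed in the steps above; empty output means all were found.
missing_tools gcc make git
```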

Download wtdbg2 using git and set it up

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make

Make sure you are in the wtdbg2 folder (cd wtdbg2) and then add it to your path using:

PATH=$PATH:$(pwd)
Now you will be able to use the commands wtdbg2 and wtpoa-cns to do assembly and consensus.
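Setting PATH this way only affects the current shell session, and running it twice adds the folder twice. A minimal sketch of a more careful version is below. The helper add_to_path is made up for illustration, and the sketch assumes the repository was cloned into your home directory.

```shell
# Append a directory to PATH only if it is not already present.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

WTDBG2_DIR="$HOME/wtdbg2"   # assumption: where the repository was cloned
add_to_path "$WTDBG2_DIR"
export PATH

# To keep the setting for future logins, the same export can be added to ~/.bashrc:
#   echo 'export PATH="$PATH:$HOME/wtdbg2"' >> ~/.bashrc
```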

Wtdbg2 saves its output files in the current working directory, so make a folder for the results and enter it:

mkdir [result-foldername] && cd [result-foldername]

According to the documentation, "wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns. Executable wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". Executable wtpoa-cns takes this file as input and produces the final consensus in FASTA."

So to do an assembly you will first invoke the wtdbg2 command. Several command flags are needed to set up the assembly. For a full list of them, type wtdbg2 with nothing after it. The ones we need are:
- -x [preset] selects a set of pre-set parameters for the sequencing technology. We will use ont for Oxford Nanopore data, but there are other presets for PacBio reads.
- -g [genome-size] gives the estimated genome size (use the suffixes k, m, and g for Kb, Mb, and Gb, respectively).
- -t [#threads] gives the number of threads to use for the assembly or consensus (entering 0 uses all available threads; otherwise give at most the number of threads on your EC2 instance).
- -i [path/to/reads] gives wtdbg2 the location of your data file (gzipped files are also accepted).
- -fo [prefix] gives the prefix for all the result filenames (if you put ecoli-assembly, all your result files will start with ecoli-assembly).

So putting it all together you get:
- wtdbg2 -x [pre-set parameters] -g [genome-size] -t [#threads] -i [path/to/filename.filetype] -fo [prefix-for-results]
  - Example for ONT sequencing data for an E. coli assembly (approximately 4.6 Mb long) using an EC2 instance with 32 threads: wtdbg2 -x ont -g 4.6m -t 32 -i ~/data/my-ont-ecoli-data/ecoli-sequencing-data.FASTA -fo my-ecoli-assembly
Next we will invoke the wtpoa-cns command to generate the final consensus FASTA. Using the same -t, -i, and -fo flags, this is what it looks like:
- wtpoa-cns -t [#threads] -i [path/to/resultsfolder/prefix-for-results.ctg.lay.gz] -fo prefix-for-results.ctg.fa
  - Example for the same E. coli assembly as above: wtpoa-cns -t 32 -i ~/ecoli-assembly-results/my-ecoli-assembly.ctg.lay.gz -fo my-ecoli-assembly.ctg.fa
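The two commands can be chained in a small script so that the consensus step only starts if the assembly step succeeded. The sketch below is one way to do that, not part of wtdbg2 itself. The variable names, the run and assemble helpers, and the DRY_RUN switch are all made up for illustration, and the values are taken from the E. coli example. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
# Placeholder values from the E. coli example; adjust for your own data.
READS="$HOME/data/my-ont-ecoli-data/ecoli-sequencing-data.FASTA"
PREFIX="my-ecoli-assembly"
GENOME_SIZE="4.6m"
THREADS=32
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print a command and, unless this is a dry run, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Assembly first, then consensus; && stops the chain if the assembly fails.
assemble() {
    run wtdbg2 -x ont -g "$GENOME_SIZE" -t "$THREADS" -i "$READS" -fo "$PREFIX" &&
    run wtpoa-cns -t "$THREADS" -i "$PREFIX.ctg.lay.gz" -fo "$PREFIX.ctg.fa"
}

assemble
```

Running the dry run first is a cheap way to check the paths and flags before committing an instance to a long assembly.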

Downloading your results.

As mentioned above, wtdbg2 saves its results in the directory you ran it from. We can once again use the scp command from step 5 (copying data to the instance), this time with the source and destination swapped, to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. The two flags can be combined as -ri. The order matters (-ir will not work) because -i must be immediately followed by the keypair path. The order of the arguments is scp, -ri, keypair, results on the EC2 instance, local destination:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
- Ex. scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/ecoli-data/ecoli-ont /Users/flintmitchell/Desktop/GBI/Results
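If the results folder contains many files, bundling it into a single compressed archive on the instance first means only one file has to be downloaded. A minimal sketch is below. It builds a stand-in results folder with made-up files so that it runs on its own; on the instance you would point tar at your real results folder instead.

```shell
# Stand-in results folder with two made-up files (for illustration only).
WORK="$(mktemp -d)"
mkdir "$WORK/results_folder_name"
echo ">ctg1"  > "$WORK/results_folder_name/my-ecoli-assembly.ctg.fa"
echo "layout" > "$WORK/results_folder_name/my-ecoli-assembly.ctg.lay"

# Bundle and compress the whole folder into one archive.
tar -czf "$WORK/results.tar.gz" -C "$WORK" results_folder_name

# List the archive contents to confirm what will be downloaded.
tar -tzf "$WORK/results.tar.gz"
```

The archive can then be copied with scp without the -r flag and unpacked locally with tar -xzf.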

Just like with the other assemblers, I will be updating this page with more information on how wtdbg2 actually works and more about the parameters that we can change to optimize our assemblies.

Resources for the above steps that may help you:

https://github.com/ruanjue/wtdbg2
