Assembling with Canu on EC2 - Green-Biome-Institute/AWS GitHub Wiki

This page will help you if you would like to run Canu to assemble a genome in AWS.

Canu on Ubuntu

If you are using an instance that is already set up to run Canu, start at step 6. As of 04/02/21, the current custom EC2 Canu AMI for GBI has the ID ami-0cb69f700338c6750 and the name GBI_CanuAssembler_Ubuntu_Arm_r6g.xlarge. To create an instance from this AMI, follow the instructions on the EC2 page.

If you are starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):

Start Ubuntu Instance with a 64-bit (ARM) processor
Log in through terminal:

$ ssh -i /path/to/keypairs/keypair.pem [email protected]

example:

ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
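ssh refuses a private key whose file permissions are too open, so it usually helps to restrict the key file before the first login. The following is a minimal sketch of that sequence. KEY and HOST are placeholders, not real values, and the function only prints the commands rather than running them:

```shell
#!/bin/sh
# Placeholders (assumptions): substitute your own key file and instance address.
KEY="${KEY:-/path/to/keypairs/keypair.pem}"
HOST="${HOST:-user@instance-public-dns}"

# Print the two commands needed to log in; nothing is executed here.
login_commands() {
  # $1 = key file, $2 = user@address of the instance
  # chmod 400 makes the key readable by its owner only, which ssh requires.
  printf 'chmod 400 %s\n' "$1"
  printf 'ssh -i %s %s\n' "$1" "$2"
}

login_commands "$KEY" "$HOST"
```

Copy the printed lines into the terminal once the placeholders have been replaced.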

Install Canu dependencies

  - $ sudo apt update && sudo apt upgrade
  - $ sudo apt install build-essential
  - Check that you have perl 5.12 or newer installed: $ perl -v
  - $ sudo apt install openjdk-8-jre-headless
  - $ sudo apt install gnuplot
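Once the packages above are installed, a quick check can confirm that the commands they provide are on the PATH. This is only a sketch: the list of command names (perl, java, gnuplot, make, gcc) is an assumption about what those packages install, and missing_tools is a helper defined here, not part of Canu or apt.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools perl java gnuplot make gcc)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not installed yet:" $missing
else
  echo "All dependency commands found."
fi

# Perl 5.12 or newer is needed; $] is Perl's numeric version (5.012 for 5.12).
if command -v perl >/dev/null 2>&1; then
  if perl -e 'exit($] >= 5.012 ? 0 : 1)'; then
    echo "perl is 5.12 or newer"
  else
    echo "perl is older than 5.12"
  fi
fi
```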

Install Canu

sudo apt install canu

Make a folder to organize your data and the results that will come from the assembly:

mkdir data_folder_name

example:

mkdir my_genome_assembly

Copy the data to this new folder

From your local computer using scp:

scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name

example:

scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
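After a large transfer it can be worth checking that the file arrived intact. One simple way is to count the reads on both the local computer and the instance and compare the numbers. The sketch below assumes the common FASTQ layout of exactly four lines per read (no wrapped sequence lines). count_fastq_reads is a helper defined here, and the demo file is made up for illustration.

```shell
#!/bin/sh
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Demo on a tiny made-up file containing two reads.
demo="$(mktemp)"
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$demo"
count_fastq_reads "$demo"   # prints 2
rm -f "$demo"
```

Running count_fastq_reads on the original file locally and on the copy in the instance's data folder should give the same number.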

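Assemble your genome with Canu! Here's an example for Oxford Nanopore data of lambda phage. The genomeSize value is an estimate of the genome length, and the lambda phage genome is about 48.5 kb. Canu reads FASTA or FASTQ files, so raw .fast5 signal files have to be basecalled to FASTQ first:

canu -p lambda -d lambda-phage-ont genomeSize=48.5k -nanopore-raw lambda_26620_read_11_ch_126d.fastq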
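On a single EC2 instance it can help to tell Canu how many threads it may use and to keep it from looking for a cluster scheduler. The sketch below builds such a command from the number of cores reported by nproc, using Canu's maxThreads and useGrid options. The reads file name is a placeholder, the 48.5k genome size is the approximate length of the lambda phage genome, and the command is only printed, not run.

```shell
#!/bin/sh
# Build a canu command line for a single machine.
canu_command() {
  # $1 = prefix, $2 = output folder, $3 = genome size, $4 = threads, $5 = reads file
  printf 'canu -p %s -d %s genomeSize=%s maxThreads=%s useGrid=false -nanopore-raw %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# One thread per core; fall back to 1 if nproc is unavailable.
THREADS="$(nproc 2>/dev/null || echo 1)"

# reads.fastq is a hypothetical file name; use your own basecalled reads.
canu_command lambda lambda-phage-ont 48.5k "$THREADS" reads.fastq
```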

Download your results. The results and all other output produced by canu (logs, reports, etc.) are put into the folder that you name with -d in the above command (lambda-phage-ont in that example). Canu creates this folder inside the directory the command is run from, so if you run canu from ~/sequencing-data-folder, it will create a new folder within it: ~/sequencing-data-folder/data-results-folder. We can once again use the scp command that copied the data to the instance, with the source and destination swapped, to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. The two flags can be combined as -ri but not as -ir: order matters, because the keypair path has to come directly after -i. The general form is scp -ri keypair results-on-ec2-instance local-folder:

scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder

example:

scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/lambda-phage-data/lambda-phage-ont /Users/flintmitchell/Desktop/GBI/Results
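Once the results folder is on the local computer, a quick look at the assembled contigs shows whether the run produced anything sensible. Canu writes the contigs to a file named after the prefix given with -p (lambda.contigs.fasta for the example above). The sketch below counts the sequences and total bases in a FASTA file. fasta_summary is a helper defined here, and the demo file is made up.

```shell
#!/bin/sh
# Report the number of sequences and the total number of bases in a FASTA file.
fasta_summary() {
  awk '/^>/ { n++; next }
       { bases += length($0) }
       END { printf "%d sequences, %d bases\n", n, bases }' "$1"
}

# Demo on a tiny made-up FASTA file with two sequences.
demo="$(mktemp)"
printf '>tig1\nACGTACGT\nACGT\n>tig2\nGG\n' > "$demo"
fasta_summary "$demo"   # prints: 2 sequences, 14 bases
rm -f "$demo"
```

For the lambda phage example, a good assembly would be expected to show a total length close to the 48.5 kb genome.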

I will be updating this page with more information on how Canu actually works and the parameters that we can change to optimize our assemblies. For now, if you would like more information, check out the Canu documentation and GitHub repository, which have a lot of clear information:

https://canu.readthedocs.io/en/latest/tutorial.html#outputs

https://github.com/marbl/canu

Resources for the above steps that may help you:

Canu releases https://github.com/marbl/canu/releases

Canu quick start https://canu.readthedocs.io/en/latest/quick-start.html

Apt-get (pre-installed on Ubuntu) https://help.ubuntu.com/community/AptGet/Howto

Install gcc https://askubuntu.com/questions/859256/how-to-install-gcc-7-or-clang-4-0
  - $ apt-get install -y gcc-7

Check if perl is installed https://www.perl.org/get.html#unix_like
  - $ perl -v

Install java SE 8 https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04
  - $ java -version
  - $ sudo apt install openjdk-8-jre-headless

Install gnuplot https://www.howtoinstall.me/ubuntu/18-04/gnuplot/
  - $ sudo apt install gnuplot

Since gnuplot takes quite a bit of storage (almost a GB), there is the possibility of not installing it in order to save some room in the mounted EBS volumes. The cost is relatively inconsequential (1GB/month = $0.08), but skipping it is an option if we are running lots of instances. The results, once taken off the cloud after assembly, can be graphed on a local computer.

Install canu
  - $ sudo apt install canu

Install awscli so I can use S3 buckets
  - $ sudo apt install awscli

Number of cores = number of threads https://forums.aws.amazon.com/thread.jspa?threadID=25011
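With awscli installed, the results folder could also be copied to an S3 bucket instead of (or in addition to) downloading it with scp. The sketch below only prints an aws s3 cp command with the --recursive option. The bucket name and key prefix are hypothetical, and it assumes the instance already has credentials or an IAM role that allows writing to the bucket.

```shell
#!/bin/sh
# Build an "aws s3 cp" command that uploads a whole results folder.
s3_upload_command() {
  # $1 = local results folder, $2 = bucket name, $3 = key prefix inside the bucket
  printf 'aws s3 cp %s s3://%s/%s --recursive\n' "$1" "$2" "$3"
}

# my-example-bucket and the prefix are placeholders, not real GBI resources.
s3_upload_command lambda-phage-ont my-example-bucket assemblies/lambda-phage-ont
```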
