Assembling with Canu on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page will help you if you would like to run Canu to assemble a genome in AWS.
Canu on Ubuntu
If you are using an instance that is already set up to run Canu, start at step 6. As of 04/02/21, the current custom EC2 Canu AMI for GBI has the ID ami-0cb69f700338c6750 and the name GBI_CanuAssembler_Ubuntu_Arm_r6g.xlarge. To create an instance from this AMI, follow the instructions on the EC2 page.
If you are starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):
- Start Ubuntu Instance with a 64-bit (ARM) processor
- Log in through terminal:
$ ssh -i /path/to/keypairs/keypair.pem [email protected]
example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
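A common stumbling block at this step is that ssh refuses a private key whose file permissions let other users read it. Here is a minimal sketch of a check to run on your local computer before logging in. The helper names `key_is_private` and `ensure_private_key` are hypothetical, not part of ssh or the AWS tooling.

```shell
# key_is_private FILE: succeeds when group and others have no permissions on FILE.
# (Hypothetical helper for illustration.)
key_is_private() {
  # In `ls -ld` output, characters 5-10 are the group and other permission bits.
  [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# ensure_private_key FILE: restrict FILE to owner read-only if it is too open.
ensure_private_key() {
  if key_is_private "$1"; then
    echo "permissions ok: $1"
  else
    chmod 400 "$1" && echo "permissions tightened to 400: $1"
  fi
}
```

For example, `ensure_private_key /path/to/keypairs/keypair.pem` could be run once before the ssh command above.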
- Install Canu dependencies
- $ sudo apt update && sudo apt upgrade
- $ sudo apt install build-essential
Check that you have Perl 5.12 or newer installed:
- $ perl -v
- $ sudo apt install openjdk-8-jre-headless
- $ sudo apt install gnuplot
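After the install commands above, a quick script can confirm that everything landed. This is only a sketch: the function names are made up for illustration, and it assumes `sort -V` is available (it is on Ubuntu). It checks for `gcc` and `make` on the assumption that those are what build-essential provides.

```shell
# version_ge A B: succeeds when version string A is at least B (uses sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# have_cmd NAME: succeeds when NAME is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# check_canu_deps: print one status line per dependency; fail if any is missing.
# The body runs in a subshell, so its variables do not leak into your session.
check_canu_deps() (
  status=0
  if have_cmd perl; then
    perl_version="$(perl -e 'printf "%vd", $^V')"
    if version_ge "$perl_version" "5.12"; then
      echo "perl $perl_version: ok"
    else
      echo "perl $perl_version: older than 5.12"
      status=1
    fi
  else
    echo "perl: missing"
    status=1
  fi
  for tool in gcc make java gnuplot; do
    if have_cmd "$tool"; then
      echo "$tool: ok"
    else
      echo "$tool: missing"
      status=1
    fi
  done
  # exit here only leaves the subshell and becomes the function's return status
  exit "$status"
)
```

Running `check_canu_deps` on the instance prints one line per tool, so a missing package is easy to spot before starting an assembly.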
- Install Canu
sudo apt install canu
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
example:
mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
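- Assemble your genome with Canu! Here's an example for Oxford Nanopore data of lambda phage. Canu reads FASTA or FASTQ files, so raw fast5 files need to be basecalled first, and genomeSize should be close to the real genome length (about 48.5 kb for lambda phage):
canu -p lambda -d lambda-phage-ont genomeSize=48.5k -nanopore-raw lambda_26620_read_11_ch_126d.fastq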
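Before starting a run, it can help to check how much data you actually have relative to the genome size. The sketch below counts reads and bases in a FASTQ file and turns that into an average depth. The function names are hypothetical, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# fastq_stats FILE: print "<reads> <bases>" for an uncompressed FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
fastq_stats() {
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END { printf "%.0f %.0f\n", reads, bases }' "$1"
}

# estimate_coverage BASES GENOME_SIZE_BP: print the average depth to one decimal place.
estimate_coverage() {
  awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}
```

For example, `fastq_stats sequencing_files.fastq` prints the read and base counts, and `estimate_coverage 2425100 48502` prints `50.0`, where 48502 is the approximate length of the lambda phage genome in base pairs.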
- Download your results. The results and all information produced by Canu (logs, documentation, etc.) are put into the folder that you name with -d in the above command (lambda-phage-ont in that example), relative to the directory you run Canu from. So if you run it from ~/sequencing-data-folder, where the data is stored, Canu will create a new folder inside it: ~/sequencing-data-folder/data-results-folder. We can once again use the scp command from the data-copy step (with a slight change) to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. Two flags can be sent together, so -r and -i become -ri. Order matters here: -i takes the keypair path as its argument, so it must come last in the group, and -ir will not work. The general form is scp -ri keypair results-on-ec2-instance local-folder:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
example:
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/lambda-phage-data/lambda-phage-ont /Users/flintmitchell/Desktop/GBI/Results
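To see why the order of the combined flags matters, here is a small illustration using the shell's own `getopts`. It is a model of getopt-style option parsing, not scp's actual source code, and `parse_flags` is a made-up name. An option that takes an argument swallows whatever follows it in the same word.

```shell
# parse_flags ARGS...: mimic getopt-style parsing of -r (no argument) and -i (takes one).
# Illustration only; this is not how scp itself is implemented.
parse_flags() (
  recursive=no
  identity=none
  OPTIND=1
  while getopts "ri:" opt; do
    case "$opt" in
      r) recursive=yes ;;
      i) identity="$OPTARG" ;;
      *) exit 1 ;;
    esac
  done
  echo "recursive=$recursive identity=$identity"
)

parse_flags -ri keypair.pem   # prints: recursive=yes identity=keypair.pem
parse_flags -ir keypair.pem   # prints: recursive=no identity=r
```

In the second call the letter r is taken as the keypair path, so the recursive flag is never set. Writing the flags separately (`-r -i keypair.pem`) avoids the issue entirely.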
I will be updating this page with more information on how Canu actually works and the parameters that we can change to optimize our assemblies. For now, if you would like more information, check out the Canu documentation and GitHub repository. They have a lot of great, pretty clear information:
https://canu.readthedocs.io/en/latest/tutorial.html#outputs
https://github.com/marbl/canu
Resources for the above steps that may help you:
https://github.com/marbl/canu/releases
Canu quick start https://canu.readthedocs.io/en/latest/quick-start.html
Apt-get (pre-installed on Ubuntu) https://help.ubuntu.com/community/AptGet/Howto
Install gcc
https://askubuntu.com/questions/859256/how-to-install-gcc-7-or-clang-4-0
apt-get install -y gcc-7
Check if perl is installed
https://www.perl.org/get.html#unix_like
perl -v
Install java SE 8
https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04
java -version
sudo apt install openjdk-8-jre-headless
Install gnuplot
https://www.howtoinstall.me/ubuntu/18-04/gnuplot/
sudo apt install gnuplot
Since gnuplot takes quite a bit of storage (almost a GB), one option is to skip installing it to save room on the mounted EBS volumes. The savings are relatively inconsequential (1 GB/month = $0.08), but they could add up if we are running lots of instances. The results, once taken off the cloud after assembly, can be graphed on a local computer.
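The storage arithmetic above can be sketched as a one-line calculation. The function name is hypothetical, and the default price of 0.08 dollars per GB-month is simply the figure quoted above; actual EBS pricing depends on volume type and region.

```shell
# ebs_monthly_cost GB INSTANCES [PRICE_PER_GB_MONTH]: print the monthly cost in dollars.
# The default price (0.08) is the figure quoted in the note above, not a live AWS price.
ebs_monthly_cost() {
  awk -v gb="$1" -v n="$2" -v price="${3:-0.08}" \
    'BEGIN { printf "%.2f\n", gb * n * price }'
}

ebs_monthly_cost 1 1     # one instance: prints 0.08
ebs_monthly_cost 1 50    # fifty instances: prints 4.00
```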
Install Canu
sudo apt install canu
Download awscli so I can use S3 buckets
sudo apt install awscli
Number of cores = number of threads https://forums.aws.amazon.com/thread.jspa?threadID=25011
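One way to see how many CPUs the instance reports, and so roughly how many threads a tool could run at once, is sketched below. The function name `cpu_count` is made up for illustration; `getconf` and `nproc` are both standard on Ubuntu.

```shell
# cpu_count: print the number of processors currently online.
# Tries getconf first and falls back to nproc if that fails.
cpu_count() {
  getconf _NPROCESSORS_ONLN 2>/dev/null || nproc
}

echo "CPUs available: $(cpu_count)"
```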