Assembling with Raven on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page will help you if you would like to run Raven to assemble a genome in AWS.
Raven on Ubuntu
Note: We are using conda to install Raven on this EC2 instance. On a Linux-based OS, we need to use an ‘x86’ system, since Anaconda is not supported on ARM processors (at least not for our purposes).
If you are using an instance that is already set up to run Raven, start at step 7. (04/16/21) The current custom EC2 Raven AMI for GBI has the ID ami-051ac9f5c9eb884b4 and the name GBI_RavenAssembler_Ubuntu_x86_r5.xlarge. To create an instance from this AMI, follow the instructions on the EC2 page.
If starting a brand new instance without Raven and its dependencies already installed, go to the EC2 page to build and launch the new instance, then follow all of these steps:
- Start Ubuntu Instance with a 64-bit (x86) processor.
- Log in through terminal:
ssh -i /path/to/keypairs/keypair.pem ubuntu@[instance-public-dns]
- Ex.:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
- Before downloading Raven, update apt and install some of Raven's dependencies on the instance:
- Update the apt package index:
sudo apt update
- Upgrade the installed packages:
sudo apt upgrade
- Install gcc 4.8+:
sudo apt install gcc
- Install cmake 3.11+:
sudo apt install cmake
- Install clang 4.0+:
sudo apt install clang
- Install or update git:
sudo apt install git
- Install zlib 1.2.8+:
sudo apt install zlib1g
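The minimum versions listed above can be checked after installation. The following is a minimal sketch, not part of the original instructions: the helper names version_ge and check_min are made up for illustration, and it assumes GNU `sort -V` is available (it is on Ubuntu).

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED REQUIRED: print one status line for a tool.
check_min() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (needs $3+)"
    else
        echo "$1 $2: too old (needs $3+)"
    fi
}

# Only query tools that are actually installed.
# `cmake --version` prints "cmake version X.Y.Z" on its first line.
if command -v cmake >/dev/null 2>&1; then
    check_min cmake "$(cmake --version | awk 'NR == 1 { print $3 }')" 3.11
fi
# `gcc -dumpversion` prints the gcc version (major version only on some builds).
if command -v gcc >/dev/null 2>&1; then
    check_min gcc "$(gcc -dumpversion)" 4.8
fi
```

On a recent Ubuntu image the packages from apt are well above these minimums, so this check mostly matters on older or custom images.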
- Download and run the miniconda installer (miniconda is Anaconda Python without the unnecessary packages, since this instance only needs certain packages):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
- Enter ‘yes’ for the default settings.
- Remove the miniconda installer file:
rm Miniconda3-latest-Linux-x86_64.sh
- Start a new shell so that the installer's changes take effect:
bash
- If you did not enter yes when the miniconda installer asked to initialize conda (this sets the correct PATH for your conda environment), use the following:
conda init bash
source ~/.bashrc
- Update conda:
conda update --name base conda --yes
- Install Raven with conda:
conda install -c bioconda raven-assembler
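Before moving on, it can help to confirm that the raven command is actually reachable from your shell. This is a small sketch under the assumption that conda's base environment is active; have_cmd is a hypothetical helper name, and `--version` is the flag listed in raven's own usage output.

```shell
#!/bin/sh
# have_cmd NAME: succeeds when NAME resolves to something runnable on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd raven; then
    echo "raven found at $(command -v raven)"
    raven --version   # flag taken from raven's usage output
else
    echo "raven not found; try 'source ~/.bashrc' or open a new shell"
fi
```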
- If you are using S3, install awscli to gain access to S3 storage buckets:
sudo apt install awscli
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
- Ex.
mkdir my_genome_assembly
- Copy data from local or S3 to your data folder:
- S3:
aws s3 cp s3://[bucket-name]/[desired-file] [path/to/instance/location]
- SCP:
scp -i /path/my-key-pair.pem /local_path/file.filename ubuntu@[instance-public-dns]:ec2_path/destination
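After copying, a quick sanity check on the reads file can catch a truncated transfer before a long assembly run. The sketch below counts reads, assuming the common four-lines-per-record FASTQ layout; count_fastq_reads is an illustrative name, not a Raven or AWS tool.

```shell
#!/bin/sh
# count_fastq_reads FILE: print the number of reads in a FASTQ file.
# Assumes each record spans exactly four lines; gzip-compressed
# files (names ending in .gz) are decompressed on the fly.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Example call (hypothetical path):
#   count_fastq_reads ~/my_genome_assembly/reads.fastq
```

Running the same count on the local copy and on the instance copy should give the same number.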
- Assembly!
If you simply type raven into the command line interface, it will output all the flags associated with the command (shown at the end of this step).
- One flag that will always be relevant is -t, which sets the number of threads used in the computation. For example, to use 4 threads, pass -t 4.
- Raven writes its results to stdout, which by default is the terminal screen. To save the assembly to a FASTA file instead, redirect stdout by adding > [your-filename.fasta] at the end of the raven command, where you can name your-filename whatever you want. To do an assembly, use the following command:
raven -t [number of threads available] [path/to/data/data_filename.fastq] > [desired_assembly_filename.fasta]
- Ex.
raven -t 4 ~/lambda-data/fastq_runid_5dd3f31631aaf8b094e6dfd522b916c92d81e5ac_0.fastq > lambda0assembly.fasta
- Your results will be stored in your current working directory (if you are within a folder for your data, the results are stored in that data folder).
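The assembly step can be wrapped in a short script that picks the thread count automatically and summarizes the output. This is a sketch under stated assumptions: the reads and assembly paths are placeholders, fasta_summary is a made-up helper, and `nproc` (GNU coreutils) is used to report the available processing units.

```shell
#!/bin/sh
# Use every available processing unit; fall back to 1 if nproc is missing.
threads="$(nproc 2>/dev/null || echo 1)"

# Placeholder paths: change these to your own data folder and file names.
reads="$HOME/my_genome_assembly/reads.fastq"
assembly="$HOME/my_genome_assembly/assembly.fasta"

# fasta_summary FILE: print the number of sequences and total bases.
fasta_summary() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d contigs, %d bases\n", n, bases }' "$1"
}

# Run the assembly only when raven and the reads file are both present.
if command -v raven >/dev/null 2>&1 && [ -f "$reads" ]; then
    raven -t "$threads" "$reads" > "$assembly"
    fasta_summary "$assembly"
fi
```

An empty or very small summary is a quick hint that the run failed or that the input file was not what you expected.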
raven command options:
(base) ubuntu@ip-172-31-61-45:~$ raven
usage: raven [options ...] <sequences>
# default output is to stdout in FASTA format
<sequences>
input file in FASTA/FASTQ format (can be compressed with gzip)
options:
--weaken
use larger (k, w) when assembling highly accurate sequences
-p, --polishing-rounds <int>
default: 2
number of times racon is invoked
-m, --match <int>
default: 3
score for matching bases
-n, --mismatch <int>
default: -5
score for mismatching bases
-g, --gap <int>
default: -4
gap penalty (must be negative)
--graphical-fragment-assembly <string>
prints the assembly graph in GFA format
--resume
resume previous run from last checkpoint
--disable-checkpoints
disable checkpoint file creation
-t, --threads <int>
default: 1
number of threads
--version
prints the version number
-h, --help
prints the usage
- Downloading your results.
- As mentioned above, results from Raven are saved in the directory you ran the command from. We can once again use the scp command from the data-copying step above (with the source and destination swapped) to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. The two flags can be combined, so -r and -i become -ri (note: not -ir; -i takes the key pair path as its argument, so it must come last). The order of the arguments is scp, -ri, key pair, results on the EC2 instance, local destination:
scp -ri /path/to/keypairs/keypair.pem ubuntu@[instance-public-dns]:~/data_folder_name/results_folder_name local/path/to/results_folder
- Ex.
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/lambda-phage-data/lambda-phage-ont /Users/flintmitchell/Desktop/GBI/Results
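Once the results are on your local machine, comparing checksums is one way to confirm the copy is complete. The sketch below is an optional extra, not part of the original workflow; file_sha256 and same_file are illustrative names, and it falls back to `shasum -a 256` on systems (such as macOS) without sha256sum.

```shell
#!/bin/sh
# file_sha256 FILE: print only the hex SHA-256 digest of FILE.
file_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{ print $1 }'
    else
        shasum -a 256 "$1" | awk '{ print $1 }'
    fi
}

# same_file A B: succeeds when both files have the same digest.
same_file() {
    [ "$(file_sha256 "$1")" = "$(file_sha256 "$2")" ]
}

# Typical use: run file_sha256 on the assembly on the EC2 instance,
# run it again on the downloaded copy, and compare the two digests.
```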
Just like with the other assemblers, I will be updating this page with more information on how Raven actually works and the parameters that we can change to optimize our assemblies. For now, if you would like more information, check out the Raven GitHub repository: https://github.com/lbcb-sci/raven
Resources for the above steps that may help you:
Install gcc and g++: https://linuxize.com/post/how-to-install-gcc-on-ubuntu-20-04/
sudo apt install gcc
sudo apt install g++
sudo apt install make
Download miniconda (Anaconda Python but without unnecessary packages since this instance will only need very specific requirements). Follow these instructions: https://towardsdatascience.com/managing-project-specific-environments-with-conda-b8b50aa8be0e
Number of cores = number of threads https://forums.aws.amazon.com/thread.jspa?threadID=25011