Assembling with SOAPdenovo2 on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page will help you if you would like to run SOAPdenovo2 to assemble a genome using short-read Illumina data in AWS.
SOAPdenovo2 on Ubuntu
If you are using an instance that is already assembled to run SOAPdenovo2, start at step 7. (06/12/2021) The current custom EC2 SOAPdenovo2 AMI for GBI has the ID ami-0cb69f700338c6750 and name GBI_SOAPdenovo2_Ubuntu_Arm_r6g.xlarge. To create an instance from this, follow the instructions on the EC2 page.
If starting from a brand new instance with no previously installed software, then follow all of these steps (further help can be found on EC2 page:
- Start Ubuntu Instance with a 64-bit (ARM) processor
- Log in through terminal:
ssh -i /path/to/keypairs/keypair.pem [email protected]
- example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
- Set up the basics:
- Update/upgrade apt:
sudo apt update && sudo apt upgrade
- Download the build-essentials for building/installing new softwares:
sudo apt install build-essential
- Download SOAPdenovo2:
sudo apt install soapdenovo2
- This will install 2 independent forms of the SOAPdenovo assembler,
soapdenovo2-127mer
andsoapdenovo2-63mer
. The numbers at the end, 127 and 63, respectively, correspond to the k-mer value used in assembly. For a better description of k-mers, go to the Assembly Basics page. Changing the k-mer value can dramatically change the amount of memory required by an assembly software to do the computation, so to the best of my knowledge the 127-mer software is written so that it needs/allocates more memory than the 63-mer software. Therefore implementing an assembly of the same Illumina data with all the same parameters including the k-mer value on both of the softwares should give the same result, but the 127-mer software will require more memory on the instance. If you are using a k-mer lower than 63 then you can use thesoapdenovo2-63mer
version.
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
- example:
mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
- example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
- example:
- Using SOAPdenovo2!
- A basic command for SOAPdenovo2 looks like:
soapdenovo2-127mer all -s /path/to/config-file -K [k-mer-value] -p [num-threads] -o /path/to/assembly/directory/assembly-output-filename 1>assembly.log 2>assembly.err
.
I'll break up this command so we can look at each part of it.
- First we'll look at going to look at
-s /path/to/config-file
. SOAPdenovo2 requires a configuration file to run, and-s
tells the program that the next file you give it will be the location/name of it. The config file tells the assembler where to find the raw sequencing files (FASTA, FASTQ, or BAM) and has relevant information to them. Most of the information entered into the config file has default values, but you will want to write your own so that if reflects your experiment. The basic information to look at (and change if necessary) in this config file is:
### Global Information
# Maximum lead length. The length of the longest read in your dataset.
max_rd_len=150
### Library Information
# Multiple libraries can be submitted. Each library section starts with [LIB] and the following information
# before the next [LIB] call (if there is one) relates to that library section. (i.e. if only have 1 library,
# you only need to write [LIB] once)
[LIB]
# Average insert size of the library. You should know this from your experiment.
avg_ins=150
# This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed.
# Illumima produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with
# typical insert size less than 500 bp; b) reverse-forward, generated from circularizing libraries with typical insert
# size greater than 2 Kb. The parameter "reverse_seq" should be set to indicate this: 0, forward-reverse; 1, reverse-forward.
reverse_seq=0
# This indicator decides in which part of the assembly the reads are used.
# 1 = contig assembly only
# 2 = scaffold assembly only
# 3 = both contig and scaffold assembly
# 4 = only gap closure
asm_flags=3
# Length of the longest read in this library
rd_len_cutoff=150
# The order in which these reads are used in scaffolding. If you had a second library, you could have them
# used at the same time for scaffolding (make both have rank=1), or use one first then the other
# (one would be rank=1 and the other rank=2).
rank=1
# This parameter is the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds.
# Min for paired-end is 3. Min for mate-pair is 5.
pair_num_cutoff=3
# Minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# Paired-end fastq files, read 1 file should always be followed by read 2 file.
q1=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_1.fastq
q2=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_2.fastq
I added some notes to the above variable descriptions from https://home.cc.umanitoba.ca/~psgendb/doc/SOAP/SOAPdenovo-Trans-UsersGuide.html. You can go there, the SOAPdenovo2 github, or my example config file for more information on these variables.
-
With the config file created, the next element of the command we can look at is
soapdenovo2-127mer all
.soapdenovo2-127mer
is the command that finds the software to implement. Then, because SOAPdenovo2 runs through a series of protocols, by typingall
, it will cycle through them, instead of being used one-by-one (though this is an option if you only need to use one module). These modules all have their own set of flags. In order to find out more information about them you can entersoapdenovo2-127mer [module-name] --help
For examplesoapdenovo2-127mer all --help
orsoapdenovo2-127mer pregraph --help
. The modules include:- pregraph: Constructs the k-mer graph. k-mers are strings of nucleotides, which are used when finding possible overlaps between sequencing reads.
- sparse_pregraph: same as the pregraph function above, except that it only uses a subset of the k-mers found in the sequencing data, therefore decreasing the amount of memory needed to do the assembly. For the time being, skip this option.
- contig: Build contigs
- map: map the reads back onto the contigs. I believe this is done to provide information on the gaps for the following step, though I might be wrong.
- scaff: Using the contigs and their estimated position, this tries to pair contigs with eachother, fill gaps between them, and position them relative to eachother, if possible.
-
-K [k-mer-value]
The-K
flag tells the software that the following number will be what k-mer value to use during the assembly. -
-p [num-threads]
The-p
flag gives the software permission to allocate this many cores on the virtual computer for the assembly. You cannot give less than 1, nor can you give more cores than the AWS EC2 instance that you are using has. -
-o /path/to/assembly/directory/assembly-output-filename 1>assembly.log 2>assembly.err
The-o
flag directs the software where to put the output files and what to name them./path/to/assembly/directory/
is the path to where you want them to be stored, andassembly-output-filename
will be the prefix of all the output files.1>assembly.log 2>assembly.err
creates 2 files with information related to how the assembler ran. From what I've gathered,assembly.log
has all the information that is printed to the terminal while the assembler is running (which includes summaries of the results like N50, contig and scaffold lengths, etc!), and2>assembly.err
has information about if the assembler errors out.
So! With all that information we can look at an example. The following command uses the 127-mer version of SOAPdenovo2 with a config file at /home/ubuntu/a-thaliana/
named a-thaliana
, allocating 16 threads on the EC2 instance, a k-mer value of 63, and outputs to the directory /home/ubuntu/a-thaliana/
, with the output file prefix a-thalia-output
with assembly logs named assembly.log
and assembly.err
:
soapdenovo2-127mer all -s /home/ubuntu/a-thaliana/a-thaliana -p 16 -K 63 -R -o /home/ubuntu/a-thaliana/a-thalia-output 1>assembly.log 2>assembly.err
- Downloading your results. The results and all information produced by SOAPdenovo2 (logs, documentation, etc.) is put into a location that you designated in the above command (
/home/ubuntu/a-thaliana/
in that example) wherever the data is stored. So if the data is stored in~/sequencing-data-folder
then it will create a new folder within that~/sequencing-data-folder/data-results-folder
. We can once again use the scp command from step 6 (with a slight change) to copy the results to our local storage. We will also use the flag-r
, which will copy through all the files in a given folder recursively (2 flags can be sent together, so-r
and-i
will be-ri
[note, not-ir
, order matters]) scp -ir keypair results-on-ec2-instance local file:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
- Example:
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/home/ubuntu/a-thaliana/a-thaliana/ /Users/flintmitchell/Desktop/GBI/Results
- Example:
Resources: great documentation on using SOAPdenovo2 for genome assembly: http://manpages.ubuntu.com/manpages/xenial/man1/soapdenovo-63mer.1.html
Good presentation from UCSC on using SOAPdenovo2 https://banana-slug.soe.ucsc.edu/_media/lecture_notes:soap_team2_report.pdf