Assembling with SparseAssembler on EC2 - Green-Biome-Institute/AWS GitHub Wiki


This page will help you if you would like to run SparseAssembler to assemble a genome using short-read Illumina data in AWS.

SparseAssembler on Ubuntu

"A sparse k-mer graph based, memory-efficient genome assembler." https://github.com/yechengxi/SparseAssembler

SparseAssembler is a De Bruijn graph assembler, like ABySS and SOAPdenovo2. The difference is in k-mer construction (a part of De Bruijn assembly discussed on the assembly basics page), where fewer of the possible k-mers are used. This is the "sparse" use of k-mers, and it is intended to save on memory usage but, more importantly, on the computational load. Fewer k-mers effectively means fewer pieces of information to try to pair together. Like SGA, this one hasn't been updated in a while and will probably not end up being our workhorse, but I think it's valuable to see what else is out there besides the most popular assemblers and how they've been adapted.

If you are using an instance that is already set up to run SparseAssembler, start at step ___. The current custom EC2 ABySS AMI for GBI has the ID __________ and name ________. To create an instance from this, follow the instructions on the EC2 page.

If you are starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):

Start Ubuntu Instance with a 64-bit (ARM) processor
Log in through terminal:

$ ssh -i /path/to/keypairs/keypair.pem ubuntu@<instance-public-dns>

example:

ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]

Set up the basics and dependencies:

Update and upgrade the installed packages with apt-get, then install build-essential for building and installing new software:

  - $ sudo apt-get update && sudo apt-get upgrade
  - $ sudo apt-get install build-essential

Install SparseAssembler

  - $ cd
  - $ git clone https://github.com/yechengxi/SparseAssembler.git
  - $ cd SparseAssembler/
  - /SparseAssembler$ cd compiled/
  - /SparseAssembler/compiled$ chmod +x SparseAssembler
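
Before moving on, it can be worth confirming that the binary is where the steps above put it and that the chmod took effect. The following is a minimal sketch; the function name check_sparseassembler is made up for illustration, and the path assumes the repository was cloned into your home directory as in the steps above.

```shell
# Hypothetical helper (not part of SparseAssembler): report whether the
# binary at the given path exists and has its executable bit set.
check_sparseassembler() {
  bin="$1"
  if [ ! -f "$bin" ]; then
    echo "missing: $bin"
    return 1
  fi
  if [ ! -x "$bin" ]; then
    echo "not executable: $bin (run chmod +x on it)"
    return 1
  fi
  echo "ok: $bin"
}

# Assumes the clone location used in the install steps.
check_sparseassembler "$HOME/SparseAssembler/compiled/SparseAssembler" || true
```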

Using SparseAssembler

Since this assembler is a little less documented, I first highlight some of the important parts of the command and then paste the full documentation at the end.

The flags that seem most important are:

g is related to the "sparseness" of the k-mers; a higher value here skips more k-mers.

This is something that would have to be optimized. The balance here is between how many k-mers are created and what percentage of those will eventually be used in contigs and, further on, in scaffolds. More k-mers created means more memory and computing resources are required to do an assembly (fewer k-mers would therefore require less of each). Now, when skipping possible k-mers, how does it affect your final assembly? If you skip too many, does it negatively impact the final result?

After doing some basic optimization of these parameters while performing Arabidopsis thaliana assemblies, I unfortunately don't have any conclusive evidence for how you should adjust this parameter yourself. With the Illumina A. thaliana data I used, I had ~20x coverage and got an N50 of ~5 kb using ABySS with k = 40, 50, and 60. Here, I tried k = 40 and 60 with g (the sparseness parameter) = 10, 17, and 24 (these values appear to span the ones used in examples found via the GitHub page). I did not see a large difference in the output assembly results between g = 10, 17, and 24, yet I saw a decrease in the number of k-mers used during the contig-building process. Compared to the number of k-mers used with g = 10, g = 17 used 60% and g = 24 used 46%. This is just a proof of principle that the assembler did in fact lower the memory required (via the number of k-mers used, and therefore the number of possible contigs to align) while providing similar assembly quality.

Unfortunately, I don't think this can be extrapolated to other assemblies. If you want to use this assembler for a de novo genome assembly, it would be wise to do your own sweep of the parameters to see which settings are most efficient while still providing a quality result.
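
A sweep like the one described above could be organized roughly as follows. This is only a dry-run sketch: it prints one command per k/g combination instead of running anything. The file names reads_1.fastq and reads_2.fastq, the GS value, the sweep_k*_g* directory names, and the function name sweep_commands are all placeholders I made up, and the reads are assumed to sit in the directory you launch the sweep from. Each run gets its own directory so the outputs of different combinations don't overwrite each other.

```shell
# Dry-run sketch of a k/g sweep: prints the commands, does not execute them.
# All values below are placeholders, not recommendations.
SPARSE="$HOME/SparseAssembler/compiled/SparseAssembler"
READS_1="reads_1.fastq"   # hypothetical input file
READS_2="reads_2.fastq"   # hypothetical input file
GS=200000000              # genome size estimate passed to GS (placeholder)

sweep_commands() {
  for k in 40 60; do
    for g in 10 17 24; do
      outdir="sweep_k${k}_g${g}"
      # One directory per combination; reads are referenced from the parent.
      echo "mkdir -p $outdir && (cd $outdir && $SPARSE g $g k $k LD 0 GS $GS NodeCovTh 1 EdgeCovTh 0 f ../$READS_1 f ../$READS_2 > run.log 2>&1)"
    done
  done
}

sweep_commands
```

Once the printed commands look right, they can be pasted into the terminal one at a time, or the output can be piped into sh to run them in sequence.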

k sets the length of the k-mers used when creating the De Bruijn graph.
LD loads a saved k-mer graph; setting it to 0 means you are not loading one.
GS is the estimated genome length in bp, which is used for memory allocation. If memory isn't an issue, you might use 2x (or greater) of the estimated genome length to prevent running into issues.
NodeCovTh appears to set a coverage threshold for identifying spurious k-mers. The default is 1, but I haven't experimented with adjusting it.
EdgeCovTh, like the above parameter, looks like it sets a coverage threshold for identifying spurious edges (links) in the De Bruijn graph. The default is 0.
f tells the assembler that the following file is sequencing data. You can use it several times, as in the following example command from the GitHub page, to input your Illumina short-read sequencing data.

example command:

./SparseAssembler g 10 k 51 LD 0 GS 200000000 NodeCovTh 1 EdgeCovTh 0 f frag_1.fastq f frag_2.fastq f frag_3.fastq &
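
If you have many input files, a command like the one above can be assembled from a genome-size estimate and a list of FASTQ files. The sketch below is just an illustration: build_sparse_cmd is a made-up helper name, the 100000000 bp estimate is a placeholder, and the other flag values are simply copied from the example command. It doubles the estimate for GS, following the 2x suggestion, and repeats the f flag once per file. It only prints the command.

```shell
# Hypothetical helper: print a SparseAssembler command for a genome-size
# estimate (in bp) and any number of single-end FASTQ files.
build_sparse_cmd() {
  est_size="$1"
  shift
  gs=$((est_size * 2))   # 2x the estimate, per the GS suggestion
  cmd="./SparseAssembler g 10 k 51 LD 0 GS $gs NodeCovTh 1 EdgeCovTh 0"
  for fq in "$@"; do
    cmd="$cmd f $fq"     # one f flag per input file
  done
  echo "$cmd"
}

# Placeholder estimate of 100 Mb and the file names from the example.
build_sparse_cmd 100000000 frag_1.fastq frag_2.fastq frag_3.fastq
```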

Documentation

ubuntu@ip-172-31-56-228:~/SparseAssembler/compiled$ ./SparseAssembler --help
Command: 
Programfile g GAP_VALUE k KMER_SIZE LD LOAD_SKG GS GENOME_SIZE TrimN TRIM_READS_WITH_N f INPUT_FILE1 f INPUT_FILE2 i1 INWARD_PAIR_END1 i2 INWARD_PAIR_END2 o1 OUTWARD_PAIR_END1 o2 OUTWARD_PAIR_END2

Parameters:
k: kmer size, support 15~127.
g: number of skipped intermediate k-mers, support 1-25.
f: single end reads. Multiple inputs shall be independently imported with this parameter.
i1 & i2 (or p1 & p2): inward paired-end reads.
o1 & o2 (or l1 & l2): outward paired-end reads.
GS: genome size estimation in bp (used for memory pre-allocation), suggest a large value if possible.(e.g. ~ 2x genome size)
LD: load a saved k-mer graph. 
BC: 1: build contigs.0: don't build.
KmerTable: 1 if you want to output the kmer table.
NodeCovTh: coverage threshold for spurious k-mers, support 0-16. (default 1)
EdgeCovTh: coverage threshold for spurious links, support 0-16. (default 0)
PathCovTh: coverage threshold for spurious paths in the breadth-first search, support 0-100.
TrimLen: trim long sequences to this length.
TrimN: throw away reads with more than this number of Ns.
TrimQual: trim off tails with quality scores lower than this.
QualBase: lowest quality score value (in ASCII value) in the current fastq scoring system, default: '!'.

For error correction:
Denoise: use 1 to call the error correction module. (default 0)
H: hybrid mode. 0 (Default): reads will be trimmed at the ends to ensure denoising accuracy (*MUST* set 0 for the last round). 1: reads will not be trimmed at the ends; 
CovTh: coverage threshold for an error. A k-mer with coverage < this value will be checked. Setting 0 will allow the program to choose a value based on the coverage histogram.
CorrTh: coverage threshold for a correct k-mer candidate. A k-mer with coverage >= this value will be considered a candidate for correction. Setting 0 will allow the program to choose a value based on the coverage histogram.

For scaffolding:
ExpCov: expected average k-mer coverage in a unique contig. Used for scaffolding.
Scaffold: 1: scaffolding with paired reads. 0: single end assembly.
LinkCovTh: coverage threshold for spurious paired-end links, support 0-100. (default 5)
Iter_Scaffold: 1: iterative scaffolding using the already built scaffolds (/super contigs). 0: one round scaffolding.
For mate pair scaffolding:
InsertSize: estimated insert size of the current pair.
i1_mp & i2_mp: inward mate paired reads (large insert sizes >10k, for shorter libraries omit "_mp").
o1_mp & o2_mp : outward paired-end reads (large insert sizes >10k, for shorter libraries omit "_mp").

Downloading your results

The results and all other files produced by SparseAssembler (logs, documentation, etc.) are written to the directory your command line is currently in. So if you are in the directory /home/ubuntu/a-thaliana/, then the output will be stored there as well. We can once again use the scp command from step 6 (with a slight change) to copy the results to our local storage. We can also use the flag -r, which copies all the files in a given folder recursively. Two flags can be sent together, so -r and -i become -ri (note: not -ir; order matters, because -i must be followed directly by the key file path). The general form is scp -ri keypair results-on-ec2-instance local-destination:

scp -ri /path/to/keypairs/keypair.pem ubuntu@<instance-public-dns>:~/data_folder_name/results_folder_name local/path/to/results_folder

Example:

scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:~/home/ubuntu/a-thaliana/a-thaliana/ /Users/flintmitchell/Desktop/GBI/Results
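
If the -ri ordering is easy to forget, the two flags can also be written separately as -r -i, which avoids the issue because the key path always follows -i directly. Below is a small sketch that only prints such a command. The helper name build_scp_cmd, the host placeholder, and the paths are made up for illustration, and the ubuntu login name is assumed from the Ubuntu instance used on this page.

```shell
# Hypothetical helper: print an scp command that recursively copies a
# results folder from the instance to local storage.
build_scp_cmd() {
  key="$1"          # path to the .pem key pair file
  host="$2"         # public DNS name of the instance
  remote_dir="$3"   # results folder on the instance
  local_dir="$4"    # destination folder on your machine
  echo "scp -r -i $key ubuntu@$host:$remote_dir $local_dir"
}

# Placeholder values; substitute your own key, host, and paths.
build_scp_cmd /path/to/keypairs/keypair.pem "<instance-public-dns>" "~/a-thaliana" local/path/to/results_folder
```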

Resources

https://github.com/yechengxi/SparseAssembler
