Assembly with String Graph Assembler (SGA) - Green-Biome-Institute/AWS GitHub Wiki
String Graph Assembler
I am including some documentation on the String Graph Assembler (SGA), though I’m not going to dive too deep. It will probably not be one we use often. However, I think it serves a good purpose as a short-read assembler that does not use De Bruijn graphs, and it is a good example of subprograms, which all the assemblers use. Assemblers are generally written as a series of subprograms that run consecutively. These are sub-processes like pre-processing of data, creation of contigs, scaffolding, polishing, etc.
The subprograms for SGA are:
preprocess - Prepare a set of sequence reads for assembly
index - Build the FM index for a set of sequence reads
merge - Merge two indices together. This can be used to build a distributed indexing pipeline.
overlap - Find overlaps between reads to construct a string graph
fm-merge - Efficiently merge reads that can be unambiguously assembled
correct - Correct base calling errors in a set of reads
filter - Remove duplicate and low quality sequences
assemble - Construct contigs from a string graph
As you can see, each subprogram has a dedicated purpose. It may not be possible to use all of these modules with data processed by other assemblers that use a different algorithm (SGA would not be able to construct contigs from a De Bruijn graph, like that of SOAPdenovo2), but each of these modules can be thought of as an individual tool. For example, if you type
sga preprocess --help
you can see that it has options for:
Conversions/Filtering:
--phred64 convert quality values from phred-64 to phred-33.
--discard-quality do not output quality scores
-q, --quality-trim=INT perform Heng Li's BWA quality trim algorithm.
Reads are trimmed according to the formula:
argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT
where l is the original read length.
-f, --quality-filter=INT discard the read if it contains more than INT low-quality bases.
Bases with phred score <= 3 are considered low quality. Default: no filtering.
The filtering is applied after trimming so bases removed are not counted.
Do not use this option if you are planning to use the BCR algorithm for indexing.
-m, --min-length=INT discard sequences that are shorter than INT
this is most useful when used in conjunction with --quality-trim. Default: 40
-h, --hard-clip=INT clip all reads to be length INT. In most cases it is better to use
the soft clip (quality-trim) option.
--permute-ambiguous Randomly change ambiguous base calls to one of possible bases.
If this option is not specified, the entire read will be discarded.
-s, --sample=FLOAT Randomly sample reads or pairs with acceptance probability FLOAT.
--dust Perform dust-style filtering of low complexity reads.
--dust-threshold=FLOAT filter out reads that have a dust score higher than FLOAT (default: 4.0).
--suffix=SUFFIX append SUFFIX to each read ID
Meaning you could use the module ‘sga preprocess’ to trim your input fastq data, filter it by discarding reads that contain too many low-quality bases (phred) or that are shorter than a minimum length, change ambiguous base calls to one of the possible bases, and do a dust-style filtering of low complexity reads (to my understanding this is a component of the BLAST algorithm that masks reads which might be real, but aren’t necessarily interesting. Ex. a long stretch of T’s ‘TTTTTTTT’ is less complex than a multi-nucleotide stretch ‘TGCTAGCA’, so the first would be ‘masked’).
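As a rough sketch of how those options combine on one command line, the snippet below wraps a single `sga preprocess` call in a small helper. The flags are taken from the help text above; the helper name `preprocess_reads`, the file names, and the threshold values are placeholders I made up for illustration, not recommendations. It assumes `sga preprocess` writes the processed reads to standard output, and it defaults to a dry run that only prints the command.

```shell
# Sketch only: file names and thresholds are placeholders.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

preprocess_reads() {
    input="$1"    # raw FASTQ file (hypothetical name)
    output="$2"   # where the cleaned reads should go (hypothetical name)
    if [ "$DRY_RUN" = "1" ]; then
        echo "sga preprocess -q 20 -m 40 --dust --permute-ambiguous $input > $output"
    else
        # -q 20: quality-trim, -m 40: drop reads shorter than 40 bases,
        # --dust: low-complexity filter, --permute-ambiguous: keep reads
        # with ambiguous bases by randomizing those bases
        sga preprocess -q 20 -m 40 --dust --permute-ambiguous "$input" > "$output"
    fi
}

preprocess_reads reads.fastq reads.pp.fastq
```

Run it once with the default dry run to check the command, then set `DRY_RUN=0` on an instance where SGA is installed.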
Other than wanting individual commands from this assembler, or having some specific reason not to use a De Bruijn graph, there probably isn't much reason to use it. If you do need to, the basics are that you can either type in each consecutive command with the parameters it needs (found by using --help on any given command, as shown above with preprocess), or write a batch job file with all the commands set up so that it runs through each module consecutively.
A good example of this can be found here:
https://hpc.nih.gov/apps/sga.html
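To make the batch-file idea concrete, here is a minimal sketch of a script that runs the modules one after another and stops at the first failure. It is not a tuned pipeline: the input name, thread count, and the `run` and `sga_pipeline` helper names are hypothetical, and the `-t` and `-o` flags and the intermediate file names (for example `reads.ec.filter.pass.fa`) are my assumptions about SGA's conventions, so confirm each one with `--help` on the module before relying on it. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a batch script that runs SGA modules consecutively.
# File names, thread count, and several flags are assumptions; check --help.

READS="${READS:-reads.fastq}"   # hypothetical input file
THREADS="${THREADS:-4}"         # placeholder thread count
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands instead of running them

run() {
    # Show the command, and execute it only when DRY_RUN is not 1.
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

sga_pipeline() {
    # The && chain stops the pipeline as soon as one module fails.
    run sga preprocess -o reads.pp.fastq "$READS" &&
    run sga index -t "$THREADS" reads.pp.fastq &&
    run sga correct -t "$THREADS" -o reads.ec.fastq reads.pp.fastq &&
    run sga index -t "$THREADS" reads.ec.fastq &&
    run sga filter -t "$THREADS" reads.ec.fastq &&
    run sga overlap -t "$THREADS" reads.ec.filter.pass.fa &&
    run sga assemble -o assembly reads.ec.filter.pass.asqg.gz
}

sga_pipeline
```

Keeping every command behind the `run` helper means the same file can be reviewed as a dry run first and then submitted as a batch job with `DRY_RUN=0`.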
To install SGA, follow these instructions:
SGA on Ubuntu
If you are using an instance that is already set up to run SGA, start at step 7. (06/12/2021) The current custom EC2 ABySS AMI for GBI has the ID _________ and name ________. To create an instance from this, follow the instructions on the EC2 page.
If starting from a brand new instance with no previously installed software, then follow all of these steps (further help can be found on the EC2 page):
- Start Ubuntu Instance with a 64-bit (ARM) processor
- Log in through terminal:
$ ssh -i /path/to/keypairs/keypair.pem [email protected]
- example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
- Since SGA requires an old version of gcc and g++, you cannot use build-essential:
- Update/upgrade apt-get and some normal dependencies:
- $ sudo apt-get update && sudo apt-get upgrade
- $ sudo apt-get install clang libboost-all-dev libopenmpi-dev make cmake
- Add the Ubuntu Xenial repositories, which provide the older gcc-5 and g++-5 packages:
- $ sudo vim /etc/apt/sources.list
Add the following to the bottom of this file:
deb http://dk.archive.ubuntu.com/ubuntu/ xenial main
deb http://dk.archive.ubuntu.com/ubuntu/ xenial universe
Then press esc, type `:wq!` and press `enter`.
- $ sudo apt update
- $ sudo apt install g++-5 gcc-5
- Download and set up Google SparseHash Library
- $ git clone https://github.com/sparsehash/sparsehash.git
- $ cd sparsehash/
- /sparsehash$ ./configure
- /sparsehash$ make
- /sparsehash$ make check
- /sparsehash$ sudo make install
- /sparsehash$ make installcheck
- /sparsehash$ cd
# add Sparsehash to the PATH
- $ cd ../../usr/local/include
- /usr/local/include$ export PATH=$PATH:$(pwd)
- Download and set up Bamtools
- $ cd
- $ git clone https://github.com/pezmaster31/bamtools.git
- $ cd bamtools/
- /bamtools$ mkdir build
- /bamtools$ cd build
- /bamtools/build$ cmake -D CMAKE_INSTALL_PREFIX=/usr/local/include/ ..
- /bamtools/build$ make
- /bamtools/build$ sudo make DESTDIR=/usr/local/include/ install
- Download and set up jemalloc
- $ cd
- $ git clone https://github.com/jemalloc/jemalloc.git
- $ cd jemalloc/
- /jemalloc$ ./autogen.sh
- /jemalloc$ make
- /jemalloc$ sudo make install
- Install String Graph Assembler (SGA)
- $ cd
- $ git clone https://github.com/jts/sga.git
- $ cd sga/src
- /sga/src$ ./autogen.sh
- /sga/src$ ./configure --with-sparsehash=/home/ubuntu/ --with-bamtools=/usr/local/include/usr/local/include/ --with-jemalloc=/home/ubuntu/jemalloc/lib
- /sga/src$ make
- /sga/src$ sudo make install
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
- example:
mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
- example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
Now, to use it, please refer to the discussion at the top of this GitHub page.
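Before starting an assembly, it can help to confirm that the install steps actually left the expected programs on your PATH. The following is a minimal sketch of such a check; `check_tools` is a hypothetical helper name, and the list of programs passed to it is only my guess at what is worth checking on this setup.

```shell
# Sketch: report which of the given programs can be found on the PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Programs this page installs or relies on (adjust the list as needed).
check_tools sga gcc-5 g++-5 make cmake ||
    echo "Some tools are missing; revisit the install steps above."
```

If `sga` is reported as missing even though `sudo make install` succeeded, opening a new terminal session and running the check again is a reasonable first thing to try.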