Assembly with String Graph Assembler (SGA) - Green-Biome-Institute/AWS GitHub Wiki

Go back to GBI AWS Wiki

String Graph Assembler

I am including some documentation on the String Graph Assembler, though I'm not going to dive too deep. It will probably not be an assembler we use often, but it serves a good purpose as a short-read assembler that does not use De Bruijn graphs, and it is a good example of how assemblers are built from subprograms. Assemblers are generally written as a series of subprograms that run consecutively: sub-processes like pre-processing of data, creation of contigs, scaffolding, polishing, etc.

The subprograms for SGA are:

preprocess - Prepare a set of sequence reads for assembly
index - Build the FM index for a set of sequence reads
merge - Merge two indices together. This can be used to build a distributed indexing pipeline.
overlap - Find overlaps between reads to construct a string graph
fm-merge - Efficiently merge reads that can be unambiguously assembled
correct - Correct base calling errors in a set of reads
filter - Remove duplicate and low quality sequences
assemble - Construct contigs from a string graph

source

As you can see, each subprogram has a dedicated purpose. It may not be possible to use all of these modules on data processed by assemblers that use a different algorithm (SGA could not construct contigs from a De Bruijn graph, like that of SOAPdenovo2), but each module can be thought of as an individual tool. For example, if you type

sga preprocess --help

you will see that it includes options for:

Conversions/Filtering:
          --phred64                    convert quality values from phred-64 to phred-33.
          --discard-quality            do not output quality scores
      -q, --quality-trim=INT           perform Heng Li's BWA quality trim algorithm. 
                                       Reads are trimmed according to the formula:
                                       argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT
                                       where l is the original read length.
      -f, --quality-filter=INT         discard the read if it contains more than INT low-quality bases.
                                       Bases with phred score <= 3 are considered low quality. Default: no filtering.
                                       The filtering is applied after trimming so bases removed are not counted.
                                       Do not use this option if you are planning to use the BCR algorithm for indexing.
      -m, --min-length=INT             discard sequences that are shorter than INT
                                       this is most useful when used in conjunction with --quality-trim. Default: 40
      -h, --hard-clip=INT              clip all reads to be length INT. In most cases it is better to use
                                       the soft clip (quality-trim) option.
      --permute-ambiguous              Randomly change ambiguous base calls to one of possible bases.
                                       If this option is not specified, the entire read will be discarded.
      -s, --sample=FLOAT               Randomly sample reads or pairs with acceptance probability FLOAT.
      --dust                           Perform dust-style filtering of low complexity reads.
      --dust-threshold=FLOAT           filter out reads that have a dust score higher than FLOAT (default: 4.0).
      --suffix=SUFFIX                  append SUFFIX to each read ID

Meaning you could use the `sga preprocess` module to trim your input FASTQ data, filter it by discarding reads with a low quality (phred) score or below a minimum length, change ambiguous base calls to one of the four possible bases, and perform dust-style filtering of low-complexity reads. (To my understanding, dust filtering is a component of the BLAST algorithm that masks reads which might be real but aren't necessarily interesting. For example, a long stretch of T's, 'TTTTTTTT', is less complex than a multi-nucleotide stretch like 'TGCTAGCA', so the first would be 'masked'.)
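To make the effect of a filter like `--min-length` concrete, here is a small shell sketch that mimics it with plain `awk` on a generated toy FASTQ file. This is only an illustration of what the filter does, not sga itself; the file names and the threshold of 40 (the sga default shown above) are just for the example:

```shell
# Create a toy FASTQ file with one 40 bp read and one 5 bp read
cat > toy.fastq <<'EOF'
@read1
TGCTAGCATGCTAGCATGCTAGCATGCTAGCATGCTAGCA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
TTTTT
+
IIIII
EOF

# Mimic `sga preprocess --min-length=40`: keep only FASTQ records
# (4 lines each) whose sequence line is at least 40 bases long
awk 'NR % 4 == 1 {h=$0} NR % 4 == 2 {s=$0} NR % 4 == 3 {p=$0}
     NR % 4 == 0 {if (length(s) >= 40) print h "\n" s "\n" p "\n" $0}' \
    toy.fastq > filtered.fastq

# Count reads before and after (a FASTQ record is 4 lines)
echo "before: $(( $(wc -l < toy.fastq) / 4 )) reads"
echo "after:  $(( $(wc -l < filtered.fastq) / 4 )) reads"
```

Running this keeps `@read1` and drops the 5 bp `@read2`, which is exactly the kind of cleanup `sga preprocess` does (along with quality trimming and dust filtering) before the reads are indexed.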

Other than using individual commands from this assembler, or having some specific reason not to use a De Bruijn graph, there probably isn't much reason to use it. If you do need to, the basics are that you can either type in each consecutive command with the parameters it needs (found by using --help on any given command, as shown above with preprocess), or set up a batch job file with all of the command line commands so that it runs through each module consecutively.

A good example of this can be found here:

https://hpc.nih.gov/apps/sga.html
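If you do go the batch-file route, here is a sketch of what such a file might look like, modeled loosely on the NIH example linked above. The input file name, thread counts, and parameter values are placeholders, and option spellings can vary between SGA versions, so check `--help` for each subprogram before running. The snippet below only writes the batch file and syntax-checks it; actually running it requires sga to be installed:

```shell
# Write a batch file that runs the SGA modules consecutively.
# File names and parameters below are illustrative placeholders.
cat > sga_pipeline.sh <<'EOF'
#!/bin/bash
set -e  # stop immediately if any module fails

IN=reads.fastq

# 1. Prepare the reads for assembly (preprocess prints to stdout)
sga preprocess --min-length=40 "$IN" > reads.pp.fastq

# 2. Build the FM index the later modules need
sga index -t 8 reads.pp.fastq

# 3. Correct base calling errors
sga correct -t 8 -o reads.ec.fastq reads.pp.fastq

# 4. Re-index the corrected reads, then remove duplicate/low-quality reads
sga index -t 8 reads.ec.fastq
sga filter -t 8 reads.ec.fastq

# 5. Find overlaps between reads to construct the string graph
sga overlap -t 8 -m 45 reads.ec.filter.pass.fa

# 6. Construct contigs from the string graph
sga assemble -o assembly reads.ec.filter.pass.asqg.gz
EOF

# Syntax-check the batch file without executing it
bash -n sga_pipeline.sh && echo "sga_pipeline.sh: syntax OK"
```

Note how each module's output file becomes the next module's input; that chaining is what makes a batch file convenient compared with typing each command by hand.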

To install SGA Follow these instructions:

SGA on Ubuntu


If you are using an instance that already has SGA installed, skip the installation and start at the step where you make a folder for your data. (06/12/2021) The current custom EC2 SGA AMI for GBI has the ID _________ and name ________. To create an instance from this, follow the instructions on the EC2 page.

If starting from a brand new instance with no previously installed software, then follow all of these steps (further help can be found on the EC2 page):

  1. Start Ubuntu Instance with a 64-bit (ARM) processor
  2. Log in through terminal:
$ ssh -i /path/to/keypairs/keypair.pem [email protected]
  • example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
  3. Since SGA requires an old version of gcc and g++, you cannot use build-essential:
  • Update/upgrade apt-get and some normal dependencies:
  - $ sudo apt-get update &&  sudo apt-get upgrade
  - $ sudo apt-get install clang libboost-all-dev libopenmpi-dev make cmake
  • Add the older Ubuntu (xenial) package sources so that gcc-5 and g++-5 can be installed:
  - $ sudo vim /etc/apt/sources.list
Add the following to the bottom of this file:
deb http://dk.archive.ubuntu.com/ubuntu/ xenial main
deb http://dk.archive.ubuntu.com/ubuntu/ xenial universe
Then press esc, type `:wq!` and press `enter`.
  - $ sudo apt install g++-5 gcc-5
  • Download and set up Google SparseHash Library
    - $ git clone https://github.com/sparsehash/sparsehash.git
    - $ cd sparsehash/
    - /sparsehash$ ./configure
    - /sparsehash$ make
    - /sparsehash$ make check
    - /sparsehash$ sudo make install
    - /sparsehash$ make installcheck
    - /sparsehash$ cd
# add Sparsehash to the PATH
    - $ cd ../../usr/local/include
    - /usr/local/include$ export PATH=$PATH:$(pwd)
  • Download and set up Bamtools
    - $ cd
    - $ git clone https://github.com/pezmaster31/bamtools.git
    - $ cd bamtools/
    - /bamtools$ mkdir build
    - /bamtools$ cd build
    - /bamtools/build$ cmake -D CMAKE_INSTALL_PREFIX=/usr/local/include/ ..
    - /bamtools/build$ make
    - /bamtools/build$ sudo make DESTDIR=/usr/local/include/ install

  • Download and set up Jemalloc
    - $ cd
    - $ git clone https://github.com/jemalloc/jemalloc.git
    - $ cd jemalloc/
    - /jemalloc$ ./autogen.sh 
    - /jemalloc$ make
    - /jemalloc$ sudo make install
  4. Install String Graph Assembler (SGA)
    - $ cd
    - $ git clone https://github.com/jts/sga.git
    - $ cd sga/src
    - /sga/src$ ./autogen.sh
    - /sga/src$ ./configure --with-sparsehash=/home/ubuntu/ --with-bamtools=/usr/local/include/usr/local/include/ --with-jemalloc=/home/ubuntu/jemalloc/lib
    - /sga/src$ make
    - /sga/src$ sudo make install
  5. Make a folder to organize your data and the results that will come from the assembly: mkdir data_folder_name
  • example:
mkdir my_genome_assembly
  6. Copy the data to this new folder
  • From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
  • example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
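After copying the data over, it is worth a quick sanity check that the file arrived intact before starting an assembly. The commands below show one way to do that: count the reads (a FASTQ record is always 4 lines) and take a checksum you can compare against your local copy. The file name is a placeholder, and a small example file is generated here so the commands can be run anywhere:

```shell
# Generate a small stand-in for the uploaded file (placeholder name)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' > sequencing_files.fastq

# Number of reads = number of lines / 4
lines=$(wc -l < sequencing_files.fastq)
echo "reads: $(( lines / 4 ))"

# Compare this checksum against one computed on your local copy
md5sum sequencing_files.fastq
```

If the read count or checksum differs from what you see locally, re-run the scp transfer before going any further.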

Now to use it, please reference the discussion at the top of this GitHub page.

Go back to GBI AWS Wiki