1E Short Read Sequence Assembly - NU-CPGME/sl_workshop_2024 GitHub Wiki
July / August, 2024
Developed by:
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])
Now we'll generate a de novo whole genome assembly from our trimmed reads. For this we'll use the assembler SPAdes (Github site).
Commands
Activate mamba environment if you haven't already.
micromamba deactivate
micromamba activate assembly
Navigate to the demo_data
folder if you're not already there.
cd ~/sl_workshop_2024/demo_data
The following commands will generate a de novo assembly of a Streptococcus pyogenes genome from Illumina short reads using SPAdes:
spades.py \
-o GAS_assembly \
-1 GAS_trimmed_paired_1.fastq.gz \
-2 GAS_trimmed_paired_2.fastq.gz \
--cov-cutoff auto \
--only-assembler \
-k 33,77,127
Explanation of Settings
Setting | Descripton |
---|---|
-o |
Name of the directory where the results will be output |
-1 & -2
|
Forward and reverse read paired-end files, respectively |
--cov-cutoff auto |
Remove contigs below a minimum amount of read coverage |
See SPAdes manual for version 3.15.2 for more detail on available options.
Note
Usually when you are using SPAdes to perform an assembly you'll also want to use the --careful
setting. This setting reduces mismatches and short indels by aligning reads back to the contigs to check for errors. We are only skipping it here to save processing time in the workshop.
We are also setting the kmer sizes (-k
) manually here to save processing time for the workshop. Best practice is to let Spades set the kmer sizes automatically.
Outputs
All of the output files can be found in the GAS_assembly
folder. There are a lot of files in there, so we're just going to pick a few of the most relevant to describe.
Files | Description |
---|---|
contigs.fasta |
Assembly contig sequences. |
scaffolds.fasta |
Scaffold sequnences. Scaffolds consist of contigs joined by N's at sites with paired reads support. |
spades.log |
Log file. Useful for troubleshooting poor or failed assemblies. |
assembly_graph.fastg |
Graph of assembly showing contig connections. Can be useful for assembly quality assessment. |
Visualizing the Assembly Graph
Sometimes looking at the assembly graph produced by Spades can be useful to get a sense of how contiguous the assembly is.
To do so we can use Bandage.
sudo apt install bandage
Bandage
In the Bandage window, select File --> Load Graph and select the "assembly_graph.fastg" graph file in the GAS_assembly directory.

In the graph the thick colored bars are the assembly sequence contigs with sizes proportional to their sequence lengths. The thin black lines show regions of ambiguity where the assembler identified two or more possible linkages between contigs but could not resolve them. You can drag the sequences in the window or zoom in for a better look at certain regions, or you can overlay the contigs with information such as the read depth.
After assembly you'll want some sense of the quality of the assembly. This usually includes, but is not limited to, the total number of contigs or scaffolds, the total length of the sequence, and the GC content.
- Low numbers of contigs usually indicates a pretty good assembly where many unambiguous connections could be made. High numbers of contigs may indicate low read coverage or possible genome or read contamination from another source (though not always)
- Total contig length (sum of the lengths of all contigs) should be compared to the expected genome size of the species or organism. NCBI Genome is a good source of expected genome sizes.
- Percent GC content (i.e. the percentage of the sequence that is either cytosine or guanine bases) should also be compared to the expected genome GC content.
- N50 is length of the contig such that contigs that length or longer account for >= 50% of the total assembly size. A small N50 value may indicate a more fragmented assembly.
A tool that can be used for calculating these values is Quast. This software is available as either a command-line program or through an interactive website: http://cab.cc.spbu.ru/quast/. As the web version is intermittently down, we have already installed the command-line version with micromamba.
Commands
quast GAS_assembly/contigs.fasta
Settings
Clearly there are not very many settings for the basic quast functionality, just the path to the contigs file.
You can also use quast to compare to a closely related genome sequence using the -r
setting to determine possible misassemblies. Just be warned that if your reference genome sequence is not very closely related, you may end up with a number of false-positive missassembly warnings. Caveat emptor.
Outputs
By default, outputs will be put into a directory called "quast_results". Here are a few of the output files worth noting:
Files | Description |
---|---|
report.html |
A report file that can be opened in a web browser like Chrome or Safari. |
report.pdf |
A pdf version of the assembly report |
report.tsv |
A tab-separated version of the report table that can be viewed in a spreadsheet program like Excel or in a text editor |
Below is some example output from our assembly viewed in a web browser.

FYI: The N50 value is used to assess assembly contiguity. It is defined as the length of the shortest contig for which longer and equal length contigs cover at least 50% of the assembly.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.