2B Reference Based Alignment - NU-CPGME/sl_workshop_2024 GitHub Wiki
July / August, 2024
Developed by:
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

Figure 1: Reference-based alignment example. From: McVicar, Nathaniel et al. ArXiv abs/1805.00106 (2018)
In this section we're going to perform reference-based alignments using the sequencing reads. Alignments can be more sensitive to identifying single nucleotide variants and small insertions or deletions. This is especially the case if you have repeated sequences or mixed populations as might be the case in some viral infections. Also, if your reference is very similar to your sequenced isolate, alignment can allow for greater confidence in variant calls and direct comparisons.
We'll start with alignment of the S. pyogenes reads against the reference genome. Snippy is a nice "all-in-one" pipeline for generating alignments and using those alignments to determine variants. If you supply a genbank file as the reference sequence, the program will also annnotate the variants, i.e. call synonymous, non-synonymous, or frameshift.
Commands
micromamba deactivate
micromamba activate alignment
cd ~/sl_workshop_2024/demo_data
snippy \
--outdir GAS_alignment \
--reference reference/GAS_NGAS638.gbk \
--R1 reads/GAS_1.fastq.gz \
--R2 reads/GAS_2.fastq.gz \
--cpus 1 \
--ram 4
Note
In the command above we are limiting the alignment to just one thread (--cpus 1
) and a maximum of 4 Gb of memory (--ram 4
). This is only because we are working with limited computer resources in the VM for the workshop. On a normal laptop or workstation you would probably use more threads and shouldn't have to set the --ram
option at all.
Click here for a note on MacOS installation
>
Settings
Setting | Description |
---|---|
--outdir |
Name of the directory the output files will go into. You'll get an error if this directory already exists |
--reference |
Genome sequence to use as an alignment reference. Can be fasta or genbank file. |
--R1 & --R2
|
Paired-end read files |
--cpus |
Number of compute cores to use. The default is 8. I set it a bit lower in the example command to account for the variety of computers being used in the workshop. Generally you'll want to use all available cores for faster analysis. |
--ram |
Limits RAM usage. The VirtualMachine has very limited RAM so we're limiting usagee here to 4 Gb. Usually you will not need to set this. |
For more detail on settings, visit https://github.com/tseemann/snippy or run snippy --help
Outputs
Adapted from https://github.com/tseemann/snippy. Follow the link to see the full list of output files.
Extension | Description |
---|---|
.tab | A simple tab-separated summary of all the variants |
.html | A HTML version of the .tab file for viewing in a web browser like Chrome or Safari |
.bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
.bam.bai | Index for the .bam file |
.aligned.fa | A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (Note: snippy manual says this file "does not have variants," but in the current version 4.6.0 it does incorporate the single nucleotide variants) |
.consensus.fa | A version of the reference genome with all variants instantiated (substitutions and indels), but no masking of positions under minimum depth |
.consensus.subs.fa | A version of the reference genome with only substitution variants instantiated, but no masking of positions under minimum depth |
.log | A log file with the commands run and their outputs |
In this section we'll perform an alignment using a viral genome instead. This is somewhat different than the bacterial genome example above. For instance, the viral genome sequencing library was generated using overlapping PCR amplicons (Figure 2). This means some portions of the reads may contain the PCR primer binding sites which will have sequences matching the primer sequences rather than the sequence of the virus you are studying. Other portions of reads from a neighboring amplicon site will have the same regions sequenced but as they are between the primer sites, these sequences will reflect the true virus sequence (hopefully!). So we need to trim primer sequences from the reads after alignment to avoid missing variants.

Figure 2. Representation of tiled amplicon sequencing strategy. Bright red small boxes are primer binding sites and darker red strips between them represent amplified sequences.
In this example we'll perform alignment and variant calling of a SARS-CoV-2 viral genome.
Align reads to the reference sequence using bwa mem
and use samtools view
and samtools sort
to generate a sorted alignment file in the bam binary format. We're going to use piping (the |
character) to send the output of one command directly into the input of the next command. This saves space and time. Just don't forget to include the -
dash character in the next command to let the program know to expect its input is coming from the previous command. The last command samtools index
creates a small index file (ending with .bai
) that needs to be with the bam file in order for other programs to access it.
Commands
bwa mem \
reference/nCoV-2019.reference.fasta \
reads/COV1650_1.fastq.gz \
reads/COV1650_2.fastq.gz | \
samtools view -bS -F 4 - | \
samtools sort -o COV1650_sorted.bam -
samtools index COV1650_sorted.bam
Settings
bwa
Setting | Description |
---|---|
mem |
Use the BWA-MEM algorithm for the alignment. This is best for short reads produced by Illumina. |
<reference file> | A reference sequence in fasta format. Usually you need to index the reference file using bwa index <reference file> before running the alignment, but we're providing a pre-indexed reference to save time. |
See the bwa manual http://bio-bwa.sourceforge.net/bwa.shtml for more detail on settings.
samtools
Command | Setting | Description |
---|---|---|
view |
-b |
Output bam-formatted file |
view |
-S |
Input is sam-formatted file |
view |
-F |
Filter output. In this case '4' means do not output reads that don't have any alignments to the reference. This saves some disk space. |
sort |
-o |
Name of the file to output |
See the samtools manual http://www.htslib.org/doc/samtools.html for more detail on settings
Outputs
File | Description |
---|---|
COV1650_sorted.bam | Alignment file in bam binary format |
COV1650_sorted.bam.bai | Alignment index file |
Trim amplification primers from the reads in the alignment using iVar.
Commands
ivar trim \
-q 20 \
-i COV1650_sorted.bam \
-b reference/nCoV-2019.ivar.bed \
-p COV1650.trimmed
samtools sort -o COV1650.trimmed_sorted.bam COV1650.trimmed.bam
samtools index COV1650.trimmed_sorted.bam
Settings
ivar
Command | Setting | Description |
---|---|---|
trim |
-q |
Minimum sliding window quality threshold |
trim |
-i |
Input alignment file, in bam format |
trim |
-b |
File containing coordiinates of primer binding sites along the reference genome, in BED format. |
trim |
-p |
Prefix of output files |
Review the iVar manual https://andersen-lab.github.io/ivar/html/manualpage.html if you want more detail about commands and settings.
Outputs
File | Description |
---|---|
COV1650_trimmed_sorted.bam | Alignment file in bam binary format with primer sequences trimmed from reads |
COV1650_trimmed_sorted.bam.bai | Alignment index file |
Use iVar to generate a table of variants as well as a consensus genome sequence. We'll first use samtools mpileup
to generate a pileup file of read base calls at each position along the reference. This pileup file will be used as input to iVar.
Commands
Command 2.3.1: Generate pileup file
samtools mpileup \
-aa \
-A \
-d 0 \
-Q 0 \
--reference reference/nCoV-2019.reference.fasta \
COV1650.trimmed_sorted.bam \
> COV1650.pileup.txt
Command 2.3.2: Call variants based on pileup output
ivar variants \
-p COV1650.variants \
-t 0.03 \
< COV1650.pileup.txt
Command 2.3.3: Generate consenus sequence based on pileup output
ivar consensus \
-m 10 \
-q 20 \
-t 0 \
-p COV1650.consensus \
< COV1650.pileup.txt
Settings
samtools mpileup
Setting | Description |
---|---|
-aa |
Output base calls at absolutely all genome positions, even those with no aligned reads |
-A |
Do not discard anomalous pairs, i.e. pairs of reads where only one aligned to the reference |
-d |
Maximmum read depth. Setting to '0' disables limit |
-Q |
Minimum read base quality. Setting to '0' disables limit. We're going to filter in iVar. |
--reference |
Reference genome sequence, in fasta format |
ivar
Command | Setting | Description |
---|---|---|
variants |
-p |
Output file prefix |
variants |
-t |
Minimum frequence threshold to call variants. '0.03' means at least 3% of reads covering a position must be different than the reference base to be output. |
consensus |
-m |
Minimum read depth to call consenus |
consensus |
-q |
Minimum quality score threshold to count base |
consensus |
-t |
Minim frequency threshold to call consensus. '0' means that the most common base at a position will be called the consensus. |
consensus |
-p |
Output file prefix |
Outputs:
File | Description |
---|---|
COV1650.variants.tsv | Tab-delmited file showing variants and their read support |
COV1650.consensus.fa | Consensus genome sequence including all variants passing filters |
You can view in the variants.tsv file in your terminal using less COV1650.variants.tsv
or in a text editor or you can open it in Excel. These are the columns in the file:
- REGION: Reference sequence name
- POS: Position in the reference
- REF: Base in the reference sequence
- ALT: Base in the sequenced genome (ALTernate)
- REF_DP: Number of reads with the reference base
- REF_RV: Number of reference base reads aligning in the reverse orientation
- REF_QUAL: Average quality of the reference bases
- ALT_DP: Number of reads with the alternate base
- ALT_RV: Number of alternate bases on reverse reads
- ALT_QUAL: Average quality of the alternate bases
- ALT_FREQ: Frequency (proportion) of the alternate base
- TOTAL_DP: Total read depth at this position
- PVAL: p-value of Fisher's exact test
- PASS: "True" if p-value <= 0.05
The file also includes the columns "GFF_FEATURE", "REF_CODON", "REF_AA", "ALT_CODON", and "ALT_AA" but these will be empty because we didn't give any annotation information to ivar variants
.
You can look at bam-formatted alignment files right in your Terminal using the command
samtools view COV1650.trimmed_sorted.bam | less
We're piping the output to less
or else you'll fill your Terminal with thousands of lines. Hit q
to exit when you're done.
An easier way to visualize aligments in bam files is to use Tablet:
-
Open Tablet from "Other" in your program menu in the bottom left hand corner. Use the "Open Assembly" button in the to left to select your files. To view an alignment file, you'll need to give Tablet two files: 1) the alignment bam file, COV1650.trimmed_sorted.bam (make sure its .bai index file is in the same directory as the .bam file) and 2) the reference fasta sequence we used, nCoV-2019.reference.fasta.
-
Once loaded, select the sequence you want to view in the left bar. Since our reference file only had one sequence, there is only one entry in the list.
-
Let's examine one variant location from the variants file. There were three variants identified from postions 28,881 to 28,883. You could drag the window over to this position, but it's easier to use the "Jump to Base" function in the top bar. Select "Jump to Base", type "28881" into the box, and click "Padded Jump".
-
Make sure you have the "Tag Variants" coloring scheme selected in the "Visual" portion of the top bar to quickly see variants in the aligned reads relative to the reference sequence.
Using Tablet can be helpful for quickly visualizing and validating important or suspicious variants in read alignments.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.