11 Alignment and Variant Calling - NU-CPGME/quest_genomics_2025 GitHub Wiki
February 2025
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

Figure 1: Reference-based alignment example. From: McVicar, Nathaniel et al. ArXiv abs/1805.00106 (2018)
In this section we're going to perform reference-based alignments using bacterial sequencing reads. Alignments can be more sensitive to identifying single nucleotide variants and small insertions or deletions. This is especially the case if you have repeated sequences or mixed populations as might be the case in some viral infections. Also, if your reference is very similar to your sequenced isolate, alignment can allow for greater confidence in variant calls and direct comparisons.
Before we start:
Make a directory for analysis output with your NetID (or whatever you want) in our workshop directory:
cd /projects/b1042/OzerLab/workshop/playground
mkdir -p <netid>/alignment
cd <netid>/alignment
Now activate the assembly conda environment:
conda activate /projects/p30002/condaenvs/alignment_env
We'll start with alignment of the S. pyogenes reads we used in the assembly example against the reference genome using bwa
.
bwa mem \
/projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta \
/projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
/projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz | \
samtools view -Sbt /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta.fai -F 4 - | \
samtools sort -o GAS.sorted.bam -
samtools index GAS.sorted.bam
Duplicate reads can occur due to overamplification of the squencing library with PCR or preferential amplification of a region or regions. Optical duplication is when large cluster on the sequencing flow cell are counted more than once. In either case, the result is over-representation of certain reads which can lead bias in the results.
These steps will mark duplicate reads in the alignment so they can be accounted for in later analysis steps.
samtools sort \
-n -o GAS.namesort.bam GAS.sorted.bam
samtools fixmate \
-m GAS.namesort.bam GAS.fixmate.bam
samtools sort \
-o GAS.positionsort.bam GAS.fixmate.bam
samtools markdup \
GAS.positionsort.bam GAS.markdup.bam
samtools index GAS.markdup.bam
We'll use bcftools mpileup to generate a pileup of the alignment and pipe that to bcftools call to generate a VCF file of single nucleotide variants
bcftools mpileup \
-E -Q 25 -q 30 -m 2 -a DP,SP,AD,ADF,ADR --ns 3072 \
-f /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta -Ou GAS.markdup.bam | \
bcftools call -m -V indels -Ov --ploidy 1 - \
> GAS.vcf
mpileup
options:
Option | Explanation |
---|---|
-E |
Recalculate per-Base Alignment Quality (BAQ) |
-Q 25 |
Skip bases with BAQ < 25 |
-q 30 |
Skip alignments with mapping quality < 30 |
-m 2 |
Minimum number gapped reads for indel candidates (default) |
-a |
Add additional information to the output (see VCF link above for more info) |
--ns 3072 |
Skip any reads with SAM flag 3072, i.e. marked as duplicate or supplementary |
call
options:
Option | Explanation |
---|---|
-m |
Multiallelic base caller |
-V indels |
Skip indel sites (only output SNVs) |
-Ov |
Output uncompressed VCF-format file |
--ploidy 1 |
Assume reference is haploid |
This script uses the output of the VCF file to generate a consensus sequence based on the reference.
perl /projects/p30002/Scripts/bcftools_filter.pl \
-f /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta \
-h \
-o GAS \
GAS.vcf \
> GAS.consensus.fasta \
2> GAS.bcftools_filter_log.txt
Au contraire, mon frere!
mkdir alignment_batch
cd alignment_batch
# Copy the list of file prefixes to this directory
cp /projects/b1042/OzerLab/workshop/bacteria_alignment/workshop_reads/list.txt .
You have to provide the reference genome sequence to the -f
option. As always, don't forget to add your email address surrounded by single quotes the -e
flag.
perl /projects/p30002/batch_scripts/alignment_batch.pl \
-o . \
-r /projects/b1042/OzerLab/workshop/bacteria_alignment/workshop_reads \
-s list.txt \
-f /projects/b1042/OzerLab/workshop/reference/PAO1.fasta \
-e '<your email address>'
And here's a little script to output some statistics about the alignments:
perl /projects/p30002/batch_scripts/alignment_batch_stats.pl list.txt
Snippy is a nice "all-in-one" pipeline for generating alignments and using those alignments to determine variants. If you supply a genbank file as the reference sequence, the program will also annnotate the variants, i.e. call synonymous, non-synonymous, or frameshift. This works well for single alignments and annotating variants, but is a little harder to work with for phylogenetic analyses of multiple sequences (YMMV).
Commands
Run this command as a Slurm job script.
snippy.sh
#!/bin/bash
#SBATCH --job-name="snippy"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o snippy.out
#SBATCH -e snippy.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/alignment_env
snippy \
--outdir GAS_snippy \
--reference /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.gbk \
--R1 /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
--R2 /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz \
--cpus 12
Settings
Setting | Description |
---|---|
--outdir |
Name of the directory the output files will go into. You'll get an error if this directory already exists |
--reference |
Genome sequence to use as an alignment reference. Can be fasta or genbank file. |
--R1 & --R2
|
Paired-end read files |
--cpus |
Number of compute cores to use. |
For more detail on settings, visit https://github.com/tseemann/snippy or run snippy --help
Outputs
Adapted from https://github.com/tseemann/snippy. Follow the link to see the full list of output files.
Extension | Description |
---|---|
.tab | A simple tab-separated summary of all the variants |
.html | A HTML version of the .tab file for viewing in a web browser like Chrome or Safari |
.bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
.bam.bai | Index for the .bam file |
.aligned.fa | A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (Note: snippy manual says this file "does not have variants," but in the current version 4.6.0 it does incorporate the single nucleotide variants) |
.consensus.fa | A version of the reference genome with all variants instantiated (substitutions and indels), but no masking of positions under minimum depth |
.consensus.subs.fa | A version of the reference genome with only substitution variants instantiated, but no masking of positions under minimum depth |
.log | A log file with the commands run and their outputs |
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.