February 2025

Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

Figure 1: Reference-based alignment example. From: McVicar, Nathaniel et al. ArXiv abs/1805.00106 (2018)

In this section we're going to perform reference-based alignments using bacterial sequencing reads. Alignments can be more sensitive to identifying single nucleotide variants and small insertions or deletions. This is especially the case if you have repeated sequences or mixed populations as might be the case in some viral infections. Also, if your reference is very similar to your sequenced isolate, alignment can allow for greater confidence in variant calls and direct comparisons.

Before we start:

Make a directory for analysis output with your NetID (or whatever you want) in our workshop directory:

cd /projects/b1042/OzerLab/workshop/playground

mkdir -p <netid>/alignment

cd <netid>/alignment

Now activate the assembly conda environment:

conda activate /projects/p30002/condaenvs/alignment_env

Step 1 - Align reads to reference sequence

We'll start with alignment of the S. pyogenes reads we used in the assembly example against the reference genome using bwa.

bwa mem \
  /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta \
  /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
  /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz | \
  samtools view -Sbt /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta.fai -F 4 - | \
  samtools sort -o GAS.sorted.bam -

samtools index GAS.sorted.bam

Step 2 - Mark duplicate reads

Duplicate reads can occur due to overamplification of the squencing library with PCR or preferential amplification of a region or regions. Optical duplication is when large cluster on the sequencing flow cell are counted more than once. In either case, the result is over-representation of certain reads which can lead bias in the results.

These steps will mark duplicate reads in the alignment so they can be accounted for in later analysis steps.

samtools sort \
  -n -o GAS.namesort.bam GAS.sorted.bam

samtools fixmate \
  -m GAS.namesort.bam GAS.fixmate.bam

samtools sort \
  -o GAS.positionsort.bam GAS.fixmate.bam

samtools markdup \
  GAS.positionsort.bam GAS.markdup.bam

samtools index GAS.markdup.bam

Step 3 - Call consensus with bcftools

We'll use bcftools mpileup to generate a pileup of the alignment and pipe that to bcftools call to generate a VCF file of single nucleotide variants

bcftools mpileup \
  -E -Q 25 -q 30 -m 2 -a DP,SP,AD,ADF,ADR --ns 3072 \
  -f /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta -Ou GAS.markdup.bam | \
  bcftools call -m -V indels -Ov --ploidy 1 - \
  > GAS.vcf

mpileup options:

Option	Explanation
`-E`	Recalculate per-Base Alignment Quality (BAQ)
`-Q 25`	Skip bases with BAQ < 25
`-q 30`	Skip alignments with mapping quality < 30
`-m 2`	Minimum number gapped reads for indel candidates (default)
`-a`	Add additional information to the output (see VCF link above for more info)
`--ns 3072`	Skip any reads with SAM flag 3072, i.e. marked as duplicate or supplementary

call options:

Option	Explanation
`-m`	Multiallelic base caller
`-V indels`	Skip indel sites (only output SNVs)
`-Ov`	Output uncompressed VCF-format file
`--ploidy 1`	Assume reference is haploid

Step 4 - Generate consensus sequence

This script uses the output of the VCF file to generate a consensus sequence based on the reference.

perl /projects/p30002/Scripts/bcftools_filter.pl \
  -f /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.fasta \
  -h \
  -o GAS \
  GAS.vcf \
  > GAS.consensus.fasta \
  2> GAS.bcftools_filter_log.txt

I'll bet there isn't a script to do all these steps in one go.

Au contraire, mon frere!

mkdir alignment_batch

cd alignment_batch

# Copy the list of file prefixes to this directory
cp /projects/b1042/OzerLab/workshop/bacteria_alignment/workshop_reads/list.txt .

You have to provide the reference genome sequence to the -f option. As always, don't forget to add your email address surrounded by single quotes the -e flag.

perl /projects/p30002/batch_scripts/alignment_batch.pl \
  -o . \
  -r /projects/b1042/OzerLab/workshop/bacteria_alignment/workshop_reads \
  -s list.txt \
  -f /projects/b1042/OzerLab/workshop/reference/PAO1.fasta \ 
  -e '<your email address>'

And here's a little script to output some statistics about the alignments:

perl /projects/p30002/batch_scripts/alignment_batch_stats.pl list.txt

Alternative approach: Snippy

Snippy is a nice "all-in-one" pipeline for generating alignments and using those alignments to determine variants. If you supply a genbank file as the reference sequence, the program will also annnotate the variants, i.e. call synonymous, non-synonymous, or frameshift. This works well for single alignments and annotating variants, but is a little harder to work with for phylogenetic analyses of multiple sequences (YMMV).

Commands

Run this command as a Slurm job script.

snippy.sh

#!/bin/bash
#SBATCH --job-name="snippy"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o snippy.out
#SBATCH -e snippy.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/alignment_env

snippy \
	--outdir GAS_snippy \
	--reference /projects/b1042/OzerLab/workshop/reference/GAS_NGAS638.gbk \
	--R1 /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
	--R2 /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz \
	--cpus 12

Settings

Setting	Description
`--outdir`	Name of the directory the output files will go into. You'll get an error if this directory already exists
`--reference`	Genome sequence to use as an alignment reference. Can be fasta or genbank file.
`--R1` & `--R2`	Paired-end read files
`--cpus`	Number of compute cores to use.

For more detail on settings, visit https://github.com/tseemann/snippy or run snippy --help

Outputs

Adapted from https://github.com/tseemann/snippy. Follow the link to see the full list of output files.

Extension	Description
.tab	A simple tab-separated summary of all the variants
.html	A HTML version of the .tab file for viewing in a web browser like Chrome or Safari
.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai	Index for the .bam file
.aligned.fa	A version of the reference but with `-` at position with `depth=0` and `N` for `0 < depth < --mincov` (Note: snippy manual says this file "does not have variants," but in the current version 4.6.0 it does incorporate the single nucleotide variants)
.consensus.fa	A version of the reference genome with all variants instantiated (substitutions and indels), but no masking of positions under minimum depth
.consensus.subs.fa	A version of the reference genome with only substitution variants instantiated, but no masking of positions under minimum depth
.log	A log file with the commands run and their outputs

Back to table of contents

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

11 Alignment and Variant Calling - NU-CPGME/quest_genomics_2025 GitHub Wiki

Step 1 - Align reads to reference sequence

Step 2 - Mark duplicate reads

Step 3 - Call consensus with bcftools

Step 4 - Generate consensus sequence

I'll bet there isn't a script to do all these steps in one go.

Alternative approach: Snippy

Back to table of contents

⚠️ GitHub.com Fallback ⚠️

11 Alignment and Variant Calling - NU-CPGME/quest_genomics_2025 GitHub Wiki

Step 1 - Align reads to reference sequence

Step 2 - Mark duplicate reads

Step 3 - Call consensus with bcftools

Step 4 - Generate consensus sequence

I'll bet there isn't a script to do all these steps in one go.

Alternative approach: Snippy

Back to table of contents

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️