February 2025

Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

Before we start:

Make a directory for analysis output with your NetID (or whatever you want) in our workshop directory:

cd /projects/b1042/OzerLab/workshop/playground

mkdir -p <netid>/covid

cd <netid>/covid

Now activate the assembly conda environment:

conda activate /projects/p30002/condaenvs/covid_assembly_env

Introduction

In this section we'll produce a consensus sequence for a viral genome sequence generated using overlapping PCR amplicons (Figure below). This means some portions of the reads may contain the PCR primer binding sites which will have sequences matching the primer sequences rather than the sequence of the virus you are studying. Other portions of reads from a neighboring amplicon site will have the same regions sequenced but as they are between the primer sites, these sequences will reflect the true virus sequence (hopefully!). So we need to trim primer sequences from the reads after alignment to avoid missing variants.

Figure. Representation of tiled amplicon sequencing strategy. Bright red small boxes are primer binding sites and darker red strips between them represent amplified sequences.

In this example we'll perform alignment and variant calling of a SARS-CoV-2 viral genome.

Step 1A - Align short (Illumina) reads to reference

Align reads to the reference sequence using bwa mem and use samtools view and samtools sort to generate a sorted alignment file in the bam binary format. We're going to use piping (the | character) to send the output of one command directly into the input of the next command. This saves storage space and time. Just don't forget to include the - dash character in the next command to let the program know to expect its input to come from the previous command. The last command samtools index creates a small index file (ending with .bai) that needs to be with the bam file in order for other programs to access it.

Commands

bwa mem \
	/projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta \
	/projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_ws_01_1.fastq.gz \
	/projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_ws_01_2.fastq.gz | \
	samtools view -bS -F 4 - | \
	samtools sort -o cov_ws_01_sorted.bam -
samtools index cov_ws_01_sorted.bam

Settings

bwa

Setting	Description
`mem`	Use the BWA-MEM algorithm for the alignment. This is best for short reads produced by Illumina.
<reference file>	A reference sequence in fasta format. Usually you need to index the reference file using `bwa index <reference file>` before running the alignment, but we're providing a pre-indexed reference to save time.

See the bwa manual http://bio-bwa.sourceforge.net/bwa.shtml for more detail on settings.

samtools

Command	Setting	Description
`view`	`-b`	Output bam-formatted file
`view`	`-S`	Input is sam-formatted file
`view`	`-F`	Filter output. In this case '4' means do not output reads that don't have any alignments to the reference. This saves some disk space.
`sort`	`-o`	Name of the file to output

See the samtools manual http://www.htslib.org/doc/samtools.html for more detail on settings

Outputs

File	Description
cov_ws_01_sorted.bam	Alignment file in bam binary format
cov_ws_01_sorted.bam.bai	Alignment index file

Step 1B - Align long (Nanopore) reads to reference

If you are using long reads generated on the Nanopore sequencer this first step is fairly similar to using short reads, but we will use minimap2 instead of bwa to do the alignment.

minimap2 \
  -x map-ont -a \
  /projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta \
  /projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_nanopore.fastq.gz > cov_nanopore.sam

samtools view -bS -F 4 cov_nanopore.sam > cov_nanopore.bam

samtools sort -o cov_nanopore_sorted.bam cov_nanopore.bam

samtools index cov_nanopore_sorted.bam

Settings

Setting	Description
`-a`	Output SAM-formatted file
`-x map-ont`	Preset for nanopore alignments

Note

The map-ont setting was optimized for alignment of older, more error-prone nanopore reads against a reference. These reads were generated using an older R9.4.1 flowcell. Newer nanopore flowcells and basecallers are much more accurate. When aligning reads produced using R10.4.1 flowcells with high- or super-accuracy basecalling, you'll most likely use the lr:hq setting instead.

Step 2 - Trim primers

Trim amplification primers from the reads in the alignment using iVar.

Commands

ivar trim \
    -q 20 \
    -i cov_ws_01_sorted.bam \
    -b /projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.scheme.bed \
    -p cov_ws_01.trimmed
samtools sort -o cov_ws_01.trimmed_sorted.bam cov_ws_01.trimmed.bam
samtools index cov_ws_01.trimmed_sorted.bam

Settings

ivar

Command	Setting	Description
`trim`	`-q`	Minimum sliding window quality threshold
`trim`	`-i`	Input alignment file, in bam format
`trim`	`-b`	File containing coordiinates of primer binding sites along the reference genome, in BED format.
`trim`	`-p`	Prefix of output files

Review the iVar manual https://andersen-lab.github.io/ivar/html/manualpage.html if you want more detail about commands and settings.

Outputs

File	Description
COV1650_trimmed_sorted.bam	Alignment file in bam binary format with primer sequences trimmed from reads
COV1650_trimmed_sorted.bam.bai	Alignment index file

Step 3 - Variant calling

Use iVar to generate a table of variants as well as a consensus genome sequence. We'll first use samtools mpileup to generate a pileup file of read base calls at each position along the reference. This pileup file will be used as input to iVar.

Commands

Command 3.1: Generate pileup file

samtools mpileup \
    -aa \
    -A \
    -d 0 \
    -Q 0 \
    --reference /projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta  \
    cov_ws_01.trimmed_sorted.bam \
    > cov_ws_01.pileup.txt

Command 3.2: Call variants based on pileup output

ivar variants \
    -p cov_ws_01.variants \
    -t 0.03 \
    < cov_ws_01.pileup.txt

Command 3.3: Generate consenus sequence based on pileup output

ivar consensus \
    -m 10 \
    -q 20 \
    -t 0 \
    -p cov_ws_01.consensus \
    < cov_ws_01.pileup.txt

Settings

samtools mpileup

Setting	Description
`-aa`	Output base calls at absolutely all genome positions, even those with no aligned reads
`-A`	Do not discard anomalous pairs, i.e. pairs of reads where only one aligned to the reference
`-d`	Maximmum read depth. Setting to '0' disables limit
`-Q`	Minimum read base quality. Setting to '0' disables limit. We're going to filter in iVar.
`--reference`	Reference genome sequence, in fasta format

ivar

Command	Setting	Description
`variants`	`-p`	Output file prefix
`variants`	`-t`	Minimum frequence threshold to call variants. '0.03' means at least 3% of reads covering a position must be different than the reference base to be output.
`consensus`	`-m`	Minimum read depth to call consenus
`consensus`	`-q`	Minimum quality score threshold to count base
`consensus`	`-t`	Minim frequency threshold to call consensus. '0' means that the most common base at a position will be called the consensus.
`consensus`	`-p`	Output file prefix

Outputs:

File	Description
COV1650.variants.tsv	Tab-delmited file showing variants and their read support
COV1650.consensus.fa	Consensus genome sequence including all variants passing filters

You can view in the variants.tsv file in your terminal using less COV1650.variants.tsv or in a text editor or you can open it in Excel. These are the columns in the file:

REGION: Reference sequence name
POS: Position in the reference
REF: Base in the reference sequence
ALT: Base in the sequenced genome (ALTernate)
REF_DP: Number of reads with the reference base
REF_RV: Number of reference base reads aligning in the reverse orientation
REF_QUAL: Average quality of the reference bases
ALT_DP: Number of reads with the alternate base
ALT_RV: Number of alternate bases on reverse reads
ALT_QUAL: Average quality of the alternate bases
ALT_FREQ: Frequency (proportion) of the alternate base
TOTAL_DP: Total read depth at this position
PVAL: p-value of Fisher's exact test
PASS: "True" if p-value <= 0.05

The file also includes the columns "GFF_FEATURE", "REF_CODON", "REF_AA", "ALT_CODON", and "ALT_AA" but these will be empty because we didn't give any annotation information to ivar variants.

Is there already a script that does all these steps for me?

You know there is.

mkdir covid_batch

cd covid_batch

# Copy the list of file prefixes to this directory
cp /projects/b1042/OzerLab/workshop/covid/workshop_reads/list.txt .

This script runs all the commands above on a list of prefixes, but also adds a read trimming step prior to alignment. Again, don't forget to add your email address surrounded by single quotes the -e flag.

perl /projects/p30002/batch_scripts/covid_assembly_batch.pl \
  -o . \
  -r /projects/b1042/OzerLab/workshop/covid/workshop_reads \
  -s list.txt \
  -e '<your email address>'

Assessing variants with Nextclade

There are some command line tools that can be used to call SARS-CoV-2 lineages from consensus sequence assemblies, but honestly the online tool Nextclade is generally more up-to-date and easier to use.

Getting your data in to Nextclade is a snap, too.

Make sure you are in your covid_batch directory and all job arrays have finished.

## Clear your terminal first
clear

## Output the consensus.fa sequence results to the terminal.
cat */*.fa

Highlight all the sequences and copy (cmd-v or ctrl-v).

Now navigate to https://clades.nextstrain.org/ in your browser and in the "Provide sequence data" box click "Text" and paste your copied sequences into the box.

Make sure the "SARS-CoV-2" dataset is selected in the "Select reference dataset" box. Sometimes Nextclade will pick something weird like "SARS-CoV-2 (BA.2)" which will make your results a little funky.

Click the "Run" button to view your results. These can be exported as a .csv or .tsv file that can be imported into Excel or other programs as well.

Viewing alignments with Tablet

You can look at bam-formatted alignment files right in your Terminal using the command samtools view cov_ws_01.trimmed.sorted.bam | less. Make sure you're piping the output to less or else you'll fill your Terminal with thousands of lines. Just a little hint if this happens: you can interrupt a process in your Terminal before it finishes using the Ctrl-C keys.

An easier way to visualize aligments in bam files is using a program like Tablet:

Open Tablet and use the "Open Assembly" button in the to left to select your files. To view an alignment file, you'll need to give Tablet two files: 1) the alignment bam file, cov_ws_01.trimmed.sorted.bam (make sure its .bai index file is in the same directory as the .bam file) and 2) the reference fasta sequence we used, SARS-CoV-2.reference.fasta.
Once loaded, select the sequence you want to view in the left bar. Since our reference file only had one sequence, there is only one entry in the list.
Let's examine one variant location from the variants file. There were three variants identified from postions 28,881 to 28,883. You could drag the window over to this position, but it's easier to use the "Jump to Base" function in the top bar. Select "Jump to Base", type "28881" into the box, and click "Padded Jump".
Make sure you have the "Tag Variants" coloring scheme selected in the "Visual" portion of the top bar to quickly see variants in the aligned reads relative to the reference sequence.