10 Viral Alignment Consensus - NU-CPGME/quest_genomics_2025 GitHub Wiki
February 2025
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])
Before we start:
Make a directory for analysis output with your NetID (or whatever you want) in our workshop directory:
cd /projects/b1042/OzerLab/workshop/playground
mkdir -p <netid>/covid
cd <netid>/covid
Now activate the assembly conda environment:
conda activate /projects/p30002/condaenvs/covid_assembly_env
In this section we'll produce a consensus sequence for a viral genome sequence generated using overlapping PCR amplicons (Figure below). This means some portions of the reads may contain the PCR primer binding sites which will have sequences matching the primer sequences rather than the sequence of the virus you are studying. Other portions of reads from a neighboring amplicon site will have the same regions sequenced but as they are between the primer sites, these sequences will reflect the true virus sequence (hopefully!). So we need to trim primer sequences from the reads after alignment to avoid missing variants.

Figure. Representation of tiled amplicon sequencing strategy. Bright red small boxes are primer binding sites and darker red strips between them represent amplified sequences.
In this example we'll perform alignment and variant calling of a SARS-CoV-2 viral genome.
Align reads to the reference sequence using bwa mem
and use samtools view
and samtools sort
to generate a sorted alignment file in the bam binary format. We're going to use piping (the |
character) to send the output of one command directly into the input of the next command. This saves storage space and time. Just don't forget to include the -
dash character in the next command to let the program know to expect its input to come from the previous command. The last command samtools index
creates a small index file (ending with .bai
) that needs to be with the bam file in order for other programs to access it.
Commands
bwa mem \
/projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta \
/projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_ws_01_1.fastq.gz \
/projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_ws_01_2.fastq.gz | \
samtools view -bS -F 4 - | \
samtools sort -o cov_ws_01_sorted.bam -
samtools index cov_ws_01_sorted.bam
Settings
bwa
Setting | Description |
---|---|
mem |
Use the BWA-MEM algorithm for the alignment. This is best for short reads produced by Illumina. |
<reference file> | A reference sequence in fasta format. Usually you need to index the reference file using bwa index <reference file> before running the alignment, but we're providing a pre-indexed reference to save time. |
See the bwa manual http://bio-bwa.sourceforge.net/bwa.shtml for more detail on settings.
samtools
Command | Setting | Description |
---|---|---|
view |
-b |
Output bam-formatted file |
view |
-S |
Input is sam-formatted file |
view |
-F |
Filter output. In this case '4' means do not output reads that don't have any alignments to the reference. This saves some disk space. |
sort |
-o |
Name of the file to output |
See the samtools manual http://www.htslib.org/doc/samtools.html for more detail on settings
Outputs
File | Description |
---|---|
cov_ws_01_sorted.bam | Alignment file in bam binary format |
cov_ws_01_sorted.bam.bai | Alignment index file |
If you are using long reads generated on the Nanopore sequencer this first step is fairly similar to using short reads, but we will use minimap2
instead of bwa
to do the alignment.
minimap2 \
-x map-ont -a \
/projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta \
/projects/b1042/OzerLab/workshop/covid/workshop_reads/cov_nanopore.fastq.gz > cov_nanopore.sam
samtools view -bS -F 4 cov_nanopore.sam > cov_nanopore.bam
samtools sort -o cov_nanopore_sorted.bam cov_nanopore.bam
samtools index cov_nanopore_sorted.bam
Settings
Setting | Description |
---|---|
-a |
Output SAM-formatted file |
-x map-ont |
Preset for nanopore alignments |
Note
The map-ont
setting was optimized for alignment of older, more error-prone nanopore reads against a reference. These reads were generated using an older R9.4.1 flowcell. Newer nanopore flowcells and basecallers are much more accurate. When aligning reads produced using R10.4.1 flowcells with high- or super-accuracy basecalling, you'll most likely use the lr:hq
setting instead.
Trim amplification primers from the reads in the alignment using iVar.
Commands
ivar trim \
-q 20 \
-i cov_ws_01_sorted.bam \
-b /projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.scheme.bed \
-p cov_ws_01.trimmed
samtools sort -o cov_ws_01.trimmed_sorted.bam cov_ws_01.trimmed.bam
samtools index cov_ws_01.trimmed_sorted.bam
Settings
ivar
Command | Setting | Description |
---|---|---|
trim |
-q |
Minimum sliding window quality threshold |
trim |
-i |
Input alignment file, in bam format |
trim |
-b |
File containing coordiinates of primer binding sites along the reference genome, in BED format. |
trim |
-p |
Prefix of output files |
Review the iVar manual https://andersen-lab.github.io/ivar/html/manualpage.html if you want more detail about commands and settings.
Outputs
File | Description |
---|---|
COV1650_trimmed_sorted.bam | Alignment file in bam binary format with primer sequences trimmed from reads |
COV1650_trimmed_sorted.bam.bai | Alignment index file |
Use iVar to generate a table of variants as well as a consensus genome sequence. We'll first use samtools mpileup
to generate a pileup file of read base calls at each position along the reference. This pileup file will be used as input to iVar.
Commands
Command 3.1: Generate pileup file
samtools mpileup \
-aa \
-A \
-d 0 \
-Q 0 \
--reference /projects/p30002/artic-ncov2019/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.reference.fasta \
cov_ws_01.trimmed_sorted.bam \
> cov_ws_01.pileup.txt
Command 3.2: Call variants based on pileup output
ivar variants \
-p cov_ws_01.variants \
-t 0.03 \
< cov_ws_01.pileup.txt
Command 3.3: Generate consenus sequence based on pileup output
ivar consensus \
-m 10 \
-q 20 \
-t 0 \
-p cov_ws_01.consensus \
< cov_ws_01.pileup.txt
Settings
samtools mpileup
Setting | Description |
---|---|
-aa |
Output base calls at absolutely all genome positions, even those with no aligned reads |
-A |
Do not discard anomalous pairs, i.e. pairs of reads where only one aligned to the reference |
-d |
Maximmum read depth. Setting to '0' disables limit |
-Q |
Minimum read base quality. Setting to '0' disables limit. We're going to filter in iVar. |
--reference |
Reference genome sequence, in fasta format |
ivar
Command | Setting | Description |
---|---|---|
variants |
-p |
Output file prefix |
variants |
-t |
Minimum frequence threshold to call variants. '0.03' means at least 3% of reads covering a position must be different than the reference base to be output. |
consensus |
-m |
Minimum read depth to call consenus |
consensus |
-q |
Minimum quality score threshold to count base |
consensus |
-t |
Minim frequency threshold to call consensus. '0' means that the most common base at a position will be called the consensus. |
consensus |
-p |
Output file prefix |
Outputs:
File | Description |
---|---|
COV1650.variants.tsv | Tab-delmited file showing variants and their read support |
COV1650.consensus.fa | Consensus genome sequence including all variants passing filters |
You can view in the variants.tsv file in your terminal using less COV1650.variants.tsv
or in a text editor or you can open it in Excel. These are the columns in the file:
- REGION: Reference sequence name
- POS: Position in the reference
- REF: Base in the reference sequence
- ALT: Base in the sequenced genome (ALTernate)
- REF_DP: Number of reads with the reference base
- REF_RV: Number of reference base reads aligning in the reverse orientation
- REF_QUAL: Average quality of the reference bases
- ALT_DP: Number of reads with the alternate base
- ALT_RV: Number of alternate bases on reverse reads
- ALT_QUAL: Average quality of the alternate bases
- ALT_FREQ: Frequency (proportion) of the alternate base
- TOTAL_DP: Total read depth at this position
- PVAL: p-value of Fisher's exact test
- PASS: "True" if p-value <= 0.05
The file also includes the columns "GFF_FEATURE", "REF_CODON", "REF_AA", "ALT_CODON", and "ALT_AA" but these will be empty because we didn't give any annotation information to ivar variants
.
You know there is.
mkdir covid_batch
cd covid_batch
# Copy the list of file prefixes to this directory
cp /projects/b1042/OzerLab/workshop/covid/workshop_reads/list.txt .
This script runs all the commands above on a list of prefixes, but also adds a read trimming step prior to alignment. Again, don't forget to add your email address surrounded by single quotes the -e
flag.
perl /projects/p30002/batch_scripts/covid_assembly_batch.pl \
-o . \
-r /projects/b1042/OzerLab/workshop/covid/workshop_reads \
-s list.txt \
-e '<your email address>'
There are some command line tools that can be used to call SARS-CoV-2 lineages from consensus sequence assemblies, but honestly the online tool Nextclade is generally more up-to-date and easier to use.
Getting your data in to Nextclade is a snap, too.
Make sure you are in your covid_batch
directory and all job arrays have finished.
## Clear your terminal first
clear
## Output the consensus.fa sequence results to the terminal.
cat */*.fa
Highlight all the sequences and copy (cmd-v or ctrl-v).
Now navigate to https://clades.nextstrain.org/ in your browser and in the "Provide sequence data" box click "Text" and paste your copied sequences into the box.

Make sure the "SARS-CoV-2" dataset is selected in the "Select reference dataset" box. Sometimes Nextclade will pick something weird like "SARS-CoV-2 (BA.2)" which will make your results a little funky.

Click the "Run" button to view your results. These can be exported as a .csv or .tsv file that can be imported into Excel or other programs as well.

You can look at bam-formatted alignment files right in your Terminal using the command samtools view cov_ws_01.trimmed.sorted.bam | less
. Make sure you're piping the output to less
or else you'll fill your Terminal with thousands of lines. Just a little hint if this happens: you can interrupt a process in your Terminal before it finishes using the Ctrl-C keys.
An easier way to visualize aligments in bam files is using a program like Tablet:
-
Open Tablet and use the "Open Assembly" button in the to left to select your files. To view an alignment file, you'll need to give Tablet two files: 1) the alignment bam file, cov_ws_01.trimmed.sorted.bam (make sure its .bai index file is in the same directory as the .bam file) and 2) the reference fasta sequence we used, SARS-CoV-2.reference.fasta.
-
Once loaded, select the sequence you want to view in the left bar. Since our reference file only had one sequence, there is only one entry in the list.
-
Let's examine one variant location from the variants file. There were three variants identified from postions 28,881 to 28,883. You could drag the window over to this position, but it's easier to use the "Jump to Base" function in the top bar. Select "Jump to Base", type "28881" into the box, and click "Padded Jump".
-
Make sure you have the "Tag Variants" coloring scheme selected in the "Visual" portion of the top bar to quickly see variants in the aligned reads relative to the reference sequence.
Using Tablet can be helpful for quickly visualizing and validating important or suspicious variants in read alignments.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.