Variant calling using PacBio HiFi CCS data - PacificBiosciences/CoSA GitHub Wiki

Last Updated: 03/10/2020

Python Requirements

Python 3.7+
BioPython
PyVCF
mappy and panda (if using pbaa)

We recommend looking into Anaconda for easily managing Python packages.

PacBio Tool Prerequisites

ccs
lima
optional - pbaa if using pbaa for variant calling.

These PacBio tools can be obtained either through SMRTLink or pbbioconda

Other Prerequisites

CoSA - this repository. Follow the install directions.
samtools
bamtools
optional - bcftools, if using bcftools for variant calling.

(1) Generate CCS data
(2) (if multiplexed) de-multiplex different patient samples
(3) Trim amplicon primers
(4) Variant calling using (a)bcftools, (b)DeepVariant, (c)pbaa
(5) Generating consensus sequence using VCFCons
(6) Expected differences if using SMRTLink vs pbbioconda, and other tool versions
(7) Assign lineages using Pangolin or Nextclade

Misc useful tools:

Downsampling reads per amplicon
Generating per amplicon coverage BED file

1. Generate CCS data

ccs movie.subreads.bam ccs.bam

If using SMRTLink: Run "Circular Consensus Sequencing (CCS)" application. The output file is <movie>.hifi_reads.bam.

2. (if multiplexed) de-multiplex different patient samples

Depending on how the samples were multiplexed, use lima to demultiplex accordingly. Starting in step (3), you should be working on single patient samples as input.

If you are using the M13 barcodes, for example, the fasta sequences may look like this:

>bc1001_M13F
CACATATCAGAGTGCGGTAAAACGACGGCCAGT
>bc1002_M13F
ACACACAGACTGTGAGGTAAAACGACGGCCAGT
...
>bc1066_M13R
TATATGCTCTGTGTGACAGGAAACAGCTATGAC
>bc1067_M13R
CTCTATATATCTCGTCCAGGAAACAGCTATGAC

NOTE: please make sure in this (first) lima step, you are providing barcode sequences that include both the M13 universal sequence and the 16bp barcodes. DO NOT use the SARS-CoV-2 amplicon primer sequences in this step.

The exact lima parameters may need to be adjusted, currently we recommend:

lima --num-threads 16 \
     --split-bam-named \
     --different \
     --ccs \
     --min-score-lead 10 \
     --min-score 80
     ccs.bam M13bacodes.fasta demux.bam

We recommend removing --peek-guess as this can cause issues in patient samples with low yield. Use --min-score-lead 10 --min-score 80 to help make sure the barcodes are truly called correctly and reads are assigned to the right patient samples.

After running lima to demultiplex the barcodes, you may end up with multiple F/R barcode pairs relating to a single patient. To make these demux ccs bam files into separate directories by patient, make a tab-delimited, sample metadata file sample_barcodes.metadata.txt like this:

BarcodeName Sample
bc1001_M13F--bc1066_M13R patient1
bc1002_M13F--bc1067_M13R patient1
bc1003_M13F--bc1068_M13R patient2
bc1004_M13F--bc1069_M13R patient2

Then run the script combine_demux_by_patient.py from CoSA:

combine_demux_by_patient.py \
    sample.demux \
    sample_barcodes.metadata.txt \
    [--fofn_only]

use --fofn_only will generate just a ccs.fofn text file showing the location of the ccs bam files, instead of actually creating a new, merged ccs bam file per patient.

After running this script you should get two directories:

patient1
└── demux.bam (or demux.fofn)
patient2
└── demux.bam (or demux.fofn)

If using SMRTLink: Run "Demultiplex Barcodes" application. Set Minimum Barcode Score in Output BAM and Minimum Barcode Score of Output Datasets (XML) both to 80, which is different from the default. See Advanced Options screenshot example. As part of running the SMRTLink application, you will be asked to provide a barcode-to-sample assignment via the Assign Bio Sample Names to Barcodes field, so you do not need to run CoSA script combine_demux_by_patient.py. See File Download screenshot example.

After this step, everything should be analyzed on a per-patient sample basis.

3. Trim amplicon primers

Trimming away amplicon primers help with reducing false positive variant calls.

Run lima to identify and trim the amplicon primers. If you had a multiplexed patient sample, this would be your second time running lima - but you are now running lima on a per-patient-sample basis, where reads have been stripped of barcodes including the M13 universal sequence. The Freed primers are provided as an amplicon primer fasta example.

lima -j 30 \
     --neighbors \
     --ccs \
     --min-score-lead 10 \
     --min-score 80 \
     --window-size-multi 1.1 \
     <demux.bam or ccs.fofn, etc> \
     amplicon_primers.fasta \
     output.bam

Option 1. Use --neighbors to indicate distinct F/R primer pairs that are not re-used

Place the F/R primers adjacent in the primer fasta file and call lima with --neighbors. In this way, only valid amplicons (ex: 3F--3R) will be output. We've noticed non-valid amplicons (ex: 3F--10R) tend to be artifacts. To use --neighbors, your primer fasta must list F/R in adjacent manner like below:

>SARSCoV_1200_1_LEFT
ACCAACCAACTTTCGATCTCTTGT
>SARSCoV_1200_1_RIGHT
GGTTGCATTCATTTGGTGACGC
>SARSCoV_1200_10_LEFT
TTTACCAGGAGTTTTCTGTGGTGT
>SARSCoV_1200_10_RIGHT
TGGGCCTCATAGCACATTGGTA
...

Option 2. Use --different if some primer sequences are re-used

If you are using, say, ARTIC primers that have 1-to-many primer pairings, run lima with --different. Later, when you run subsample_ampclions.py, you can provide a whitelist of valid primer pairs using --valid_pairs_file.

4. Variant Calling

There are several options for variant calling. All of them will produce a VCF file.

caller	input	output	calls large (>100bp) indels?
bcftools	aligned BAM	VCF	N
DeepVariant	aligned BAM	VCF	N
pbaa	fastq & guide ref	VCF (after conversion)	Y

4(a). variant calling using bcftools

bcftools accepts an aligned BAM pileup file and produces a VCF file.

To produce the input BAM file, use the output from the previous step, which is a per-patient, primer-trimmed CCS BAM file.

bamtools convert -format fastq -in per_patient.primer_trimmed.bam > per_patient.primer_trimmed.fastq
minimap2 -a NC_045512.fa per_patient.primer_trimmed.fastq > sample.aligned.sam
samtools view -bS sample.aligned.sam | samtools sort > sample.aligned.bam
samtools index sample.aligned.bam

samtools mpileup --min-BQ 1 -f NC_045512.fa -s sample.aligned.bam > sample.aligned.bam.mpileup
bcftools mpileup -f NC_045512.fa sample.aligned.bam | bcftools call -mv -Ov -o sample.vcf

NOTE: bcftools can produce low-quality/frequency variant calls. Use <a href="#consensus>VCFCons in the next section to filter.

An example bash script for running bcftools combined with VCFCons.py is run_VCFCons_bcftools.sh

4(b). variant calling using DeepVariant

DeepVariant accepts an aligned BAM file and produces a VCF file. You will probably want to store a local .sif file before running this for the first time. If you run singularity pull docker://google/deepvariant:1.1.0 once, you should get a local .sif file and you can point to that in your bash script below.

#!/bin/bash

set -x

#BIN_VERSION=1.1.0
# before you run this script, run the following from the command line
# singularity pull docker://google/deepvariant:1.1.0

time singularity exec --bind $PWD,/scratch:/tmp,/usr/lib/locale/ \
  <local_path>/deepvariant_1.1.0.sif \
    <path_to>/bin/run_deepvariant \
    --model_type PACBIO \
    --ref NC_045512.fa \
    --reads sample.aligned.bam \
    --output_vcf sample.vcf \
    --num_shards 30

set +x

An example bash script for running VCFCons.py after running DeepVariant is run_VCFCons_deepvariant.sh

4(c). variant calling using pbaa

pbaa accepts an input fastq and a guide reference, and outputs a cluster sequence. We need additional scripts to convert the cluster sequence to VCF format so we can run VCFCons.py.

The guide reference for pbaa must be a non-overlapping portions of the ref genome where each amplicon is designed based upon. Since amplicons do overlap, you can pick a middle point between the amplicon intersection to define guides. It's ok if guides don't cover the whole amplicon - they are guides to place the input sequences. See the Freed primer set-based pbaa guide reference as an example.

Make sure your guide has a fasta index by running the following command once:

samtools faidx sarscov2_guide_for_pbaa.fasta

We can now run pbaa. Note that unlike bcftools and DeepVariant, pbaa does not take an input aligned BAM file. Instead we convert the primer-trimmed (post-lima) BMA file into FASTQ format and run pbaa on it. We do, however, provide the original ref genome (NC_045512.2.fasta) after running pbaa through two scripts to convert the output into VCF.

bamtools convert -format fastq -in per_patient.primer_trimed.bam > sample.fastq
samtools faidx sample.fastq

pbaa cluster --min-cluster-read-count 2 --trim-ends 0 \
   sarscov2_guide_for_pbaa.fasta sample.fastq sample_pbaa

consensusVariants.py -r sample_pbaa \
                     -p sample_pbaa \
                     --read_info sample_pbaa_read_info.txt \
                     --hifiSupport sample.fastq \
                     NC_045512.2.fasta \
                     sample_pbaa_passed_cluster_sequences.fasta

pbaa2vcf.py --passOnly -s barcode \
            -o sample_pbaa.vcf \
            sample_pbaa_alleles.csv \
            sample_pbaa_variants.csv \
            NC_045512.2.fasta

After you obtain the output VCF sample_pbaa.vcf you can run it through VCFCons.py.

An example bash script for running pbaa is run_pbaa.sh. After running this, you can follow up with VCFCons.py as shown in run_VCFCons_pbaa.sh

5. Generating consensus sequence using VCFCons

VCFCons.py accepts an input VCF and outputs a consensus sequence. See this figure for a visual understanding of VCFCons.

usage: VCFCons.py [-h] [--input_depth INPUT_DEPTH] [--input_vcf INPUT_VCF]
                  [-c MIN_COVERAGE] [-f MIN_ALT_FREQ] [-q MIN_QUAL] --vcf_type
                  {pbaa,deepvariant,CLC,bcftools}
                  ref_fasta prefix

positional arguments:
  ref_fasta             Reference fasta (should be Wuhan ref)
  prefix                Sample prefix

optional arguments:
  -h, --help            show this help message and exit
  --input_depth INPUT_DEPTH
                        (optional) Input depth file, if not given, then
                        <prefix>.aligned.bam.depth is expected.
  --input_vcf INPUT_VCF
                        (optional) Input VCF file, if not given, then
                        <prefix>.VCF is expected.
  -c MIN_COVERAGE, --min_coverage MIN_COVERAGE
                        Minimum base coverage to call a base (default: 4)
  -f MIN_ALT_FREQ, --min_alt_freq MIN_ALT_FREQ
                        Minimum variant frequency (default: 0.5)
  -q MIN_QUAL, --min_qual MIN_QUAL
                        Minimum QUAL cutoff (default: 100)
  --vcf_type {pbaa,deepvariant,CLC,bcftools}
                        VCF format info

If you run with the command VCFCons.py NC_045512.fa sample, it expect to see the following files:

sample.vcf            # VCF output from one of the variant callers
sample.aligned.bam.depth      # depth file of the input aligned to ref genome

If your file names are a bit different for the depth and VCF file, use --input_depth and --input_vcf to specify it.

You can get the depth file using samtools depth.

For each position in the genome, VCFCons will output bases according to the following:

If the position is covered by less than --min_coverage reads, output 'N'
If the position has a variant call and the variant frequency is greater than --min_alt_freq, output ALT base; otherwise output REF
If the position has a variant call with QUAL less than --min_qual, output REF instead

The output of VCFCons.py contains:

sample.vcfcons.vcf            # VCF file with low qual variants removed
sample.vcfcons.fasta          # a single consensus sequence, useful for data submission 
sample.vcfcons.frag.fasta     # consensus sequence broken up by 'N' regions, useful for mapping

Expected differences if using SMRTLink vs pbbioconda, and other tool versions

Below documents observed differences and how they may affect the results:

Using minimap2 vs pbmm2 to map reads to the genome

If using pbmm2 for alignment, we recommend using the parameter set:

pbmm2 align --sort --preset HIFI NC_045512.2.fasta \
      per_patient.primer_trimmed.fastq > sample.aligned.bam

The --preset HIFI uses a particular set of parameters different from minimap2 default:

CCS or HiFi : -k 19 -w 10 -u -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50  -r 2000   -L 0.5 -g 5000

As such, using minimap2 vs pbmm2 can result in different variant calls (if using bcftools, DeepVariant, Freebayes, or anything that uses an aligned BAM to call variants). It can also result in difference of low covered bases being presented as "N"s instead of REF/ALT bases.

Tool versions to note

The following tools can have different versions depending on when you last installed/updated them. As such, they can be expected to produce slightly different results.

pbmm2         // PacBio tool, for mapping reads to the genome
lima          // PacBio tool, for demultiplexing barcodes and amplicon primers
pbaa          // PacBio tool, for variant calling
minimap2      // for mapping reads to the genome
samtools      // for creating aligned BAM depth file to generate 'N', REF, or ALT bases on VCFCons consensus
bcftools      // for variant calling
DeepVariant   // for variant calling

Assign lineages using Pangolin or Nextclade

You can use nextclade and pangolin to assign lineages to the SARS-CoV2 genomic sequences.

They can be downloaded and installed in their own conda enviroment

# for nextclade
%conda activate nextclade
%nextclade -i inp.fasta -t output.tsv
 
# for pangolin
%conda activate pangolin
%pangolin --outfil output_resport.csv  inp.fasta

It is important to regularly update both nextclade and pangolin since algorithm as well as the strain information are routinely updated.

Other misc useful scripts

Downsampling reads by amplicon

This is optional and only useful, for, say, if certain amplicons have >10000-fold coverage.

Informative output files from lima include output.lima.summary and output.lima.counts. It is the output.lima.counts file that the CoSA script subsample_amplicons.py uses to downsample certain amplicons.

subsample_amplicons.py \
   output \
   [--valid_pairs_file valid_pairs.txt] \
   --subsample_size 1000 > cmd

If you had run lima with --different and want to indicate which primer pairs are valid, use the --valid_pairs_file option and provide a tab-delimited like this valid_pairs.txt file.

This will generate a cmd file that has the commands for subsampling each amplicon bam file. For example:

samtools view -b -s 0.24 output.nCoV_3F--nCoV_3R.bam | bamtools convert -format fastq > output.nCoV_3F--nCoV_3R.subsampled.fastq
samtools view -b -s 0.10 output.nCoV_5F--nCoV_5R.bam | bamtools convert -format fastq > output.nCoV_5F--nCoV_5R.subsampled.fastq

Execute the cmd file, then concatenate the outputs into a single FASTQ file.

bash cmd
cat output*subsampled.fastq > subsampled.ccs.Q20.fastq

Generating per amplicon coverage BED file

You will likely want to see a coverage BED file that is before the subsampling. To do this, use the output.lima.counts output file after running lima with a BEDPE file that gives the pairings of primers for each expected amplicon.

generate_cov_bed_from_lima_counts.py \
     output.lima.counts \
     artic.expected_amplicon.bedpe \
     --bedpe > output.lima.counts.bedgraph

To generate a BEDGRAPH file that you can load into IGV to show the per-amplicon coverage. See here for an example for amplicon BEDPE