discovery - spiralgenetics/biograph GitHub Wiki

The biograph discovery command takes a BioGraph and a reference as inputs and produces a VCF file of putative variants. The read coverage of each variant is a rough estimate, which is refined in later steps. Variant calling typically takes about an hour.

(bg7)$ biograph discovery --in my.bg --ref hs37d5/ --out discovered.hs37d5.vcf

The reference used for calling does not need to match the reference used for the create step:

(bg7)$ biograph discovery --in my.bg --ref grch38/ --out discovered.grch38.vcf

The VCF file created by the biograph discovery command will contain many low-quality variants and should not be used for tertiary analysis. It is intended to be used as the input for the next step in the BioGraph pipeline.

The discovery step outputs unsorted, uncompressed VCF, which must be sorted and compressed for subsequent steps. Omitting the --out parameter will write the output to STDOUT for easy streaming:

(bg7)$ biograph discovery --in my.bg --ref grch38/ \
  | vcf-sort | bgzip > discovery.vcf.gz && tabix discovery.vcf.gz

Using BED regions for faster calling

The biograph discovery command will attempt to assemble reads across the entire reference. To constrain calling to a specific genomic region, use the --bed option. The BED file should contain the regions you wish to include:

$ cat chr1.bed
chr1    10000    248946422
$ biograph discovery --in my.bg --ref grch38/ --out chr1.vcf --bed chr1.bed

While BioGraph can call variants on the entire genome, you can save significant time by using a BED file that includes only regions of interest. The references available at s3://spiral-public/references/ include a regions.bed file that contains only the autosomes and sex chromosomes. It specifically excludes alt contigs, decoys, mitochondria, telomeres, centromeres, and known regions of heterochromatin. If your downstream analysis will not use results from these regions, running discovery with --bed regions.bed can save substantial processing time.

(bg7)$ biograph discovery --in my.bg --ref grch38/ --out genome.vcf \
  --bed grch38/regions.bed

Specifying a minimum overlap

The minimum overlap parameter --min-overlap specifies the minimum fraction of a read that must overlap another to be considered for assembly. The default is .7, or 70% of a read's length. For example, if your reads are 100 bases long, then assemblies will be created with reads that overlap each other by 70 bases or more. For 150 base reads, reads must overlap by 105 bases or more.

Reducing the --min-overlap will result in more potential assemblies. This may increase sensitivity as well as the likelihood of false positives due to erroneous assemblies and will require additional processing time. This can be useful for low coverage datasets when the default overlap is too stringent and false positives can be tolerated.

Increasing the --min-overlap will result in more strict assemblies, with more overlapping bases required. This can reduce the false discovery rate but may also reduce sensitivity. It is useful for high coverage datasets where the expected read overlap is higher, or when the false discovery rate must be minimized and a loss of sensitivity can be tolerated.

The default of 0.7 provides the best balance for most use cases.

Specifying ploidy

biograph discovery will report variants on up to two alleles by default. If working with non-diploid datasets, specify a different value with --max-ploids.

Getting help

You can see all available options by specifying the --help switch.

(bg7)$ biograph discovery --help
discovery version 7.0.0

Usage: discovery [OPTIONS] --in <biograph> --ref <reference path> --out <vcf name>

Call variants on a BioGraph.

General Options:
  -h [ --help ]            Display this help message
  --help-all               Show help for advanced options
  --tmp arg                Basepath to temporary space. Defaults to a random
                           directory under /tmp/
  --in arg                 Input BioGraph to process
  --ref arg                Reference directory
  --out arg (=-)           Output VCF file
  --sample arg             Sample ID (Accession ID, uuid, or coverage file) to
                           process. Only required if the BioGraph contains
                           multiple samples.
  -f [ --force ]           Overwrite existing output file

Variant Calling Options:
  --bed arg                If specified, only call in the regions contained in
                           the given BED file.
  --min-overlap arg (=.7)  Minimum overlap required between reads when tracing
                           paths, as a fraction of the read length (0.5-0.9
                           recommended)
  --max-ploids arg (=2)    Maximum number of alleles to output
⚠️ **GitHub.com Fallback** ⚠️