Customizing the BioGraph Pipeline - spiralgenetics/biograph GitHub Wiki

The entire BioGraph pipeline can be run from start to finish using biograph full_pipeline. Under the hood, the pipeline consists of several steps, most of which include several tunable options.

The following steps are run in order:

  1. biograph create converts NGS reads to the BioGraph format.
  2. biograph discovery calls variants against a reference with the greatest possible sensitivity.
  3. biograph coverage examines each putative variant and computes read coverage.
  4. truvari anno grm is run to assess graph reference mappability.
  5. biograph qual_classifier assigns a genotype and quality score to each variant, optionally tagging or filtering low quality variants.

Providing additional options

There are two methods for supplying additional options to each step:

Use --dry-run to create a script

You can create a shell script that contains the commands that are run by full_pipeline by using the --dry-run option. This script can then be edited to suit your own analysis needs.

(bg7)$ biograph full_pipeline ... --dry-run > my_pipeline.sh
(bg7)$ cat my_pipeline.sh
bgbinary create --reads my_reads.bam --tmp /tmp --ref my_ref/ --out my.bg --threads 48
bgbinary discovery --tmp /tmp --in my.bg --ref my_ref/ --threads 48 | vcf-sort | bgzip > my.bg/analysis/discovery.vcf.gz && tabix my.bg/analysis/discovery.vcf.gz
biograph coverage -b my.bg -r my_ref/ -v my.bg/analysis/discovery.vcf.gz | vcf-sort | bgzip > my.bg/analysis/coverage.vcf.gz && tabix my.bg/analysis/coverage.vcf.gz
truvari anno grm -r my_ref/source.fasta -i my.bg/analysis/coverage.vcf.gz -o my.bg/analysis/grm.jl
biograph qual_classifier --grm my.bg/analysis/grm.jl --df my.bg/analysis/df.jl -v my.bg/analysis/coverage.vcf.gz -m biograph_model-6.0.5.ml --tmp /tmp -t 48 | vcf-sort | bgzip > my.bg/analysis/results.vcf.gz && tabix my.bg/analysis/results.vcf.gz

The create and discovery steps call a program called bgbinary, a component of BioGraph that implements several tools in C++ and is installed in your environment alongside the Python components. The biograph create and biograph discovery commands are Python wrappers around the corresponding bgbinary commands, and are functionally identical to them.
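Since the dry-run output is an ordinary shell script, it can be edited with any text tool before you run it. As a minimal sketch, the following uses sed to cap the thread count at 16 for a shared machine (the placeholder line stands in for a real my_pipeline.sh so the snippet runs standalone; drop it in practice):

```shell
# Create a stand-in script if one doesn't exist (illustration only).
[ -f my_pipeline.sh ] || echo 'bgbinary create --reads my_reads.bam --threads 48' > my_pipeline.sh

# Edit the generated script in place: lower --threads from 48 to 16
# in every step, then confirm the change.
sed -i 's/--threads 48/--threads 16/g' my_pipeline.sh
grep -- '--threads' my_pipeline.sh
```

The same approach works for swapping the --tmp path or redirecting outputs; run the edited script with bash my_pipeline.sh when you are satisfied with it.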

Use the extra parameters option for each step

You can pass additional options to any full_pipeline step by specifying an extra parameter option. This example uses --create and --discovery to include several options:

(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
  --reads /path/to/my_reads_1.fq.gz \
  --model /path/to/biograph_model-7.0.0.ml \
  --create "--pair /path/to/my_reads_2.fq.gz --max-mem 100" \
  --discovery "--bed /path/to/my_regions.bed"

You can supply additional parameters to any step using the appropriate option for each step (--create, --discovery, --coverage, --grm, or --qual_classifier).

Common optional parameters

BioGraph uses common parameter conventions for specifying options. They may be specified in long format (--reference) or abbreviated to the shortest non-ambiguous form (--ref). Some options may also be specified by a single letter (-r). You can get a list of available options by appending --help (or -h) to any command, or consult the online docs.

Some options are common to many steps:

  • --tmp should specify a path to large, fast, local storage. The default is the value of the $TMPDIR environment variable, or /tmp/ if unset. See Optimizing Performance for details.
  • --threads indicates the number of threads to use. The default (auto) will use one thread per available CPU. This is optimal for most environments, but see Optimizing Performance for how to fine-tune this parameter on memory constrained systems.
  • --reference (-r) selects the genetic reference, in BioGraph reference format.

The rest of this section will walk through each step in detail.

Step 1: create

The create step converts reads from BAM, CRAM, or FASTQ format into the BioGraph format. For a typical human sample (150bp paired reads at 30x coverage), a 120GB BAM file is converted into a 29GB BioGraph in approximately one hour on a 64 core machine.

The biograph create command requires the input reads, an output biograph path, and the path to the BioGraph reference. If your /tmp/ directory doesn't have sufficient space, include a path to larger temporary storage as well.

(bg7)$ biograph create --reads my_reads.bam --tmp /tmp --ref my_ref/ --out my.bg

The resulting BioGraph consists of a directory containing the read overlap graph, coverage data, sample metadata, and conversion logs. See What is Inside a BioGraph? for a detailed breakdown of all of the files contained in the BioGraph directory.

The biograph create command has many more options for optimizing performance and tailoring the results to your specific dataset. See create for more details.

Step 2: discovery

The biograph discovery step takes a BioGraph and a reference as inputs, and produces a VCF file of putative variants. Variant calling takes about half an hour on a typical 30x dataset, and writes unsorted VCF to STDOUT by default. The result should be sorted, compressed, and indexed for subsequent steps.

Intermediary files and analysis results are typically kept in the analysis/ folder inside the BioGraph.

(bg7)$ biograph discovery --in my.bg --ref hs37d5/ | vcf-sort | bgzip > \
    my.bg/analysis/discovery.vcf.gz && tabix my.bg/analysis/discovery.vcf.gz

The reference used for calling does not need to match the reference used for the create step:

(bg7)$ biograph discovery --in my.bg --ref grch38/ | vcf-sort | bgzip > \
    my.bg/analysis/discovery.vcf.gz && tabix my.bg/analysis/discovery.vcf.gz

The VCF file created by the biograph discovery command will contain many low quality variants and should not be used for analysis directly. It is intended to be used as the input for the next step in the BioGraph pipeline.

See discovery for additional variant calling options.

Step 3: coverage

The biograph coverage step takes a BioGraph, reference, and the discovery VCF as inputs. It produces a VCF containing coverage metrics over the variant graph. These metrics include depth over graph nodes and edges, the overlap of reads used to traverse sequences in the graph, and an estimated genotype based on the proportional coverage of reference and alternate edges. They inform the qual_classifier step, which determines variant quality and assigns a higher-accuracy genotype.

As with the discovery step, unsorted VCF is streamed to STDOUT by default.

(bg7)$ biograph coverage -b my.bg -r my_ref/ -v my.bg/analysis/discovery.vcf.gz \
  | vcf-sort | bgzip > my.bg/analysis/coverage.vcf.gz \
  && tabix my.bg/analysis/coverage.vcf.gz

Coverage generation typically takes ten to fifteen minutes. Additional coverage options are covered in coverage.

Step 4: grm

After coverage, truvari anno grm is run. GRM, or graph-reference mappability, is a procedure that takes kmers centered on edges in a variant graph and remaps them to the reference. GRM then collects metrics such as the number of hits for each kmer through the reference and their mapping quality.

This step produces a dataframe that is used by the qual_classifier step.

(bg7)$ truvari anno grm -r my_ref/source.fasta -i my.bg/analysis/coverage.vcf.gz \
  -o my.bg/analysis/grm.jl

Step 5: qual_classifier

The biograph qual_classifier step assigns a genotype and quality score to each variant, optionally tagging or filtering low quality variants. It takes the coverage VCF, grm dataframe, and a model file as inputs and produces the final filtered and tagged results VCF. It also creates an intermediary dataframe representation of the coverage VCF.

(bg7)$ biograph qual_classifier --grm my.bg/analysis/grm.jl \
  --df my.bg/analysis/df.jl -v my.bg/analysis/coverage.vcf.gz \
  -m biograph_model-7.0.0.ml | vcf-sort | bgzip > my.bg/analysis/results.vcf.gz \
  && tabix my.bg/analysis/results.vcf.gz

As with the other biograph steps, unsorted VCF is streamed to STDOUT by default. It is typically sorted, compressed, and indexed for consumption by downstream tools.

Final cleanup

The intermediary files generated by each step in the pipeline are automatically removed by full_pipeline at the end of the analysis. These files can be quite large and are generally not required. However, you may wish to keep some or all intermediaries for QC or other purposes.

The --keep vcf parameter will keep intermediary VCFs, --keep jl will keep all dataframes, and --keep all will keep everything.

Cleanup is not performed if you are running a custom script created by --dry-run. Be sure to remove any files you don't need from the analysis/ folder inside your BioGraph.
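When cleaning up after a custom script, the intermediary files to remove are the ones produced by the earlier steps of this walkthrough. A minimal sketch, assuming the file layout shown above (adjust paths to match your own run); rm -f quietly skips any file that was never created:

```shell
# Remove intermediary VCFs, their tabix indexes, and dataframes,
# keeping the final analysis/results.vcf.gz.
rm -f my.bg/analysis/discovery.vcf.gz my.bg/analysis/discovery.vcf.gz.tbi \
      my.bg/analysis/coverage.vcf.gz my.bg/analysis/coverage.vcf.gz.tbi \
      my.bg/analysis/grm.jl my.bg/analysis/df.jl
```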

Archiving the results

The resulting BioGraph folder contains about 30GB of data for a 30x sample, and slightly more for higher coverage samples. The analysis/results.vcf.gz file is considerably smaller, on the order of a couple of hundred megabytes compressed. The log files and stats in the qc/ folder are only a few megabytes.

If your environment has ample storage, it typically makes sense to keep the BioGraph for further analysis (for joint genotyping calls in a larger population, calling against a different reference, or using different parameters). If your environment has constrained storage space but sufficient compute resources, it may be advantageous to keep only the analysis/ and qc/ folders once your initial analysis is complete, and recreate the BioGraph from reads as needed.
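In the storage-constrained case, one way to do this is to archive only the analysis/ and qc/ folders and delete the rest of the BioGraph. A sketch, with illustrative paths (the mkdir line stands in for a real BioGraph so the snippet runs standalone; drop it in practice):

```shell
# Stand-in for a completed BioGraph directory (illustration only).
mkdir -p my.bg/analysis my.bg/qc

# Archive the small folders worth keeping, then remove the bulky
# BioGraph itself. Recreate it from reads later if needed.
tar -czf my.bg-results.tar.gz my.bg/analysis my.bg/qc
rm -rf my.bg
ls -lh my.bg-results.tar.gz
```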

Commands in Depth

For more details on biograph and other commands, consult these pages:

  • full_pipeline
  • create
  • discovery
  • coverage
  • grm
  • qual_classifier
  • reference
  • stats
  • refhash


Next: What is Inside a BioGraph?