full_pipeline - spiralgenetics/biograph GitHub Wiki

As outlined in the Quick Start, the biograph full_pipeline command converts NGS reads into the BioGraph format and then performs variant calling, genotyping, and filtering. This page describes each full_pipeline parameter in detail.

Essential parameters

--biograph or -b: the path to the BioGraph to be created
--reference or -r: the path to the BioGraph reference directory (the examples on this page shorten this to --ref)
--tmp: path to temporary storage. Default: the value of $TMPDIR, or /tmp/ if unset
--threads or -t: number of concurrent threads. Default: one thread per available CPU
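
To see what these defaults would resolve to on a given machine, a minimal shell sketch follows. It assumes that getconf _NPROCESSORS_ONLN reports the available CPU count (nproc is a common alternative on Linux), and it treats an empty TMPDIR the same as an unset one, which may differ slightly from what full_pipeline does internally. The function names are illustrative and not part of BioGraph.

```shell
# Illustrative helpers (not part of BioGraph): report the values that
# --tmp and --threads would presumably default to on this machine.

default_tmp() {
    # Value of $TMPDIR, or /tmp/ when TMPDIR is unset or empty.
    printf '%s\n' "${TMPDIR:-/tmp/}"
}

default_threads() {
    # Number of online CPUs; fall back to 1 if it cannot be determined.
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

echo "tmp:     $(default_tmp)"
echo "threads: $(default_threads)"
```

Supplying --tmp or --threads explicitly is worth considering on shared machines, where the scratch space or CPU allocation available to a job may be smaller than what the host reports.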

Step-specific parameters

--model: the path to the BioGraph model file. This is only required if running the qual_classifier step.
--reads: the input reads in BAM, CRAM, or FASTQ format. This is only required if running the create step. If your reads consist of multiple input files, you should consider using a custom pipeline script. To stream reads on STDIN, use --reads -. See Customizing the BioGraph Pipeline and create for more details.

Are your input reads in CRAM format? If so, the reference that was used to create the CRAM must match the --reference parameter used for the create step. Although the resulting BioGraph is not tied to this reference, the full_pipeline command will also use it for the discovery, coverage, and grm (truvari) steps.

If you wish to use a different reference for analysis, create a custom pipeline script. See Customizing the BioGraph Pipeline for details.

--create, --discovery, --coverage, --grm, --qual_classifier: These optional parameters are passed verbatim to their respective steps. Each must be supplied as a single quoted string. Common parameters such as --tmp or --reference are automatically passed to the steps that require them and don't need to be included here:

(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
  --reads /path/to/my_reads_1.fq.gz \
  --model /path/to/biograph_model-7.0.0.ml \
  --create "--pair /path/to/my_reads_2.fq.gz --max-mem 100" \
  --discovery "--bed /path/to/my_regions.bed" \
  --coverage "--min-insert 500" \
  --grm "-k 60" \
  --qual_classifier "--filter 0.15 --lowqual_sv 0.3"
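
Because each step's options must reach full_pipeline as a single argument, the quoting matters when the string is assembled from shell variables. The sketch below uses a hypothetical helper, show_args, to print one argument per line so the grouping can be checked before running the real command. The paths are the placeholders from the example above.

```shell
# Hypothetical helper (not part of BioGraph): print each argument on its
# own line, in brackets, to show how the shell split the command line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

pair=/path/to/my_reads_2.fq.gz
max_mem=100

# Build the create options as ONE string, and keep it quoted when passing it.
create_opts="--pair ${pair} --max-mem ${max_mem}"

show_args biograph full_pipeline \
    --biograph my.bg \
    --create "$create_opts"
```

The last line of output should show the whole create string inside one pair of brackets. If $create_opts were left unquoted, the shell would split it into four separate arguments and full_pipeline would no longer receive them as a single create parameter string.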

Skip stages with --stop and --resume

You can instruct full_pipeline to stop at any stage. To convert reads to BioGraph format without running further analysis, use --stop create. Since the qual_classifier step won't be run, there is no need to specify a model:

(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
  --reads /path/to/my_reads_1.fq.gz \
  --stop create

You can also resume from any stage. This can be useful for performing additional analysis on an existing BioGraph, or changing a parameter for subsequent analysis.

For example, BioGraph files are reference-agnostic: a reference is used to speed up the create process, but no reference information is stored in the BioGraph itself. For a given set of reads, the output BioGraph will be identical whether the create step uses hs37d5, GRCh38, or any other reference.

If you create a BioGraph using one reference and later decide to run an analysis on a different reference, there is no need to create another BioGraph from reads. Simply supply the path to the existing BioGraph and --resume discovery:

(bg7)$ biograph full_pipeline --biograph my.bg --ref grch38/ \
  --resume discovery \
  --model /path/to/biograph_model-7.0.0.ml
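
The rules for when --reads and --model are needed follow from which steps fall between --resume and --stop. The sketch below works that out for a given range. It assumes the steps run in the order listed in the help text (create, discovery, coverage, grm, qual_classifier) and that both bounds are inclusive. The function names are hypothetical, and this is not how full_pipeline itself validates its arguments.

```shell
# Step order as listed in the full_pipeline help text.
STEPS="create discovery coverage grm qual_classifier"

# step_index NAME: print the 1-based position of a step, or fail if unknown.
step_index() {
    i=1
    for s in $STEPS; do
        if [ "$s" = "$1" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# step_runs NAME RESUME STOP: succeed if NAME lies within RESUME..STOP.
step_runs() {
    n=$(step_index "$1") && r=$(step_index "$2") && e=$(step_index "$3") || return 2
    [ "$n" -ge "$r" ] && [ "$n" -le "$e" ]
}

# required_inputs RESUME STOP: print the inputs this range appears to need.
required_inputs() {
    step_runs create "$1" "$2" && echo "--reads"
    step_runs qual_classifier "$1" "$2" && echo "--model"
    return 0
}

required_inputs create create              # prints: --reads
required_inputs discovery qual_classifier  # prints: --model
```

For a full run (create through qual_classifier) both inputs are printed, which matches the first example on this page, where --reads and --model are both supplied.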

Additional full_pipeline parameters

--keep: keep intermediary files. Use --keep vcf to keep VCFs, --keep jl to keep dataframes, or --keep all to keep everything.
--dry-run: generate a step-by-step script on STDOUT. See Customizing the BioGraph Pipeline for details.
--force: overwrite any existing intermediary files (or the BioGraph if running create).
--help or -h: show all available options

(bg7)$ biograph full_pipeline --help
usage: full_pipeline [-h] -b BG -r REF [--reads READS] [-m ML] [--tmp TMP]
                     [-t THREADS] [--keep {all,jl,vcf}] [--dry-run] [--force]
                     [--resume RESUME] [--stop STOP] [--create CREATE]
                     [--discovery DISCOVERY] [--coverage COVERAGE] [--grm GRM]
                     [--qual_classifier QUAL_CLASSIFIER]

Run the standard BioGraph pipeline for a single sample: create, discovery, 
coverage, grm, qual_classifier

optional arguments:
  -h, --help            show this help message and exit
  -b BG, --biograph BG  BioGraph file (will be created if running the create
                        step)
  -r REF, --reference REF
                        Reference genome folder, in BioGraph reference format
  --reads READS         Input reads for BioGraph create, if run (fastq, bam,
                        cram)
  -m ML, --model ML     BioGraph classifier model for qual_classifier, if run
  --tmp TMP             Temporary directory (/tmp)
  -t THREADS, --threads THREADS
                        Number of threads to use (32)
  --keep {all,jl,vcf}   Keep intermediate dataframes/VCFs or all
  --dry-run             Run a preflight check, print all steps to be run, and
                        exit
  --force               Overwrite any existing intermediary files

Pipeline Arguments:
  Control which section of the pipeline is run

  --resume RESUME       Step at which to resume the pipeline
  --stop STOP           Last step of the pipeline to run

Individual step arguments:
  Specify any additional parameters to be passed to steps. Must be a single 
  "string"

  --create CREATE       create parameters
  --discovery DISCOVERY
                        discovery parameters
  --coverage COVERAGE   coverage parameters
  --grm GRM             truvari grm parameters
  --qual_classifier QUAL_CLASSIFIER
                        qual_classifier parameters
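
Since --dry-run writes the generated steps to STDOUT, they can be captured in a file for review or editing. The sketch below wraps that in a small hypothetical helper, save_script, which redirects any command's output to a file and refuses to overwrite an existing one. The biograph invocation appears only as a comment because its paths are placeholders.

```shell
# Hypothetical helper (not part of BioGraph):
#   save_script OUTFILE COMMAND [ARGS...]
# runs COMMAND and stores its STDOUT in OUTFILE, without clobbering.
save_script() {
    out=$1
    shift
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi
    "$@" > "$out"
}

# Intended use, with BioGraph installed and real paths substituted:
#   save_script my_pipeline.sh biograph full_pipeline \
#       --biograph my.bg --reference hs37d5/ \
#       --reads /path/to/my_reads_1.fq.gz \
#       --model /path/to/biograph_model-7.0.0.ml \
#       --dry-run
```

The saved file can then be inspected and adapted as described in Customizing the BioGraph Pipeline.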
