study export - spiralgenetics/biograph GitHub Wiki

The biograph vdb study export command performs the following actions:

  • All variant entries in the study at the given checkpoint are merged, creating a single representation for every unique chrom + pos + ref + alt.
  • Merged variants are optionally annotated with a desired vdb annotation.
  • The merged and optionally annotated results are exported to a local VCF file.
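To make the merge key concrete, the following is a minimal sketch of collapsing entries on chrom + pos + ref + alt using awk. It is an illustration only, not biograph's implementation; the helper name `unique_variant_keys` and the sample records are made up for this example.

```shell
# Hypothetical helper: list each unique chrom+pos+ref+alt key once,
# followed by the number of input entries that share it.
unique_variant_keys() {
  awk -F'\t' '
    /^#/ { next }                      # skip VCF header lines
    {
      key = $1 ":" $2 ":" $4 ":" $5    # CHROM, POS, REF, ALT columns
      if (!(key in count)) order[++n] = key
      count[key]++
    }
    END { for (i = 1; i <= n; i++) print order[i] "\t" count[order[i]] }
  '
}

# Three entries from different samples; two of them describe the same variant.
printf 'chr1\t100\t.\tA\tT\nchr1\t100\t.\tA\tT\nchr1\t100\t.\tA\tG\n' | unique_variant_keys
```

The two identical A>T entries collapse into one key, while the A>G entry at the same position stays separate because its alt allele differs.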

The duration of the merge process depends on the number of variants in the study, scaling at about 30 to 40 seconds per sample for WGS. Smaller studies merge in a minute or two, while a large study of 600 samples should merge in about 15 minutes.

Sorted, uncompressed VCF is written to STDOUT by default. This may be piped through bgzip for compression, or written directly to a file using --output / -o.

$ biograph vdb study export my_study -o my_study.vcf

$ biograph vdb study export my_study | bgzip > my_study.vcf.gz

Exporting an earlier checkpoint

The latest checkpoint is used for export by default. Use --checkpoint to export an earlier checkpoint. This can be useful for comparing the merged output before and after a filtering step, without the need to apply additional filters.

$ biograph vdb study show my_study
study_name: my_study
      created_on: 2021-05-18 13:02:51
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
   2: include GT = '0/1'
   3: include SVTYPE = 'INS'

sample_name      variant_count
HG002            16447
HG003            15883
HG004            18035

# export het SV inserts only
$ biograph vdb study export my_study

# export all hets
$ biograph vdb study export my_study --checkpoint 2

# export all variants
$ biograph vdb study export my_study --checkpoint 1
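One simple way to compare checkpoints is to count the variant records in each exported file. The sketch below assumes each checkpoint was written to its own file with -o; the helper name `count_variants` and the file names are hypothetical.

```shell
# Hypothetical helper: count variant records (non-header lines) in a VCF file.
count_variants() {
  awk '!/^#/ { n++ } END { print n + 0 }' "$1"
}

# Example usage, assuming both checkpoints were exported beforehand:
#   biograph vdb study export my_study --checkpoint 1 -o all.vcf
#   biograph vdb study export my_study --checkpoint 2 -o hets.vcf
#   echo "all: $(count_variants all.vcf)  hets: $(count_variants hets.vcf)"
```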

Choosing export fields

If not all FORMAT fields are required, use the --fields option to specify a colon-separated list of desired fields. Using fewer fields results in a smaller VCF and faster export times. The GT field is always included in the output, even if it is not specified.

# Only include genotype, depth, and allelic depth
$ biograph vdb study export my_study --fields GT:DP:AD
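To show the effect of a reduced field list on the VCF itself, here is a rough awk sketch that trims the FORMAT column and every sample column of an existing VCF. It is not how biograph implements --fields, and the helper name `trim_format` is made up. Requesting the fields at export time is preferable, since that also shortens the export.

```shell
# Hypothetical helper: keep only the listed FORMAT keys (plus GT) in VCF data lines.
# $1 is a colon-separated key list such as DP:AD. Reads VCF from stdin.
trim_format() {
  awk -v keep="GT:$1" '
    BEGIN {
      FS = OFS = "\t"
      n = split(keep, k, ":")
      for (i = 1; i <= n; i++) want[k[i]] = 1
    }
    /^#/ { print; next }               # pass header lines through unchanged
    {
      nf = split($9, fmt, ":")         # column 9 is FORMAT
      for (c = 9; c <= NF; c++) {      # FORMAT column, then every sample column
        split($c, val, ":")
        out = ""; sep = ""
        for (i = 1; i <= nf; i++)
          if (fmt[i] in want) { out = out sep val[i]; sep = ":" }
        $c = out
      }
      print
    }
  '
}

# One record with four FORMAT fields, reduced to genotype, depth, and allelic depth.
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT:DP:AD:GQ\t0/1:30:10,20:99\n' | trim_format DP:AD
```

Kept fields stay in their original FORMAT order, and GT is retained even though it was not requested, mirroring the behavior described above.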

Forcing a remerge

A merge is performed automatically when a checkpoint is exported for the first time. Subsequent exports reuse the existing merge data for speed. As a result, if a later export requests different fields with --fields, the output still contains the fields that were requested at the initial export.

To force a remerge and overwrite the existing merge data, include the --remerge / -r option.

Annotations

Use --anno to add variant annotations to the exported VCF. Specify the name of the annotation to be applied. Use biograph vdb anno import to import new annotations.

An annotation can only be applied if it uses a compatible genetic reference. Export will abort if no annotation exists with the given name and a reference that matches the study.

# This study uses GRCh38, but the only available dbSNP uses GRCh37
$ biograph vdb study export my_study --anno dbSNP
There is no annotation named dbSNP with a matching reference build.

$ biograph vdb anno list
anno_name            version      imported_on          build      annotations   aid                                  description
ClinVar              2020-10-03   2021-04-28 15:55:55  GRCh37     775850        c8d42e04-80ae-415d-a848-2b162dcb7f86
ClinVar              2020-10-03   2021-04-28 15:55:15  GRCh38     776050        ffff2168-89ac-4ff4-9ddd-788b26e4aa69
dbSNP                151          2021-04-28 16:23:59  GRCh37     685132792     c50ed9ad-443b-4321-9f37-7348a99275ec

# ClinVar for GRCh38 and GRCh37 are both available, and the correct annotation 
# is automatically applied.
$ biograph vdb study export my_study --anno ClinVar
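A script can check for a compatible annotation before starting an export. The sketch below parses the listing format shown above and assumes whitespace-separated columns with the build in the fifth field (imported_on occupies two fields) and annotation names without spaces. The helper name `anno_has_build` is hypothetical.

```shell
# Hypothetical helper: succeed if the listing on stdin contains an annotation
# named $1 whose build column equals $2.
# Assumed column layout: name, version, date, time, build, ...
anno_has_build() {
  awk -v name="$1" -v build="$2" '
    $1 == name && $5 == build { found = 1 }
    END { exit !found }
  '
}

# Example usage:
#   if biograph vdb anno list | anno_has_build dbSNP GRCh38; then
#     biograph vdb study export my_study --anno dbSNP -o my_study.vcf
#   else
#     echo "No dbSNP annotation for GRCh38 has been imported" >&2
#   fi
```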

Squared-off VCF

Exporting a large study can create an extremely large project-level VCF. While multi-terabyte VCF files are supported, they are unwieldy and can be prohibitively slow to process.

If your variant pipeline can process a single sample at a time, the --square-off option can save considerable time and space while requiring less expensive compute resources.

Use --square-off to specify a single sample name to be exported as VCF. The export is ordinarily repeated for every sample in the study, producing one VCF file per sample. Each file contains every merged variant entry but only a single sample column. These VCFs can be created in parallel on several compute nodes at once and processed independently.

This allows for efficient external processing of merged variants, for steps such as joint genotyping, variant effect prediction, or other transformations that are more complex than simple filtering.

Once the desired analysis has been completed, the resulting single sample VCF files may then be re-imported into the vdb for further processing.
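The per-sample exports can be scripted. The following sketch only prints the commands (a dry run), so they can be reviewed or handed to a job scheduler. It assumes --square-off can be combined with -o as the usage summary suggests; the helper name `square_off_commands` and the output file naming are made up for this example.

```shell
# Hypothetical helper: print one export command per sample.
# $1 is the study name; the remaining arguments are sample names.
square_off_commands() {
  study="$1"
  shift
  for sample in "$@"; do
    printf 'biograph vdb study export %s --square-off %s -o %s.%s.vcf\n' \
      "$study" "$sample" "$study" "$sample"
  done
}

# Dry run: print the commands for the three samples in the example study.
square_off_commands my_study HG002 HG003 HG004
```

Each printed line is independent of the others, so the commands can be distributed across compute nodes.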

Other miscellaneous options

Sort order

By default, chromosomes are sorted alphabetically (1, 10, 2, 22, 3, X). To use natural ordering (1, 2, 3, 10, 22, X), include --chromosomal / -c.

$ biograph vdb study export my_study -c | bgzip > my_study.vcf.gz
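The difference between the two orderings can be reproduced with standard tools. This sketch assumes GNU sort, whose -V (version sort) option gives a natural ordering comparable to --chromosomal. It illustrates the ordering only and says nothing about how biograph sorts internally.

```shell
# Chromosome names in arbitrary order.
chroms='10 X 2 1 22 3'

# Alphabetic (byte-wise) order, as in the default export.
alphabetic="$(printf '%s\n' $chroms | LC_ALL=C sort | paste -s -d ' ' -)"

# Natural order via GNU sort's version sort, comparable to --chromosomal.
natural="$(printf '%s\n' $chroms | LC_ALL=C sort -V | paste -s -d ' ' -)"

echo "alphabetic: $alphabetic"
echo "natural:    $natural"
```

Downstream tools that expect a particular contig order may reject or mishandle a VCF sorted the other way, so it is worth matching the order of the reference index in use.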

Skip the VCF header

# only include variant entries, no # VCF header lines
$ biograph vdb study export my_study --no-header

Getting more help

$ biograph vdb study export --help
usage: biograph vdb study export [-h] [-o OUTPUT] [-f] [-a ANNO] [-r] [-t TMP]
                                 [-c] [--fields FIELDS]
                                 [--checkpoint CHECKPOINT]
                                 [--square-off SQUARE_OFF] [--no-header]
                                 [--threads THREADS]
                                 study_name

Export a study to a VCF file

positional arguments:
  study_name            Name of the study

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Write output VCF to this file (default: STDOUT)
  -f, --force           Overwrite local output directory without confirmation
  -a ANNO, --anno ANNO  Annotate the output with this annotation
  -r, --remerge         Force a merge prior to export, required when changing
                        --fields (default: use pre-merged data if possible)
  -t TMP, --tmp TMP     Temporary directory (/tmp)
  -c, --chromosomal     Use natural order (1,2,3,10,22,X) instead of
                        alphabetic order (1,10,2,22,3,X)
  --fields FIELDS       List of FORMAT fields to export, separated by :
                        (default: all fields)
  --checkpoint CHECKPOINT
                        Export the study from this checkpoint (default:
                        latest)
  --square-off SQUARE_OFF
                        Create a 'square-off' VCF with this single sample
                        column
  --no-header           Do not write a VCF header
  --threads THREADS     Number of threads to use (auto)