study export - spiralgenetics/biograph GitHub Wiki
The biograph vdb study export command performs the following actions:
- All variant entries in the study at the given checkpoint are merged, creating a single representation for every unique chrom + pos + ref + alt.
- Merged variants are optionally annotated with a desired vdb annotation.
- The merged and optionally annotated results are exported to a local VCF file.
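The merge step in the list above can be pictured as grouping per-sample entries under a shared key. The following is a minimal sketch of that idea only, not BioGraph's implementation; the entry layout (dicts with `chrom`, `pos`, `ref`, `alt`, `sample`, `gt`) and the function name `merge_variants` are assumptions made for illustration.

```python
from collections import defaultdict


def merge_variants(entries):
    """Group per-sample entries by (chrom, pos, ref, alt).

    Hypothetical entry layout: a dict with the keys
    chrom, pos, ref, alt, sample and gt.
    """
    merged = defaultdict(dict)
    for entry in entries:
        key = (entry["chrom"], entry["pos"], entry["ref"], entry["alt"])
        # One merged record per unique key, one genotype per sample.
        merged[key][entry["sample"]] = entry["gt"]
    return dict(merged)


entries = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "sample": "HG002", "gt": "0/1"},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "sample": "HG003", "gt": "1/1"},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "sample": "HG004", "gt": "0/1"},
]
merged = merge_variants(entries)
for key, genotypes in merged.items():
    print(key, genotypes)
```

The first two entries share chrom, pos, ref and alt, so they collapse into one record with two sample genotypes. The third has a different alt and stays separate.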
The duration of the merge process depends on the number of variants in the study, scaling at about 30 to 40 seconds per sample for WGS. Smaller studies merge in a minute or two, while a large study of 600 samples should merge in about 15 minutes.
Sorted, uncompressed VCF is written to STDOUT by default. This may be piped through bgzip for compression, or written directly to a file using --output / -o.
$ biograph vdb study export my_study -o my_study.vcf
$ biograph vdb study export my_study | bgzip > my_study.vcf.gz
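Either form of output can be read downstream with the Python standard library, since bgzip output is gzip-compatible. The helper names `open_vcf` and `count_records` below are hypothetical and not part of BioGraph; this is a sketch for a quick sanity check of an exported file.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(lines):
    """Count variant lines, skipping blank lines and '#' header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Example usage, assuming the export above has already been written:
#   with open_vcf("my_study.vcf.gz") as handle:
#       print(count_records(handle))
```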
The latest checkpoint is used for export by default. Use --checkpoint to export an earlier checkpoint. This can be useful for comparing the merged output before and after a filtering step, without the need to apply additional filters.
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-18 13:02:51
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
2: include GT = '0/1'
3: include SVTYPE = 'INS'
sample_name variant_count
HG002 16447
HG003 15883
HG004 18035
# export het SV inserts only
$ biograph vdb study export my_study
# export all hets
$ biograph vdb study export my_study --checkpoint 2
# export all variants
$ biograph vdb study export my_study --checkpoint 1
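One way to think about the three exports above is that each checkpoint is the result of applying every filter up to that point. The sketch below is a conceptual model only: how vdb actually stores checkpoints is not described here, and the variant dicts and `apply_checkpoints` function are made up for illustration.

```python
def apply_checkpoints(variants, filters, checkpoint=None):
    """Return the variants visible at a checkpoint.

    Checkpoint 1 is the unfiltered import; checkpoint n has the first
    n - 1 filters applied. None selects the latest checkpoint.
    """
    latest = len(filters) + 1
    if checkpoint is None:
        checkpoint = latest
    if not 1 <= checkpoint <= latest:
        raise ValueError(f"checkpoint must be between 1 and {latest}")
    result = list(variants)
    for keep in filters[: checkpoint - 1]:
        result = [v for v in result if keep(v)]
    return result


# Filters mirroring checkpoints 2 and 3 in the listing above.
filters = [
    lambda v: v["GT"] == "0/1",
    lambda v: v["SVTYPE"] == "INS",
]
variants = [
    {"id": "a", "GT": "0/1", "SVTYPE": "INS"},
    {"id": "b", "GT": "0/1", "SVTYPE": "DEL"},
    {"id": "c", "GT": "1/1", "SVTYPE": "INS"},
]

print([v["id"] for v in apply_checkpoints(variants, filters)])     # latest
print([v["id"] for v in apply_checkpoints(variants, filters, 2)])  # all hets
print([v["id"] for v in apply_checkpoints(variants, filters, 1)])  # everything
```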
If not all FORMAT fields are required, use the --fields option to specify a colon-separated list of desired fields. Using fewer fields results in a smaller VCF and faster export times. The GT field is always included in the output, even if it is not specified.
# Only include genotype, depth, and allelic depth
$ biograph vdb study export my_study --fields GT:DP:AD
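To illustrate what restricting FORMAT fields means for a single sample column, here is a small sketch that keeps a subset of fields and always retains GT. It shows the general idea rather than BioGraph's code; `select_format_fields` is a hypothetical helper, and keeping the fields in their original order is an assumption.

```python
def select_format_fields(format_keys, sample_value, wanted):
    """Keep only the wanted FORMAT fields, always including GT.

    format_keys and sample_value are the colon-separated FORMAT and
    sample columns of one VCF line; wanted is a set of field names.
    """
    keys = format_keys.split(":")
    values = sample_value.split(":")
    by_key = dict(zip(keys, values))
    keep = [k for k in keys if k == "GT" or k in wanted]
    # Trailing fields may be omitted in a sample column; use '.' for those.
    return ":".join(keep), ":".join(by_key.get(k, ".") for k in keep)


wanted = set("DP:AD".split(":"))
new_format, new_sample = select_format_fields("GT:GQ:DP:AD", "0/1:99:30:14,16", wanted)
print(new_format, new_sample)
```

Here GQ is dropped, while GT survives even though it was not listed in `wanted`.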
A merge is performed automatically when a checkpoint is exported for the first time. Subsequent exports reuse the existing merge data for speed. Note that the merge data retains the fields requested at the initial export: if a later export passes a different --fields list, the output still contains the originally requested fields.
To change the exported fields, or otherwise force a remerge and overwrite the existing merge data, include the --remerge / -r option.
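The reuse behavior can be modeled as a cache keyed by checkpoint. The toy class below is only a mental model of the documented behavior, not how BioGraph stores merge data; `MergeCache` and its `export` method are hypothetical names.

```python
class MergeCache:
    """Toy model: merge data is stored per checkpoint and reused."""

    def __init__(self):
        self._fields_by_checkpoint = {}

    def export(self, checkpoint, fields, remerge=False):
        """Return the fields that an export would actually contain."""
        if remerge or checkpoint not in self._fields_by_checkpoint:
            # First export of this checkpoint, or a forced remerge.
            self._fields_by_checkpoint[checkpoint] = tuple(fields)
        return self._fields_by_checkpoint[checkpoint]


cache = MergeCache()
first = cache.export(3, ["GT", "DP"])                  # merge happens here
second = cache.export(3, ["GT", "AD"])                 # reuses the first merge
third = cache.export(3, ["GT", "AD"], remerge=True)    # merge is rebuilt
print(first, second, third)
```

The second call asks for different fields but still gets the fields from the first merge, which is why a changed --fields list needs --remerge.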
Use --anno to add variant annotations to the exported VCF. Specify the name of the annotation to be applied. Use biograph vdb anno import to import new annotations.
An annotation can only be applied if it uses a compatible genetic reference. Export will abort if no annotation exists with the given name and a reference that matches the study.
# This study uses GRCh38, but the only available dbSNP uses GRCh37
$ biograph vdb study export my_study --anno dbSNP
There is no annotation named dbSNP with a matching reference build.
$ biograph vdb anno list
anno_name version imported_on build annotations aid description
ClinVar 2020-10-03 2021-04-28 15:55:55 GRCh37 775850 c8d42e04-80ae-415d-a848-2b162dcb7f86
ClinVar 2020-10-03 2021-04-28 15:55:15 GRCh38 776050 ffff2168-89ac-4ff4-9ddd-788b26e4aa69
dbSNP 151 2021-04-28 16:23:59 GRCh37 685132792 c50ed9ad-443b-4321-9f37-7348a99275ec
# ClinVar for GRCh38 and GRCh37 are both available, and the correct annotation
# is automatically applied.
$ biograph vdb study export my_study --anno ClinVar
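The selection rule in the two examples above amounts to matching on both the annotation name and the reference build. A minimal sketch follows, using the rows from the listing; `pick_annotation` is a hypothetical function, and returning the first match when several versions exist is an assumption, since that case is not described here.

```python
ANNOTATIONS = [
    {"anno_name": "ClinVar", "version": "2020-10-03", "build": "GRCh37",
     "aid": "c8d42e04-80ae-415d-a848-2b162dcb7f86"},
    {"anno_name": "ClinVar", "version": "2020-10-03", "build": "GRCh38",
     "aid": "ffff2168-89ac-4ff4-9ddd-788b26e4aa69"},
    {"anno_name": "dbSNP", "version": "151", "build": "GRCh37",
     "aid": "c50ed9ad-443b-4321-9f37-7348a99275ec"},
]


def pick_annotation(annotations, name, study_build):
    """Return the first annotation matching both name and build."""
    for anno in annotations:
        if anno["anno_name"] == name and anno["build"] == study_build:
            return anno
    raise LookupError(
        f"There is no annotation named {name} with a matching reference build."
    )


print(pick_annotation(ANNOTATIONS, "ClinVar", "GRCh38")["aid"])
try:
    pick_annotation(ANNOTATIONS, "dbSNP", "GRCh38")
except LookupError as err:
    print(err)
```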
Exporting a large study can create an extremely large project-level VCF. While multi-terabyte VCF files are supported, they are unwieldy and can be prohibitively slow to process.
If your variant pipeline can process a single sample at a time, the --square-off option can save considerable time and space while requiring less expensive compute resources.
Use --square-off to specify a single sample name to be exported as VCF. This would ordinarily be repeated for every sample in the study, producing one VCF file per sample. Each file contains every merged variant entry but only a single sample column. These VCFs can be created in parallel on several compute nodes at once and processed independently.
This allows for efficient external processing of merged variants, for steps such as joint genotyping, variant effect prediction, or other transformations that are more complex than simple filtering.
Once the desired analysis has been completed, the resulting single sample VCF files may then be re-imported into the vdb for further processing.
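As a sketch of the per-sample workflow described above, the snippet below builds one export command per sample, using only the --square-off and -o options. The output file naming (`<study>.<sample>.vcf`) and the helper name `square_off_commands` are assumptions; the commands are only printed here, not executed.

```python
import shlex


def square_off_commands(study, samples):
    """Build one 'study export --square-off' command per sample."""
    return [
        ["biograph", "vdb", "study", "export", study,
         "--square-off", sample,
         "-o", f"{study}.{sample}.vcf"]  # hypothetical naming scheme
        for sample in samples
    ]


commands = square_off_commands("my_study", ["HG002", "HG003", "HG004"])
for cmd in commands:
    # Each command is independent, so it could be handed to a separate
    # compute node or run locally with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```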
The default sort order uses alphabetic sorting for chromosome order. To use natural ordering, include --chromosomal / -c.
$ biograph vdb study export my_study -c | bgzip > my_study.vcf.gz
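The difference between the two orderings can be reproduced with a short sort-key sketch. It illustrates the concept only: how BioGraph orders non-numeric contigs relative to each other is not stated here, so placing them after the numeric chromosomes in alphabetic order is an assumption, as is the helper name `natural_key`.

```python
def natural_key(chrom):
    """Sort key: numeric chromosomes by value, then the rest by name."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


chroms = ["X", "22", "3", "10", "2", "1"]
alphabetic = sorted(chroms)                  # plain string comparison
natural = sorted(chroms, key=natural_key)    # numeric-aware comparison
print(alphabetic)
print(natural)
```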
# only include variant entries, no # VCF header lines
$ biograph vdb study export my_study --no-header
$ biograph vdb study export --help
usage: biograph vdb study export [-h] [-o OUTPUT] [-f] [-a ANNO] [-r] [-t TMP]
[-c] [--fields FIELDS]
[--checkpoint CHECKPOINT]
[--square-off SQUARE_OFF] [--no-header]
[--threads THREADS]
study_name
Export a study to a VCF file
positional arguments:
study_name Name of the study
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
Write output VCF to this file (default: STDOUT)
-f, --force Overwrite local output directory without confirmation
-a ANNO, --anno ANNO Annotate the output with this annotation
-r, --remerge Force a merge prior to export, required when changing
--fields (default: use pre-merged data if possible)
-t TMP, --tmp TMP Temporary directory (/tmp)
-c, --chromosomal Use natural order (1,2,3,10,22,X) instead of
alphabetic order (1,10,2,22,3,X)
--fields FIELDS List of FORMAT fields to export, separated by :
(default: all fields)
--checkpoint CHECKPOINT
Export the study from this checkpoint (default:
latest)
--square-off SQUARE_OFF
Create a 'square-off' VCF with this single sample
column
--no-header Do not write a VCF header
--threads THREADS Number of threads to use (auto)