tsv: creating a spreadsheet from a filtered VCF - brentp/slivar GitHub Wiki
slivar provides flexible filtering of VCFs. For a final variant-by-variant analysis, though,
clinicians and analysts usually prefer to have the data in a spreadsheet.
slivar tsv enables this.
Human-readable output
The slivar tsv subcommand converts a filtered VCF into a tab-separated file that a clinician can open as a spreadsheet.
This command can also use the gene annotations from VEP or bcftools and add other annotations using the gene name. For example, we can create a gene -> pLI lookup with this command:
wget -qO - https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz \
| zcat \
| cut -f 1,21,24 | tail -n+2 \
| awk '{ printf("%s\tpLI=%.3g;oe_lof=%.5g\n", $1, $2, $3)}' > pli.lookup
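To make the layout of pli.lookup concrete, the sketch below runs the same awk program on a single made-up row. The gene name and both values are hypothetical and chosen only to show the rounding; they are not taken from gnomAD.

```shell
# Hypothetical input row: gene, pLI, oe_lof (values are made up for illustration).
row=$(printf 'GENE1\t0.99876\t0.123456')

# Same awk program as in the pipeline above: the gene goes in column 1 and
# the formatted annotations go in column 2.
entry=$(printf '%s\n' "$row" \
  | awk '{ printf("%s\tpLI=%.3g;oe_lof=%.5g\n", $1, $2, $3)}')

printf '%s\n' "$entry"
# prints: GENE1<TAB>pLI=0.999;oe_lof=0.12346
```

Each line of the lookup is therefore a gene name, a tab, and the text to report for that gene.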
The slivar tsv command allows specifying many of these gene -> value lookups. For example, it's often useful to have the gene description:
wget -qO - ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id \
| cut -f 2,5 \
| grep -v ^$'\t' > clinvar_gene_desc.txt
# Comments cannot be interleaved with a backslash-continued command, so the flags are described here:
# -s: indicate which INFO fields were added in previous slivar commands
# -i: any info fields to add
# -c: BCSQ from bcftools, or CSQ if VEP was used
# --csq-column: allows reporting specific entries from the |-delimited CSQ string
# -g: this will lookup the pLI and description using the gene and add a column for each
slivar tsv \
-s denovo \
-s x_denovo \
-s recessive \
-s x_recessive \
-i gnomad_popmax_af -i gnomad_popmax_af_filter -i gnomad_nhomalt \
-c BCSQ \
--csq-column ALLELE \
-g pli.lookup \
-g clinvar_gene_desc.txt \
-p $ped \
vcfs/$cohort.vcf > $cohort-variants.tsv
# repeat for compound-hets VCF
slivar tsv \
-s slivar_comphet \
-i gnomad_popmax_af -i gnomad_popmax_af_filter -i gnomad_nhomalt \
-c BCSQ \
--csq-column ALLELE \
-g pli.lookup \
-g clinvar_gene_desc.txt \
-p $ped \
vcfs/$cohort.ch.vcf > $cohort-compound-hets.tsv
These two files will contain the same columns, so they can be concatenated as needed.
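One way to concatenate the two outputs is sketched below. It assumes each file starts with exactly one header line, which is worth checking against your own output first. The helper name concat_tsv is hypothetical and not part of slivar.

```shell
# concat_tsv OUT IN1 [IN2 ...]
# Hypothetical helper: copy IN1 in full, then append every later file
# without its first line, so the header appears only once in OUT.
# Assumption: each input starts with exactly one header line.
concat_tsv() {
  _out=$1
  shift
  # FNR == 1 marks the first line of each file; NR != 1 exempts the very
  # first line read overall, so only repeated headers are skipped.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$_out"
}
```

For the files above, this could be called as `concat_tsv $cohort-all.tsv $cohort-variants.tsv $cohort-compound-hets.tsv`, where the output name $cohort-all.tsv is only an example.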