Flags and Settings - cancerit/cgpCaVEManPostProcessing GitHub Wiki
This page describes the flags that can be applied to substitution data and the settings available to modify them.
This section explains the contents of the flag.to.vcf.convert.ini file and adds a more detailed explanation of why we use each flag.
- id: the ID of the flag. This value appears in the FILTER field should the position fail the flag.
- description: may contain variable names. These correspond to variables in the flag.to.vcf.convert.ini file and are the variables that can be adjusted per sequence type or species.
- info=[1|0]: info=1 defines that this is an INFO field flag ('soft flag') rather than a true filter (FILTER field).
- If info=1, more fields are permitted:
  - type: [Flag|Float etc.] The type of INFO entry this corresponds to. See the VCF spec.
  - val: [1|0] If 1, this INFO field entry has a corresponding value.
- intersect: 1 if this flag requires checking against a reference file.
- optname: The short name of the command-line option containing the file to intersect with.
- filename: The file name passed to the command-line option containing the file to intersect with.
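As an illustration of how the fields above relate to VCF output, the following minimal sketch turns a flag definition into a VCF header line. The function name, the dictionary layout and the mapping of val onto the VCF Number attribute are assumptions made for this example; they are not taken from the tool's source.

```python
def vcf_header_line(flag: dict) -> str:
    """Build a VCF header line for one flag definition (illustrative only)."""
    desc = flag["description"]
    if flag.get("info", 0) == 1:
        # Assumption: val=1 means the INFO entry carries one value, val=0 none.
        number = 1 if flag.get("val", 0) == 1 else 0
        return (f'##INFO=<ID={flag["id"]},Number={number},'
                f'Type={flag["type"]},Description="{desc}">')
    # info=0: a true filter, reported in the FILTER field
    return f'##FILTER=<ID={flag["id"]},Description="{desc}">'


# Hypothetical in-memory copies of two definitions from this page
simple_repeat = {
    "id": "SR", "info": 1, "val": 0, "type": "Flag",
    "description": "Position falls within a simple repeat using the supplied bed file",
}
germline_indel = {
    "id": "GI", "info": 0,
    "description": "Position falls within a germline indel using the supplied bed file",
}

for definition in (simple_repeat, germline_indel):
    print(vcf_header_line(definition))
```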
This flag ensures we have a reasonable depth of real alleles in the tumour sample, using base quality as a filter for realness.
- info=0
- id=DTH
- description=Less than depthCutoffProportion mutant alleles were >= minDepthQual base quality
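The rule in the description can be sketched as follows. This is a simplified reading of the description text, not the tool's implementation; the function name, the handling of an empty input and the example thresholds are assumptions.

```python
def fails_depth_flag(mutant_base_quals, min_depth_qual, depth_cutoff_proportion):
    """Return True if too few mutant alleles reach the base quality threshold."""
    if not mutant_base_quals:
        return True  # assumption: no mutant alleles means no supporting depth
    good = sum(q >= min_depth_qual for q in mutant_base_quals)
    return good / len(mutant_base_quals) < depth_cutoff_proportion


# Illustrative thresholds only; real values come from the ini file
print(fails_depth_flag([30, 30, 10, 10], min_depth_qual=25, depth_cutoff_proportion=0.333))
print(fails_depth_flag([30, 10, 10, 10], min_depth_qual=25, depth_cutoff_proportion=0.333))
```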
This flag tries to account for the drop in accuracy towards the end of each read.
- info=0
- id=RP
- description=Coverage was less than minRdPosDepth and no mutant alleles were found in the first 2/3 of a read (shifted readPosBeginningOfReadIgnoreProportion from the start and extended readPosTwoThirdsOfReadExtendProportion more than 2/3 of the read length)
Looks for reads in the normal sample that contain high base quality mutant alleles.
- info=0
- id=MN
- description=More than maxMatchedNormalAlleleProportion of mutant alleles that were >= minNormMutAllelequal base quality found in the matched normal
Presence of the motif GGC[AT]G in sequenced orientation causes a drop in mean base quality and a large increase in base miscalling. This leads to more false positives if not filtered.
- info=0
- id=PT
- description=Mutant alleles all on one direction of read (1rd allowed on opposite strand) and in second half of the read. Second half of read contains the motif GGC[AT]G in sequenced orientation and the mean base quality of all bases after the motif was less than pentamerMinPassAvgQual
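The per-read part of this check (motif in the second half of the read, mean base quality after it) can be sketched as below. The function name is hypothetical, and the sketch assumes the sequence and qualities are already in sequenced orientation; the strand check from the description is not covered.

```python
import re

MOTIF = re.compile(r"GGC[AT]G")


def mean_quality_after_motif(seq, quals):
    """Mean base quality after the first motif hit in the second half of a read.

    Returns None when the motif is absent from the second half or when no
    bases follow it.
    """
    half = len(seq) // 2
    match = MOTIF.search(seq, half)  # only consider hits starting in the second half
    if match is None:
        return None
    tail = quals[match.end():]
    if not tail:
        return None
    return sum(tail) / len(tail)


read_seq = "ACGTACGTAC" + "GGCAG" + "TTTT"
read_quals = [30] * 15 + [10, 10, 20, 20]
print(mean_quality_after_motif(read_seq, read_quals))
```

The returned mean would then be compared against pentamerMinPassAvgQual.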
Ensures good evidence is present in the tumour for a mutation by checking the mean mapping quality of mutant allele containing reads.
- info=0
- id=MQ
- description=Mean mapping quality of the mutant allele reads was < minPassAvgMapQual
Germline indels can appear to be substitutions as part of the mapping process. This flag excludes germline indel positions.
- info=0
- id=GI
- description=Position falls within a germline indel using the supplied bed file
- intersect=1
- optname=g
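Flags with intersect=1 test whether the variant position falls inside an interval from the supplied bed file. The sketch below shows that test on an in-memory list of intervals for a single contig, as a stand-in for a tabix lookup. The function name is hypothetical. BED intervals are 0-based and half-open while VCF positions are 1-based, so the position is shifted before comparison.

```python
from bisect import bisect_right


def position_in_bed(pos_1based, intervals):
    """Check a 1-based VCF position against BED intervals on one contig.

    intervals: sorted, non-overlapping (start, end) pairs, 0-based half-open.
    """
    pos0 = pos_1based - 1  # convert VCF coordinate to BED coordinate
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos0) - 1  # last interval starting at or before pos0
    return i >= 0 and pos0 < intervals[i][1]


indel_regions = [(100, 110), (200, 205)]
print(position_in_bed(101, indel_regions))
print(position_in_bed(111, indel_regions))
```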
A high proportion of indels in mapped reads that also cover the mutant position can lead to an increased likelihood of false positives.
- info=0
- id=TI
- description=More than maxTumIndelProportion percent of reads covering this position contained an indel according to mapping
Useful only in paired-end, non-PCR-based sequencing types. Checks for piling up of mutant alleles at the same read position.
- info=0
- id=SRP
- description=More than samePosMaxPercent percent of reads contain the mutant allele at the same read position
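A minimal sketch of the pile-up check, assuming the percentage is taken over the reads carrying the mutant allele (the description does not state the denominator). The function name is hypothetical.

```python
from collections import Counter


def fails_same_read_pos(mutant_read_positions, same_pos_max_percent):
    """True if too many mutant alleles sit at the same position within the read."""
    if not mutant_read_positions:
        return False
    _, top_count = Counter(mutant_read_positions).most_common(1)[0]
    return 100.0 * top_count / len(mutant_read_positions) > same_pos_max_percent


# Read offsets of the mutant allele in each supporting read (made-up values)
print(fails_same_read_pos([5, 5, 5, 5, 5, 40], same_pos_max_percent=80))
```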
Simple repeats can cause sequencing to encounter something similar to slippage in capillary sequencing. Flagging simple repeats can remove false positives.
- info=1
- val=0
- id=SR
- type=Flag
- description=Position falls within a simple repeat using the supplied bed file
- intersect=1
- optname=b
- filename=simple_repeats.bed.gz
Centromeric repeats may lead to mismapped regions. Excluding these reduces false positives.
- info=0
- id=CR
- description=Position falls within a centromeric repeat using the supplied bed file
- optname=b
- filename=centromeric_repeats.bed.gz
- intersect=1
A soft flag (in the INFO field) that marks a position known to contain a SNP, found by intersecting with the provided bgzipped, tabix-indexed bed file of SNPs.
- info=1
- val=0
- id=SNP
- type=Flag
- description=Position matches a dbSNP entry using the supplied bed file
- intersect=1
- optname=b
- filename=snps.bed.gz
Phasing is a sequencing artefact where 'bleed through' of the adjacent bases causes a position to appear as a SNP. This is most commonly seen where the middle base of a triplet appears to be the same as neighbouring bases (ACA -> AAA).
- info=0
- id=PH
- description=Mutant reads were on one strand (permitted proportion on other strand: maxPhasingMinorityStrandReadProportion), and mean mutant base quality was less than minPassPhaseQual
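The two conditions in the description can be sketched as follows. The function name, the inclusive boundary on the minority-strand proportion and the example thresholds are assumptions for illustration.

```python
def fails_phasing(fwd_quals, rev_quals, max_minority_prop, min_pass_phase_qual):
    """True if mutant reads are (almost) single-stranded and of low mean quality.

    fwd_quals / rev_quals: mutant-allele base qualities per strand.
    """
    total = len(fwd_quals) + len(rev_quals)
    if total == 0:
        return False
    minority = min(len(fwd_quals), len(rev_quals)) / total
    mean_qual = (sum(fwd_quals) + sum(rev_quals)) / total
    return minority <= max_minority_prop and mean_qual < min_pass_phase_qual


print(fails_phasing([15, 18, 20, 12], [], max_minority_prop=0.04, min_pass_phase_qual=21))
```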
Useful in sequencing types that target genes, introns and similar regions (e.g. WXS), where positions are expected to be annotatable. This flag uses a bgzipped, tabix-indexed bed file of annotatable positions to exclude those positions that can't be annotated. NB: This is NOT the same as the coding flag.
- info=0
- id=AN
- description=Position could not be annotated against a transcript using the supplied bed file
- intersect=1
- optname=ab
- filename=gene_regions.bed.gz
Filters regions of high sequencing depth, which are usually caused by mismapping in the alignment process and are therefore highly likely to contain false positives.
- info=0
- id=HSD
- description=Position falls within a high sequencing depth region using the supplied bed file
- intersect=1
- optname=b
- filename=hi_seq_depth.bed.gz
Useful in sequencing types targeting coding regions etc (eg AMPLICON) where positions are expected to be coding. This soft flag (INFO field) uses a bgzipped, tabix indexed bed file of coding positions to exclude those positions that aren't annotated as coding.
- info=1
- val=0
- id=CA
- description=Position could not be annotated to a coding region of a transcript using the supplied bed file
- intersect=1
- type=Flag
- optname=ab
- filename=codingexon_regions.sub.bed.gz
Excludes mutations where the variant allele fraction in the tumour sample is less than 10%. Useful where specificity is required. It is inadvisable to apply this flag to data where subclonal mutations are being investigated.
- info=0
- id=LMB
- description=Proportion of mutant alleles was < 10 pct
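As a small worked example of the 10% cutoff, the sketch below computes the tumour variant allele fraction from read counts. The function name and the treatment of zero depth are assumptions.

```python
def fails_low_mut_burden(mut_reads, total_reads, min_pct=10.0):
    """True if the tumour variant allele fraction is below min_pct percent."""
    if total_reads == 0:
        return True  # assumption: no coverage cannot support the call
    return 100.0 * mut_reads / total_reads < min_pct


print(fails_low_mut_burden(4, 50))  # 8% of reads carry the mutant allele
print(fails_low_mut_burden(5, 50))  # exactly 10%
```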
Excludes mutations where the change from reference to mutant allele matches an entry in the provided unmatched normal panel. The flag is named VCF for historical reasons; a bed file is more efficient.
- info=0
- id=VUM
- description=Position has >= vcfUnmatchedMinMutAlleleCvg mutant allele present in at least vcfUnmatchedMinSamplePct percent unmatched normal samples in the unmatched VCF.
- optname=umv
Removes mutations with coverage on both strands but evidence on only one strand.
- info=0
- id=SE
- description=Coverage is >= minSingleEndCoverage on each strand but mutant allele is only present on one strand
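The condition in the description can be written out directly. This is a sketch from the description text only; the function and parameter names are hypothetical.

```python
def fails_single_end(fwd_cov, rev_cov, fwd_mut, rev_mut, min_single_end_coverage):
    """True if both strands are covered but only one carries the mutant allele."""
    both_covered = (fwd_cov >= min_single_end_coverage
                    and rev_cov >= min_single_end_coverage)
    one_strand_only = (fwd_mut > 0) != (rev_mut > 0)
    return both_covered and one_strand_only


print(fails_single_end(fwd_cov=20, rev_cov=18, fwd_mut=6, rev_mut=0,
                       min_single_end_coverage=10))
```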
Filters called positions where the normal sample variant allele fraction (VAF) and the tumour sample VAF are not sufficiently different.
- info=0
- id=MNP
- description=Tumour sample mutant allele proportion - normal sample mutant allele proportion < matchedNormalMaxMutProportion
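The comparison in the description amounts to a difference of two proportions. A minimal sketch, assuming read counts as input and a zero proportion where depth is zero; the function name and example threshold are hypothetical.

```python
def fails_matched_normal_proportion(t_mut, t_depth, n_mut, n_depth, max_mut_proportion):
    """True if tumour VAF minus normal VAF is below the configured threshold."""
    t_vaf = t_mut / t_depth if t_depth else 0.0
    n_vaf = n_mut / n_depth if n_depth else 0.0
    return t_vaf - n_vaf < max_mut_proportion


# Tumour VAF 0.10, normal VAF 0.08: the difference is small, so the flag fails
print(fails_matched_normal_proportion(10, 100, 8, 100, max_mut_proportion=0.2))
```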
A soft flag (INFO field), this time with an associated value. Useful in BWA-mem mapped data, where the alignment score includes the number of clipped bases.
Excessive clipping can lead to false positives and poorly mapped reads. By providing the median alignment score of reads presenting the variant allele, we give the user the opportunity to filter on it. This value is adjusted for the length of the reads, whereas alnScoreMedianFlag is not. This allows a standard cutoff to be used across all data rather than one per read length.
- info=1
- val=1
- id=ASRD
- type=Float
- description=A soft flag median (read length adjusted) alignment score of reads showing the variant allele
A soft flag (INFO field), this time with an associated value. Useful in BWA-mem mapped data, this is another flag that takes the number of clipped bases in variant-supporting reads into account.
Excessive clipping can lead to false positives and poorly mapped reads. By providing the median count of clipped bases in reads presenting the variant allele, we give the user the opportunity to filter on it.
- info=1
- val=1
- type=Float
- id=CLPM
- description=A soft flag median number of soft clipped bases in variant supporting reads
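To show what a median soft-clip count looks like in practice, the sketch below counts soft-clipped bases (the S operation) in CIGAR strings and takes the median over the supporting reads. It works on plain CIGAR strings rather than a bam file, and the function names are hypothetical.

```python
import re
from statistics import median

SOFT_CLIP = re.compile(r"(\d+)S")


def soft_clipped_bases(cigar):
    """Total number of soft-clipped bases in one CIGAR string."""
    return sum(int(n) for n in SOFT_CLIP.findall(cigar))


def median_soft_clip(cigars):
    """Median soft-clipped base count across variant-supporting reads."""
    return float(median([soft_clipped_bases(c) for c in cigars]))


supporting_reads = ["100M", "10S90M", "5S90M5S"]
print(median_soft_clip(supporting_reads))
```

An empty list of reads raises statistics.StatisticsError, so a caller would need to handle positions without supporting reads separately.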
A soft flag (INFO field), this time with an associated value. Useful in BWA-mem mapped data, where the alignment score includes the number of clipped bases.
Excessive clipping can lead to false positives and poorly mapped reads. By providing the median alignment score of reads presenting the variant allele, we give the user the opportunity to filter on it.
- info=1
- val=1
- type=Float
- id=ASMD
- description=A soft flag median alignment score of reads showing the variant allele
This new flag was developed alongside the DERMATLAS project. It is not applied by default and should be used with caution. It is like the MatchedNormalProportionFlag; however, instead of using bam file reads, the metrics in this flag are obtained from the CaVEMan VCF per-sample depth outputs.
- info=0
- id=CMNP
- description=Tumour sample mutant allele proportion - normal sample mutant allele proportion < maxCavemanMatchedNormalProportion (differs from MNP in using CaVEMan only seen reads as per VCF)
This new flag was developed alongside the DERMATLAS project. It is not applied by default and should be used with caution. CaVEMan sometimes calls false positive variants in misaligned reads (mostly near the beginning or end of a read) near indels. Some of our samples have a higher than average number of indels (many of them in repeats, due to microsatellite instability), so this is not rare. We also cannot use the 'SR' flag, as we do see real mutations in simple repeats. (A large number of our samples will have a high mutation rate due to various DNA repair deficiencies and UV light exposure.)
- info=0
- id=GAP
- description=If variant is within withinXBpOfDeletion of an indel in reads without the variant and the indel is present in at least minGapPresentInReads percent of total reads and no variant reads have the indel.
Developed alongside the DERMATLAS project. If included, only MNVs will be flagged using mnvFlag. After all flags under MNVFLAGLIST are applied to each SNV base within the MNV, mnvFlag fails the MNV only if ALL of its bases fail the applied flags. Should one base pass, the MNV itself will PASS. All failed flags are also stored under an INFO tag.
- info=0
- id=MNV
- description=If any base in an MNV passes all other SNV flags, pass this variant. If all fail fail and merge failure list as well as adding INFO fields holding failed flags per base.
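The pass/fail logic for an MNV can be sketched as below, taking the per-base sets of failed flag IDs as input. The function name and the return shape are assumptions; how the tool writes the merged list to the FILTER and INFO fields is not shown.

```python
def mnv_result(per_base_failures):
    """Decide the MNV outcome from the failed flag IDs of each SNV base.

    per_base_failures: one set of failed flag IDs per base in the MNV.
    Returns (passed, merged_failures).
    """
    if any(not failed for failed in per_base_failures):
        return True, []  # at least one base passed every applied flag
    merged = sorted(set().union(*per_base_failures))
    return False, merged


print(mnv_result([{"DTH"}, set()]))
print(mnv_result([{"DTH"}, {"MQ", "DTH"}]))
```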
All the settings highlighted in the flags section above are available for modification. We provide an example file for human in our distribution (flag.to.vcf.convert.ini). The file has several sections that are required on a per-species, per-sequence-type basis. The naming pattern of the sections is <SPECIES>_<SEQ_TYPE> <SECTION_NAME>, so the flag list section for human genomic data would be HUMAN_WGS FLAGLIST.
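The naming pattern can be expressed as a small helper. The function name and the upper-casing of the inputs are assumptions made for this example.

```python
def section_name(species, seq_type, section):
    """Compose a section title following <SPECIES>_<SEQ_TYPE> <SECTION_NAME>."""
    return f"{species.upper()}_{seq_type.upper()} {section.upper()}"


print(section_name("human", "wgs", "flaglist"))
```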
Contains the parameters listed above and related values.
The list of flags to be applied to this sequencing type/species combination. The flag name is equal to the section title in the flag.to.vcf.convert.ini file.
List of flags to be applied to MNVs (only applied if mnvFlag is included in FLAGLIST).
The name (without location) of the bgzipped bed files used by certain flags (for example snps.gz).