VCF Format Deep Dive - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

VCF Format Deep Dive

  • Mandatory columns
    Every VCF data line begins with these 8 tab-delimited fields, in this order:

    1. CHROM – Reference sequence name (e.g. NC_000913.3)
    2. POS – 1-based position of the variant on CHROM
    3. ID – Variant identifier (e.g. dbSNP rsID) or . if none
    4. REF – Reference allele (one or more bases)
    5. ALT – Alternate allele(s), comma-separated if multiple
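    6. QUAL – Phred-scaled quality score for the assertion made in ALT, or . if unknown
    7. FILTER – PASS if the record passed all filters, a semicolon-separated list of the filters it failed, or . if no filters were applied
    8. INFO – Semicolon-separated additional annotations, or . if none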
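
As a minimal sketch of how the eight fixed columns could be read, the snippet below splits one data line on tabs and converts the missing-value marker `.` to `None` or an empty list. The names `FixedFields` and `parse_fixed_fields` are hypothetical, chosen for illustration only, and the input line is the example record from this page with tabs as separators.

```python
from typing import List, NamedTuple, Optional


class FixedFields(NamedTuple):
    """The eight mandatory VCF columns (hypothetical container for this sketch)."""
    chrom: str
    pos: int                   # 1-based position
    variant_id: Optional[str]  # None when the ID column is "."
    ref: str
    alt: List[str]             # one entry per alternate allele
    qual: Optional[float]      # None when QUAL is "."
    filters: List[str]         # ["PASS"], failed filter names, or [] for "."
    info: str                  # left unparsed here


def parse_fixed_fields(line: str) -> FixedFields:
    """Split one tab-delimited VCF data line into its first eight columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(cols)}")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    return FixedFields(
        chrom=chrom,
        pos=int(pos),
        variant_id=None if vid == "." else vid,
        ref=ref,
        alt=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=[] if flt == "." else flt.split(";"),
        info=info,
    )


record = parse_fixed_fields("NC_000913.3\t1234\t.\tA\tG\t60\tPASS\tDP=42;AF=0.25")
print(record)
```

Splitting strictly on tabs matters because the format is tab-delimited; a line copied from a web page with spaces in place of tabs will fail the column-count check above.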
  • INFO and FORMAT subfields

    • INFO is a list of key=value (or flag) pairs describing each variant.
      • e.g. DP=42;AF=0.25 means a total read depth of 42 and an alternate allele frequency of 0.25 (25%).
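    • FORMAT is the 9th column, present when the file contains genotype data. It lists the colon-separated keys of the per-sample subfields (e.g. GT:DP:GQ).
      • Each sample column that follows holds the corresponding values in the same order (e.g. 0/1:42:99).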
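
The INFO layout described above can be unpacked with a few lines of Python. This is a sketch under simple assumptions: `parse_info` is a hypothetical helper, values are kept as strings because their types are declared in the file's header, and keys without `=` are treated as flags and stored as `True`. The `DB` flag is added to the page's example string only to show the flag case.

```python
def parse_info(info: str) -> dict:
    """Turn an INFO string such as 'DP=42;AF=0.25;DB' into a dict."""
    if info == ".":  # "." marks an empty INFO column
        return {}
    result = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # A key with no "=" is a flag; record its presence as True.
        result[key] = value if sep else True
    return result


info = parse_info("DP=42;AF=0.25;DB")
print(info)
```

Converting `DP` to an integer or `AF` to a float is left out on purpose, since a robust parser would take the types from the header rather than guess them.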
  • Genotype encoding (FORMAT fields)

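    • GT – Genotype call as allele indices (0 = REF, 1 = first ALT, and so on); 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate. A / separates unphased alleles and a | separates phased alleles.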
    • DP – Read depth at this site for that sample
    • GQ – Genotype quality (Phred-scaled confidence in the genotype call)
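
To illustrate how the FORMAT keys pair up with one sample column, and how a GT string maps to the zygosity labels above, here is a small sketch. The function names and the returned labels (`hom_ref`, `het`, `hom_alt`, `missing`) are hypothetical, and the classification assumes diploid calls.

```python
import re


def parse_sample(format_col: str, sample_col: str) -> dict:
    """Pair FORMAT keys with the values of one sample column."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    # zip stops at the shorter list, so omitted trailing values are simply absent.
    return dict(zip(keys, values))


def classify_genotype(gt: str) -> str:
    """Classify a GT string; assumes a diploid call (an assumption of this sketch)."""
    alleles = re.split(r"[/|]", gt)  # "/" is unphased, "|" is phased
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"


sample = parse_sample("GT:DP:GQ", "0/1:42:99")
print(sample, classify_genotype(sample["GT"]))
```

A call such as 1/2, with two different alternate alleles, is reported as `het` here; whether that is the right label depends on the analysis.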

Example VCF record (columns are separated by tabs in a real file; spaces are used here for alignment):

#CHROM       POS   ID  REF  ALT  QUAL  FILTER  INFO           FORMAT    sample1
NC_000913.3  1234  .   A    G    60    PASS    DP=42;AF=0.25  GT:DP:GQ  0/1:42:99
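
The example record can be read end to end with a short sketch that uses the `#CHROM` header line to name the sample columns. `parse_vcf_lines` is a hypothetical helper written for this page; it skips `##` meta lines, keeps only a few fields, and does no validation. For real data an established VCF library is likely the safer choice.

```python
def parse_vcf_lines(lines):
    """Parse header and data lines, attaching per-sample values by sample name."""
    samples = []
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        fmt_keys = cols[8].split(":") if len(cols) > 8 else []
        records.append({
            "CHROM": cols[0],
            "POS": int(cols[1]),
            "REF": cols[3],
            "ALT": cols[4].split(","),
            "samples": {
                name: dict(zip(fmt_keys, col.split(":")))
                for name, col in zip(samples, cols[9:])
            },
        })
    return records


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "NC_000913.3\t1234\t.\tA\tG\t60\tPASS\tDP=42;AF=0.25\tGT:DP:GQ\t0/1:42:99",
]
records = parse_vcf_lines(example)
print(records[0]["samples"]["sample1"])
```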