2 | File format specifications

2.1  |   Input files

VCF

Please adhere to the latest official VCF specification. Sample IDs are inferred from the last header line. Fields 1 and 2 are expected to be chromosome and position. The genotype field is expected to contain GT-formatted genotypes for winpca pca, or GL- or PL-formatted genotype likelihoods for winpca pcangsd. VCF files may be gzip/bgzip compressed. For orientation on how to generate a VCF, consider using our Snakemake workflow.

Example of a VCF file with GT and PL fields:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=33145951>
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
#CHROM   POS     ID   REF   ALT   QUAL   FILTER   INFO   FORMAT   ind_1          ind_2          ind_3          [...]
chr1     69964   .    G     T     1      PASS     .      GT:PL    0/0:0,15,172   0/0:0,18,186   0/0:0,51,255
chr1     70381   .    G     T     1      PASS     .      GT:PL    0/0:0,90,255   0/1:37,0,200   0/1:48,0,166
chr1     74191   .    C     T     1      PASS     .      GT:PL    0/0:0,9,80     0/0:0,27,224   0/0:0,45,255
chr1     83728   .    T     C     1      PASS     .      GT:PL    0/0:0,6,58     0/0:0,9,95     0/0:0,33,255
chr1     86817   .    C     T     1      PASS     .      GT:PL    0/0:0,27,245   0/0:0,24,241   0/0:0,48,255
chr1     87124   .    T     A     1      PASS     .      GT:PL    0/0:0,27,240   0/0:0,36,255   0/0:0,42,255
chr1     87362   .    G     A     1      PASS     .      GT:PL    0/0:0,18,166   0/0:0,27,255   0/0:0,27,198
chr1     87602   .    T     G     1      PASS     .      GT:PL    0/0:0,18,174   0/0:0,39,255   0/0:0,75,255
[...]
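As a rough illustration of the layout described above, the following Python sketch reads sample IDs from the last header line and pulls one FORMAT subfield (GT or PL) out of a record. The function names `parse_vcf_header` and `extract_format_field` are hypothetical and are not part of WinPCA; the sketch assumes tab-separated, uncompressed lines and that every sample carries all FORMAT subfields.

```python
def parse_vcf_header(lines):
    """Return sample IDs from the last header line (the one starting with '#CHROM')."""
    last_header = None
    for line in lines:
        if not line.startswith("#"):
            break
        last_header = line
    if last_header is None:
        raise ValueError("no header line found")
    # Columns 1-9 are fixed (CHROM ... FORMAT); sample columns start at column 10.
    return last_header.rstrip("\n").split("\t")[9:]


def extract_format_field(record, key):
    """Return (chrom, pos, values), where values holds the `key` subfield per sample."""
    fields = record.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    idx = fields[8].split(":").index(key)  # position of e.g. GT or PL within FORMAT
    values = [sample.split(":")[idx] for sample in fields[9:]]
    return chrom, pos, values


vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_1\tind_2\tind_3\n",
    "chr1\t70381\t.\tG\tT\t1\tPASS\t.\tGT:PL\t0/0:0,90,255\t0/1:37,0,200\t0/1:48,0,166\n",
]
samples = parse_vcf_header(vcf_lines)
chrom, pos, gts = extract_format_field(vcf_lines[-1], "GT")
_, _, pls = extract_format_field(vcf_lines[-1], "PL")
```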

BEAGLE

To generate a compatible BEAGLE file, follow the ANGSD instructions. The first field (marker column) must be formatted as {chromosome}_{position}, because WinPCA uses {position} to infer chromosomal coordinates. Fields 2 and 3 are ignored and can be filled with dummy data. The remaining fields are genotype likelihoods (GL), three columns per sample. BEAGLE files may be gzip/bgzip compressed.

Example of a BEAGLE file:

marker       allele1   allele2   ind_1   ind_1   ind_1   ind_2   ind_2   ind_2   ind_3   ind_3   ind_3   [...]
chr1_69964   0         3         0       15      172     0       18      186     0       51      255
chr1_70381   2         3         0       90      255     37      0       200     48      0       166
chr1_74191   2         3         0       9       80      0       27      224     0       45      255
chr1_83728   2         3         0       6       58      0       9       95      0       33      255
chr1_86817   1         2         0       27      245     0       24      241     0       48      255
chr1_87124   2         0         0       27      240     0       36      255     0       42      255
chr1_87362   1         2         0       18      166     0       27      255     0       27      198
chr1_87602   0         3         0       18      174     0       39      255     0       75      255
[...]
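The marker convention can be illustrated with a short Python sketch. Splitting at the last underscore is an assumption made here so that chromosome names containing underscores still work; how WinPCA itself parses the marker is not documented on this page. `split_marker` and `parse_beagle_line` are hypothetical helper names.

```python
def split_marker(marker):
    """Split '{chromosome}_{position}' at the LAST underscore (an assumption)."""
    chrom, pos = marker.rsplit("_", 1)
    return chrom, int(pos)


def parse_beagle_line(line):
    """Return (chrom, pos, per_sample), skipping the two allele columns."""
    fields = line.split()
    chrom, pos = split_marker(fields[0])
    likelihoods = [float(x) for x in fields[3:]]  # fields 2 and 3 are ignored
    if len(likelihoods) % 3 != 0:
        raise ValueError("expected three likelihood columns per sample")
    # Group the flat list into one (hom-ref, het, hom-alt) triple per sample.
    per_sample = [tuple(likelihoods[i:i + 3]) for i in range(0, len(likelihoods), 3)]
    return chrom, pos, per_sample


row = "chr1_70381   2   3   0   90   255   37   0   200   48   0   166"
chrom, pos, per_sample = parse_beagle_line(row)
```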

Custom TSV input file formats

In addition to VCF/BEAGLE, a custom TSV format is accepted for GT, GL and PL data. A single header line containing CHROM, POS and sample names is expected. Accordingly, fields 1 and 2 are expected to contain chromosome and position, and all remaining fields hold either genotypes at biallelic variants (one column per sample, encoded as 0=ref, 1=het, 2=alt, -1=missing) or genotype likelihoods (three GL or PL values per sample). TSV files may be gzip/bgzip compressed.

Example of a GT TSV file:

CHROM   POS     ind_1   ind_2   ind_3   [...]
chr1    69964   0       0       0
chr1    70381   0       1       1
chr1    74191   0       0       0
chr1    83728   0       0       0
chr1    86817   0       0       0
chr1    87124   0       0       0
chr1    87362   0       0       0
chr1    87602   0       0       0
[...]
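To show how the 0/1/2/-1 encoding relates to VCF-style genotype strings, here is a minimal Python sketch. `encode_gt` is a hypothetical helper, not a WinPCA function. Treating haploid or non-biallelic calls as missing is an assumption of this sketch; the page does not say how WinPCA handles such calls.

```python
import re


def encode_gt(gt):
    """Map a diploid GT string to 0 (ref), 1 (het), 2 (alt) or -1 (missing)."""
    alleles = re.split(r"[/|]", gt)  # accept unphased '/' and phased '|'
    if len(alleles) != 2 or "." in alleles:
        return -1
    if any(a not in ("0", "1") for a in alleles):
        return -1  # assumption: non-biallelic calls are treated as missing
    return int(alleles[0]) + int(alleles[1])


encoded = [encode_gt(gt) for gt in ["0/0", "0/1", "1|1", "./."]]
```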

Example of a PL TSV file:

CHROM   POS     ind_1   ind_1   ind_1   ind_2   ind_2   ind_2   ind_3   ind_3   ind_3   [...]
chr1    69964   0       15      172     0       18      186     0       51      255
chr1    70381   0       90      255     37      0       200     48      0       166
chr1    74191   0       9       80      0       27      224     0       45      255
chr1    83728   0       6       58      0       9       95      0       33      255
chr1    86817   0       27      245     0       24      241     0       48      255
chr1    87124   0       27      240     0       36      255     0       42      255
chr1    87362   0       18      166     0       27      255     0       27      198
chr1    87602   0       18      174     0       39      255     0       75      255
[...]
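The three-columns-per-sample layout can be unpacked as in the Python sketch below. It assumes tab-separated, already decompressed lines and that each sample name appears three times in a row in the header; `parse_likelihood_tsv` is a hypothetical name used only for illustration.

```python
def parse_likelihood_tsv(lines):
    """Return (samples, rows); each row is (chrom, pos, {sample: (v1, v2, v3)})."""
    header = lines[0].rstrip("\n").split("\t")
    samples = header[2::3]  # every third name, since each sample spans three columns
    rows = []
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        values = [float(v) for v in fields[2:]]
        if len(values) != 3 * len(samples):
            raise ValueError("expected three likelihood values per sample")
        triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
        rows.append((fields[0], int(fields[1]), dict(zip(samples, triples))))
    return samples, rows


pl_lines = [
    "CHROM\tPOS\tind_1\tind_1\tind_1\tind_2\tind_2\tind_2\n",
    "chr1\t69964\t0\t15\t172\t0\t18\t186\n",
    "chr1\t70381\t0\t90\t255\t37\t0\t200\n",
]
samples, rows = parse_likelihood_tsv(pl_lines)
```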

Metadata file (optional)

A metadata file may be specified with the winpca chromplot or winpca genomeplot subcommands. The file is expected to have a single header line with column names, and the first column must contain the same sample IDs used in the variant file. All information contained in the metadata file will be included as hover display items in HTML output plots. It is therefore advisable to limit metadata fields to relevant information to avoid overloading the hover display boxes (e.g. don't include long file paths). When a metadata column name is specified as the group (-g) with one of the plotting functions, samples are grouped by their value in that column and groups share the same color in the output plot (colors and plotting order can be controlled with -c). A metadata file for the three samples in the example variant files above could look like this:

sample_id   species     country   coverage
ind_1       species_1   Spain     20X
ind_2       species_2   Finland   22X
ind_3       species_1   UK        18X
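The grouping behaviour described above can be pictured with a small Python sketch. It assumes the metadata file is tab-delimited (the delimiter is not stated on this page) and uses a hypothetical helper, `group_samples`; it illustrates the idea of grouping by a column, not WinPCA's plotting code.

```python
import csv
import io
from collections import defaultdict


def group_samples(handle, group_column):
    """Group sample IDs (first column) by their value in `group_column`."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-delimited
    id_column = reader.fieldnames[0]  # first column holds the sample IDs
    groups = defaultdict(list)
    for row in reader:
        groups[row[group_column]].append(row[id_column])
    return dict(groups)


metadata = io.StringIO(
    "sample_id\tspecies\tcountry\tcoverage\n"
    "ind_1\tspecies_1\tSpain\t20X\n"
    "ind_2\tspecies_2\tFinland\t22X\n"
    "ind_3\tspecies_1\tUK\t18X\n"
)
groups = group_samples(metadata, "species")
```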

2.2  |   Output files

winpca generates two types of tab-separated data files per run: window stats and per-sample data files. Additionally, plots in HTML and/or PDF format can be generated with winpca chromplot and winpca genomeplot.

Per-sample data files

Four files contain per-sample data: {prefix}.pc_1.tsv.gz contains per-sample PC 1 values, {prefix}.pc_2.tsv.gz contains PC 2 values, {prefix}.hets.tsv.gz contains the number of heterozygous sites per window, and {prefix}.miss.tsv.gz contains the number of missing variants per window. They all share the same format: rows are genomic positions (the window midpoints) and columns are individual samples.

Example of a per-sample file: {prefix}.pc_1.tsv.gz:

pos       ind_1     ind_2    ind_3   [...]
500000    4.119     -3.349   -1.984
600000    -5.761    2.489    0.737
700000    1.895     -3.235   0.897
800000    1.054     -4.055   0.703
900000    0.922     -4.137   0.669
1000000   2.615     -5.914   -0.122
1100000   8.613     -5.499   -1.315
1200000   -10.686   4.843    1.233
[...]
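For downstream analysis, a per-sample file can be loaded with the Python standard library alone, as sketched below. `read_per_sample_tsv` is a hypothetical helper, and the demo writes a tiny stand-in file (`demo.pc_1.tsv.gz`) to a temporary directory instead of reading real WinPCA output.

```python
import gzip
import os
import tempfile


def read_per_sample_tsv(path):
    """Return (positions, values), where values maps sample ID -> list of floats."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        samples = header[1:]  # first column is the window midpoint ('pos')
        positions = []
        values = {sample: [] for sample in samples}
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            positions.append(int(fields[0]))
            for sample, value in zip(samples, fields[1:]):
                values[sample].append(float(value))
    return positions, values


# Demo with a small stand-in file mirroring the first two example rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pc_1.tsv.gz")
    with gzip.open(path, "wt") as out:
        out.write("pos\tind_1\tind_2\tind_3\n")
        out.write("500000\t4.119\t-3.349\t-1.984\n")
        out.write("600000\t-5.761\t2.489\t0.737\n")
    positions, pc1 = read_per_sample_tsv(path)
```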

Window stats file

One window stats file is created for each run, called {prefix}.stat.tsv.gz. It contains 16 columns: (1) window midpoint/pos, (2) window index, (3) window start position, (4) window stop position, (5) window size, (6) the number of included variants, (7-16) the percentage of variance explained by PC1 to PC10.

Example of a window stats file: {prefix}.stat.tsv.gz:

pos       w_idx   w_start   w_stop    w_size    n_var   pc_1_ve   pc_2_ve   [...]   pc_9_ve   pc_10_ve
500000    0       1         1000000   1000000   27      41.72     20.59     [...]   0.62      0.47
600000    1       100001    1100000   1000000   28      40.72     20.37     [...]   0.61      0.5
700000    2       200001    1200000   1000000   29      36.27     24.52     [...]   0.62      0.51
800000    3       300001    1300000   1000000   28      38.37     23.67     [...]   0.58      0.47
900000    4       400001    1400000   1000000   35      50.78     18.24     [...]   0.46      0.38
1000000   5       500001    1500000   1000000   39      52.84     23.57     [...]   0.29      0.22
1100000   6       600001    1600000   1000000   52      58.44     29.16     [...]   0.18      0.13
1200000   7       700001    1700000   1000000   50      64.61     27.41     [...]   0.17      0.1
[...]
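The coordinate columns in the example are related by simple arithmetic, which the Python sketch below reproduces. The window increment of 100,000 bp and the floor rounding of the midpoint are inferred from the example rows and are assumptions, not documented WinPCA behaviour; `window_bounds` is a hypothetical helper.

```python
def window_bounds(w_idx, w_size, increment):
    """Return (pos, w_start, w_stop) for a 1-based, inclusive sliding window."""
    w_start = w_idx * increment + 1
    w_stop = w_start + w_size - 1
    pos = (w_start + w_stop) // 2  # midpoint, rounded down (assumption)
    return pos, w_start, w_stop


# Reproduce the coordinate columns of the first and last example rows.
first = window_bounds(0, 1_000_000, 100_000)
last = window_bounds(7, 1_000_000, 100_000)
```

With these assumptions, `first` is `(500000, 1, 1000000)` and `last` is `(1200000, 700001, 1700000)`, matching the pos, w_start and w_stop values in the example rows.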

Output plots

Below are examples of chromosome (winpca chromplot) and genome-wide (winpca genomeplot) output plots. Supported export formats are HTML (interactive), PDF, SVG and PNG. Check out the tutorial to learn how to make them from scratch.

PC1 of 256 Anopheles gambiae along chromosome 2L:

SNP heterozygosity of the same chromosome:

PC1 values across all five chromosomes: