2 | File format specifications - MoritzBlumer/winpca GitHub Wiki
2.1 | Input files
VCF
Please adhere to the latest official VCF specification. The last header line is used to infer sample IDs. Fields 1 and 2 are expected to be chromosome and position. The genotype field is expected to contain GT-formatted genotypes for winpca pca, or GL- or PL-formatted genotype likelihoods for winpca pcangsd. VCF files may be gzip/bgzip compressed. For orientation on how to generate a VCF, consider using our Snakemake workflow.
Example of a VCF file with GT and PL fields:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=33145951>
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind_1 ind_2 ind_3 [...]
chr1 69964 . G T 1 PASS . GT:PL 0/0:0,15,172 0/0:0,18,186 0/0:0,51,255
chr1 70381 . G T 1 PASS . GT:PL 0/0:0,90,255 0/1:37,0,200 0/1:48,0,166
chr1 74191 . C T 1 PASS . GT:PL 0/0:0,9,80 0/0:0,27,224 0/0:0,45,255
chr1 83728 . T C 1 PASS . GT:PL 0/0:0,6,58 0/0:0,9,95 0/0:0,33,255
chr1 86817 . C T 1 PASS . GT:PL 0/0:0,27,245 0/0:0,24,241 0/0:0,48,255
chr1 87124 . T A 1 PASS . GT:PL 0/0:0,27,240 0/0:0,36,255 0/0:0,42,255
chr1 87362 . G A 1 PASS . GT:PL 0/0:0,18,166 0/0:0,27,255 0/0:0,27,198
chr1 87602 . T G 1 PASS . GT:PL 0/0:0,18,174 0/0:0,39,255 0/0:0,75,255
[...]
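As a rough illustration of how sample IDs can be read from the last header line, the following sketch scans header lines until it reaches the one starting with #CHROM and returns everything after the nine fixed VCF columns. The function name is hypothetical and this is not WinPCA's own parser; it assumes tab-delimited input, as required by the VCF specification.

```python
def sample_ids_from_vcf_header(lines):
    """Return the sample IDs found on the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 1-9 are fixed (CHROM ... FORMAT); sample IDs follow.
            return fields[9:]
        if not line.startswith("#"):
            break  # reached data records without seeing the header line
    raise ValueError("no #CHROM header line found")


header_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_1\tind_2\tind_3\n",
]
print(sample_ids_from_vcf_header(header_lines))  # ['ind_1', 'ind_2', 'ind_3']
```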
BEAGLE
To generate a compatible BEAGLE file, follow the ANGSD instructions. The first field (marker column) must be formatted accordingly ({chromosome}_{position}), because WinPCA uses {position} to infer chromosomal coordinates. Fields 2 and 3 are ignored and could be filled with dummy data. The remaining fields are genotype likelihoods (GL), three columns per sample. BEAGLE files may be gzip/bgzip compressed.
Example of a BEAGLE file:
marker allele1 allele2 ind_1 ind_1 ind_1 ind_2 ind_2 ind_2 ind_3 ind_3 ind_3 [...]
chr1_69964 0 3 0 15 172 0 18 186 0 51 255
chr1_70381 2 3 0 90 255 37 0 200 48 0 166
chr1_74191 2 3 0 9 80 0 27 224 0 45 255
chr1_83728 2 3 0 6 58 0 9 95 0 33 255
chr1_86817 1 2 0 27 245 0 24 241 0 48 255
chr1_87124 2 0 0 27 240 0 36 255 0 42 255
chr1_87362 1 2 0 18 166 0 27 255 0 27 198
chr1_87602 0 3 0 18 174 0 39 255 0 75 255
[...]
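The marker convention can be illustrated with a small sketch that splits a {chromosome}_{position} string. The function name is hypothetical, and splitting on the last underscore (so that chromosome names containing underscores stay intact) is an assumption made for this example, not a documented WinPCA behavior.

```python
def split_marker(marker):
    """Split a '{chromosome}_{position}' marker into (chromosome, position)."""
    # Assumption: the position is whatever follows the LAST underscore, so
    # names such as 'scaffold_12' are preserved as the chromosome part.
    chrom, sep, pos = marker.rpartition("_")
    if not sep or not chrom or not pos.isdigit():
        raise ValueError(f"marker is not in chromosome_position form: {marker!r}")
    return chrom, int(pos)


print(split_marker("chr1_69964"))  # ('chr1', 69964)
```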
Custom TSV input file formats
In addition to VCF/BEAGLE, a custom TSV format is accepted for GT, GL and PL data. A single header line containing CHROM, POS and sample names is expected. Accordingly, fields 1 and 2 are expected to contain chromosome and position, and all remaining fields contain either biallelic genotypes (one column per sample, encoded as 0=ref, 1=het, 2=alt, -1=missing) or three GL or PL genotype likelihood values per sample. TSV files may be gzip/bgzip compressed.
Example of a GT TSV file:
CHROM POS ind_1 ind_2 ind_3 [...]
chr1 69964 0 0 0
chr1 70381 0 1 1
chr1 74191 0 0 0
chr1 83728 0 0 0
chr1 86817 0 0 0
chr1 87124 0 0 0
chr1 87362 0 0 0
chr1 87602 0 0 0
[...]
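To show how the GT encoding relates to VCF genotypes, here is a minimal sketch that maps a diploid, biallelic GT string to the 0/1/2/-1 codes described above. The function name is hypothetical, and treating any genotype with a missing allele as -1 is an assumption of this example.

```python
def encode_gt(gt):
    """Encode a VCF GT string as 0=ref, 1=het, 2=alt, -1=missing."""
    alleles = gt.replace("|", "/").split("/")  # accept phased and unphased
    if "." in alleles:
        return -1  # assumption: partially missing calls count as missing
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"not a diploid biallelic genotype: {gt!r}")
    return int(alleles[0]) + int(alleles[1])


# Sample fields of the chr1:70381 record from the VCF example above.
# The VCF specification requires GT to be the first FORMAT key when present.
sample_fields = ["0/0:0,90,255", "0/1:37,0,200", "0/1:48,0,166"]
print([encode_gt(field.split(":")[0]) for field in sample_fields])  # [0, 1, 1]
```

The printed row matches the chr1 70381 line of the GT TSV example.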
Example of a PL TSV file:
CHROM POS ind_1 ind_1 ind_1 ind_2 ind_2 ind_2 ind_3 ind_3 ind_3 [...]
chr1 69964 0 15 172 0 18 186 0 51 255
chr1 70381 0 90 255 37 0 200 48 0 166
chr1 74191 0 9 80 0 27 224 0 45 255
chr1 83728 0 6 58 0 9 95 0 33 255
chr1 86817 0 27 245 0 24 241 0 48 255
chr1 87124 0 27 240 0 36 255 0 42 255
chr1 87362 0 18 166 0 27 255 0 27 198
chr1 87602 0 18 174 0 39 255 0 75 255
[...]
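For readers unsure how the two likelihood encodings relate: in the VCF specification, PL values are Phred-scaled (-10 x log10 of the likelihood, rounded) and GL values are log10-scaled. The sketch below converts between them under that definition. The function names are hypothetical, and how WinPCA handles these values internally is not described here.

```python
def pl_to_gl(pl_values):
    """Convert Phred-scaled likelihoods (PL) to log10-scaled likelihoods (GL)."""
    return [-pl / 10 for pl in pl_values]


def pl_to_probabilities(pl_values):
    """Convert PL values to linear-scale likelihoods normalized to sum to 1."""
    likelihoods = [10 ** (-pl / 10) for pl in pl_values]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# The three PL values of ind_1 at chr1:69964 from the example above.
print(pl_to_gl([0, 15, 172]))
print(pl_to_probabilities([0, 15, 172]))
```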
Metadata file (optional)
A metadata file may be specified with the winpca chromplot or winpca genomeplot subcommands. The file is expected to have a single header line with column names, and the first column must contain the same sample IDs used in the variant file. All information contained in the metadata file is included as hover display items in HTML output plots, so it is advisable to limit metadata fields to relevant information and avoid overloading the hover boxes (e.g. don't include long file paths). When a metadata column name is specified as the group (-g) with one of the plotting subcommands, samples are grouped by their value in that column, and each group shares one color in the output plot (colors and plotting order can be controlled with -c).
A metadata file for the three samples in the example variant files above could look like this:
sample_id species country coverage
ind_1 species_1 Spain 20X
ind_2 species_2 Finland 22X
ind_3 species_1 UK 18X
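Before plotting, it can be useful to check that every sample in the variant file has a metadata row and to preview how a grouping column would partition the samples. The following is a minimal sketch using only the standard library. All function names are hypothetical, and it assumes the metadata file is tab-separated.

```python
import csv
import io


def read_metadata(text):
    """Parse metadata text into (column names, {sample_id: row dict})."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_column = reader.fieldnames[0]  # first column holds the sample IDs
    rows = {row[id_column]: row for row in reader}
    return reader.fieldnames, rows


def missing_samples(rows, variant_sample_ids):
    """Return variant-file sample IDs that have no metadata row."""
    return [s for s in variant_sample_ids if s not in rows]


def group_samples(rows, column):
    """Group sample IDs by their value in the given metadata column."""
    groups = {}
    for sample_id, row in rows.items():
        groups.setdefault(row[column], []).append(sample_id)
    return groups


METADATA = (
    "sample_id\tspecies\tcountry\tcoverage\n"
    "ind_1\tspecies_1\tSpain\t20X\n"
    "ind_2\tspecies_2\tFinland\t22X\n"
    "ind_3\tspecies_1\tUK\t18X\n"
)
columns, rows = read_metadata(METADATA)
print(missing_samples(rows, ["ind_1", "ind_2", "ind_3"]))  # []
print(group_samples(rows, "species"))
```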
2.2 | Output files
winpca generates two types of tab-separated data files per run: window stats and per-sample data files. Additionally, plots in HTML and/or PDF format can be generated with winpca chromplot and winpca genomeplot.
Per-sample data files
Four files contain per-sample data: {prefix}.pc_1.tsv.gz contains per-sample PC 1 values, {prefix}.pc_2.tsv.gz contains PC 2 values, {prefix}.hets.tsv.gz contains the number of heterozygous sites per window, and {prefix}.miss.tsv.gz contains the number of missing variants per window. They all share the same format: rows are genomic positions (the midpoint of each window) and columns are individual samples.
Example of a per-sample file, {prefix}.pc_1.tsv.gz:
pos ind_1 ind_2 ind_3 [...]
500000 4.119 -3.349 -1.984
600000 -5.761 2.489 0.737
700000 1.895 -3.235 0.897
800000 1.054 -4.055 0.703
900000 0.922 -4.137 0.669
1000000 2.615 -5.914 -0.122
1100000 8.613 -5.499 -1.315
1200000 -10.686 4.843 1.233
[...]
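Because all four per-sample files share this layout, one small reader covers them. The sketch below uses only the standard library. The function name is hypothetical, and mapping empty fields to NaN is an assumption of this example rather than a documented property of the output.

```python
import csv
import gzip


def read_per_sample_tsv(path):
    """Read a per-sample file into (positions, {sample: [values]})."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # first column is the window midpoint (pos)
        positions = []
        values = {sample: [] for sample in samples}
        for row in reader:
            if not row:
                continue  # skip blank lines
            positions.append(int(row[0]))
            for sample, value in zip(samples, row[1:]):
                # Assumption: an empty field means no value for that window.
                values[sample].append(float(value) if value else float("nan"))
    return positions, values
```

Calling read_per_sample_tsv on a {prefix}.pc_1.tsv.gz file would return the window midpoints and, for each sample, its PC 1 values in window order.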
Window stats file
One window stats file is created for each run, called {prefix}.stat.tsv.gz. It contains 16 columns: (1) window midpoint (pos), (2) window index, (3) window start position, (4) window stop position, (5) window size, (6) the number of included variants, (7-16) the percentage of variance explained by PC1 to PC10.
Example of a window stats file, {prefix}.stat.tsv.gz:
pos w_idx w_start w_stop w_size n_var pc_1_ve pc_2_ve [...] pc_9_ve pc_10_ve
500000 0 1 1000000 1000000 27 41.72 20.59 [...] 0.62 0.47
600000 1 100001 1100000 1000000 28 40.72 20.37 [...] 0.61 0.5
700000 2 200001 1200000 1000000 29 36.27 24.52 [...] 0.62 0.51
800000 3 300001 1300000 1000000 28 38.37 23.67 [...] 0.58 0.47
900000 4 400001 1400000 1000000 35 50.78 18.24 [...] 0.46 0.38
1000000 5 500001 1500000 1000000 39 52.84 23.57 [...] 0.29 0.22
1100000 6 600001 1600000 1000000 52 58.44 29.16 [...] 0.18 0.13
1200000 7 700001 1700000 1000000 50 64.61 27.41 [...] 0.17 0.1
[...]
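One possible use of the window stats file is to flag windows supported by few variants. The sketch below reads the file by column name, using the pos and n_var names from the example header. The function name and the threshold are hypothetical choices for illustration.

```python
import csv
import gzip


def windows_with_few_variants(path, min_var):
    """Return midpoints of windows containing fewer than min_var variants."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Columns are looked up by name, so column order does not matter.
        return [int(row["pos"]) for row in reader if int(row["n_var"]) < min_var]
```

For the rows shown in the example above, a call with min_var=29 would return the midpoints 500000, 600000 and 800000.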
Output plots
Below are examples of chromosome (winpca chromplot) and genome-wide (winpca genomeplot) output plots. Supported export formats are HTML (interactive), PDF, SVG and PNG. Check out the tutorial to learn how to make them from scratch.
PC1 of 256 Anopheles gambiae along chromosome 2L:
SNP heterozygosity of the same chromosome:
PC1 values across all five chromosomes: