3 | Tutorial - MoritzBlumer/winpca GitHub Wiki
Below is an example of a typical WinPCA run. We will be using a simplified and subsampled VCF of 256 Anopheles gambiae individuals, all data is derived from the publicly available phase 2 dataset of the Ag1000G project.
To get started, download the tutorial files.
wget https://github.com/MoritzBlumer/winpca_tutorial/releases/download/v1.0.0/winpca_tutorial.zip
unzip winpca_tutorial.zip -d winpca_tutorial && rm winpca_tutorial.zip
cd winpca_tutorial
There should be three files (type ls -1
): the variant file (VCF), a metadata file and a file containing chromosome names and sizes:
anopheles.chrom_sizes.tsv
anopheles.metadata.tsv
anopheles.vcf.gz
Lets inspect the VCF file first by typing zcat anopheles.vcf.gz | less -S
:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##contig=<ID=2L,length=49364325>
##contig=<ID=2R,length=61545105>
##contig=<ID=3L,length=41963435>
##contig=<ID=3R,length=53200684>
##contig=<ID=X,length=24393108>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0085-C AB0103-C AB0104-C AB0108-C
2L 160999 . A T . PASS . GT:PL 1/1:1126,87,0 1/1:1562,117,0 1/1:1705,132,0 1/1:621,48,0
2L 163050 . T G . PASS . GT:PL 1/1:1569,120,0 1/1:1386,108,0 1/1:1410,111,0 1/1:317,27,0
2L 165717 . A G . PASS . GT:PL 0/1:502,0,393 1/1:1124,99,0 1/1:1243,111,0 1/1:418,36,0
2L 166624 . T A . PASS . GT:PL 1/1:1272,99,0 1/1:1851,144,0 1/1:1487,120,0 1/1:462,39,0
2L 178181 . C T . PASS . GT:PL 1/1:703,54,0 1/1:1523,114,0 1/1:1640,123,0 1/1:358,30,0
2L 181138 . T C . PASS . GT:PL 1/1:1214,96,0 1/1:1869,144,0 1/1:1337,102,0 1/1:489,39,0
2L 182301 . A T . PASS . GT:PL 1/1:1129,96,0 1/1:1459,120,0 1/1:712,27,0 1/1:381,30,0
2L 185562 . C A . PASS . GT:PL 1/1:1011,84,0 1/1:718,69,0 1/1:1183,105,0 1/1:244,24,0
2L 186055 . C A . PASS . GT:PL 1/1:775,63,0 1/1:1316,108,0 1/1:1195,93,0 1/1:441,36,0
2L 186818 . C T . PASS . GT:PL 1/1:1314,105,0 1/1:1238,93,0 1/1:1378,105,0 1/1:414,33,0
[...]
Have a look at the VCF format specifications. From the header, we can learn about the reference chromosomes and their lengths contained in this file and about the variant encoding. Both genotype calls (GT) and phred-scaled genotype likelihoods (PL) are included. We will use hard-called genotypes in this tutorial, but will briefly describe how PLs could be used instead. The first ten data lines are shown above (and for space reasons only four genotype fields are displayed here). Only PASS
sites that have passed all quality filters are present. The format field again tells us that for each variant both GT and PL are included. For example, at the first included position (2L:160999
) sample AB0085-C
has a homozygous call (1/1
) for the alternative allele T
. The phred-scaled corresponding genotype likelihoods are 1126,87,0
.
The winpca pca
subcommand serves to compute principal components in windows along the genome. It has to be executed separately per chromosome (or scaffold/contig). Let's first have a look at the help text by typing winpca pca -h
:
usage: winpca pca [-h] [-s] [-w] [-i] [-m] [-p] [-g] [-v] [-t]
[--np] [--x]
<PREFIX> <VARIANT_FILE> <REGION>
positional arguments:
<PREFIX> prefix for this WinPCA analysis
<VARIANT_FILE> path to variant file: VCF(.gz), TSV(.gz) or BEAGLE(.gz)
<REGION> genomic region in format "chrom:start-end"
options:
-h, --help show this help message and exit
-s, --samples comma-separated list of samples OR file with one
sample ID per line
-w, --window_size
window size in base pairs [1000000]
-i, --increment step size in base pairs [10000]
-m, --min_maf minor allele frequency threshold [0.01]
-p, --polarize sign polarization strategy: auto, guide_samples or
skip [auto]
-g, --guide_samples
one or more (comma-separated) samples to guide PC
polarization (used only if "guide_samples" is selected for
-p/--polarize)
-v, --var_format variant format: GT, GL or PL (GL/PL invoke PCAngsd)
[GT]
-t, --threads number of threads (multi-threading is only
available with PCAngsd, i.e. GL/PL) [2]
--np disable VCF PASS filter (overrides modules/config.py)
--x use SNP count (not bp) for window size (-w) and increment
(-i); automatically activates mean imputation and disables
MAF filter
We must specify at least a prefix for the output files, the path to the variant file and the region. Additionally, we might want to include only a subset of samples that are in the variant file (-s
), adjust the window size or step size (-w
and -i
), change the default minor allele frequency threshold (-m
) or polarize using a guide sample (-p
+ -g
). Let's for now stick with the default parameters and specify variant file, region (even if we want to include the whole chromosome) and a prefix for the output files (let's just use the chromosome name).
For chromosome 2L of size 49,364,325 bp, this can be achieved by the following prompt:
winpca pca 2L anopheles.vcf.gz 2L:1-49364325
This will generate four additional files (ls -1
); Have a look at the output file format specifications where they are explained in detail.
2L.hetp.tsv.gz
2L.pc_1.tsv.gz
2L.pc_2.tsv.gz
2L.stat.tsv.gz
anopheles.chrom_sizes.tsv
anopheles.metadata.tsv
anopheles.vcf.gz
Next, we will have a look at the output by generating a chromosome-wide plot of PC1 values.Let's start with looking at the help menu for chromosome plots (winpca chromplot -h
):
usage: winpca chromplot [-h] [-p] [-m] [-g] [-c] [-i] [-f] [--n] [--r]
<PREFIX> <REGION>
positional arguments:
<PREFIX> prefix for this WinPCA analysis
<REGION> genomic region in format "chrom:start-end"
options:
-h, --help show this help message and exit
-p, --plot_var specify what to plot, e.g. "1" for PC 1 or "het" for
SNP heterozygosity (default: first PC) [1]
-m, --metadata path to metadata TSV; first column must contain
sample IDs, all other columns are used to annotate data in
HTML plot
-g, --groups metadata column for color-grouping (requires
-m/--metadata)
-c, --colors HEX codes (drop "#"); check documentation for
formatting instructions (requires -g/--groups)
-i, --interval plot every nth window (e.g. "5" --> every 5th window)
[5]
-f, --format output plot format: HTML, PDF, SVG or PNG (can be a
comma-separated list)
--n, --numeric set flag to apply a continuous color scale (requires
numerical data for -g/--groups)
--r, --reverse set flag to reverse the plotting order (requires -g/--groups)
To color individuals by population in the output plot, we will use the supplied metadata file, let's first check it out by typing head anopheles.metadata.tsv
:
sample_id population country location sex LATN LONE coverage
AB0103-C BFgam Burkina Faso Bana F 11.233 -4.472 31.85
AB0104-C BFgam Burkina Faso Bana F 11.233 -4.472 33.23
AB0211-C BFgam Burkina Faso Bana F 11.233 -4.472 35.06
AB0117-C BFgam Burkina Faso Bana M 11.233 -4.472 24.3
AB0118-C BFgam Burkina Faso Bana M 11.233 -4.472 34.57
AB0119-C BFgam Burkina Faso Bana M 11.233 -4.472 26.69
AB0126-C BFgam Burkina Faso Bana M 11.233 -4.472 22.76
AB0127-C BFgam Burkina Faso Bana M 11.233 -4.472 30.3
AB0128-C BFgam Burkina Faso Bana M 11.233 -4.472 30.44
We will provide the metadata file through -m
and specify the 'population' column, for -g
. Since this is for initial exploration of the results, we plot only every 10th window to reduce compute time and file size. Per default, PC1 is plotted and and the output format is HTML (this can be changed with options -p
and -f
). Let's create a plot for chromosome 2L, specifying the PREFIX
'2L' that we used in the winpca pca
run to help the chromplot
module locate the input files. We furthermore specify the same REGION
again.
winpca chromplot 2L 2L:1-49364325 -m anopheles.metadata.tsv -g population -i 10
After successful completion, an HTML file should have been created (2L.pc_1.html
). Let's open it in a web browser and have a look:
Each individual is represented by a line, and colors correspond with the four populations. The small panel on top shows the percentage of variance explained by each individual. We can see what looks like a polymorphic inversion at ~ 20-43 Mbp. This region also corresponds with an increased percentage of explained variance. Individuals homozygous for the inverted or non-inverted configuration sorting into the top and bottom bands, and heterozygous individuals forming the intermediate band. Try zooming in and out to inspect areas of interest or hide and show data groups by clicking on them in the legend.
Let's have a look at the rate of heterozygosity along the genome next, which has already been calculated. We only need to adjusting the plotting prompt by specifying 'het' for the -p
argument:
winpca chromplot 2L 2L:1-49364325 -m anopheles.metadata.tsv -g population -i 10 -p het

This time, we can see two bands in the inversion region – the individuals heterozygous for the two inversion haplotypes (i.e. the ones from the central band in the PC1 plot) are forming the top band. Since they combine two divergent haplotypes in the inversion region they carry more heterozygous sites! You can spot check this for a few samples IDs using the hover display in both plots.
Let's now move on and run windowed PCAs and generate plots for all five chromosomes. We can do this in a loop by parsing chromosome names and sizes from the provided anopheles.chrom_sizes.tsv
file. This time we will plot every 5th window (-i
) for a slightly higher resolution. This will overwrite our previous 2L.pc_1.html
file.
while IFS= read -r LINE ; do
CHROM=$(echo "$LINE" | cut -f1)
SIZE=$(echo "$LINE" | cut -f2)
winpca pca ${CHROM} anopheles.vcf.gz ${CHROM}:1-${SIZE}
winpca chromplot ${CHROM} ${CHROM}:1-${SIZE} -m anopheles.metadata.tsv -g population -i 5
done < <(cat anopheles.chrom_sizes.tsv)
Once all runs have completed without error, make sure that all expected output files are present (ls -1
).
While browsing the new output files, you may realize that the auto-polarization wasn't 100% effective, and a few regions, e.g. 44450000-59900000 and the entire chromosome 3L should be flipped. When browsing at higher resolution, a few smaller regions will come up. These can be fixed using the winpca flip
command. Let's look at the help menu first (-h
):
usage: winpca flip [-h] [--r] [-w] [-c] <PREFIX>
positional arguments:
<PREFIX> prefix for this WinPCA analysis
options:
-h, --help show this help message and exit
--r, --reflect set flag to reflect the entire chromosome, i.e. to flip all windows (--r/--reflect is
applied independently from -w/--windows and they can be used together
-w, --windows comma-separated list of positions (e.g. 100000) and/or regions (e.g.
100000-250000) to flip OR file with one position/region per line
-c, --components principal components to flip (default: first PC) [1]
Again, we must provide the PREFIX
to identify different runs, then we can use -w
and --r
to flip specific windows or entire chromosomes.
winpca flip 2R -w 44450000-59900000
winpca flip 3R --r
winpca flip X -w 1400000,4150000-4300000,5600000-8150000,20400000-21100000,21250000-23800000
Having fixed the polarization, let's finally generate a PC1 plot of all chromosomes. This time, we manually specify the plotting order and plot color for the different populations. We need to supply a RUN_PREFIX
that is shared by all previous runs that we want to include. In our case, the run prefixes are the chromosome names and we can just specify './' to indicate that no shared prefix is used. and pass the a comma-separated list of chromosome names as the RUN_IDs
.
We furthermore specify HEX codes (without the '#') for all populations (-c
) whereby the sequence of color specifications also determines plotting order of the groups. Finally, we also export a PDF version of the plot (-f
).
# set custom colors and plotting order
COLORS='UGgam:5698f5,BFgam:fc8d05,GNgam:5db064,GHgam:c7342a'
# genomeplot
CHROMS='2L,2R,3L,3R,X'
winpca genomeplot ./ $CHROMS -g population -m anopheles.metadata.tsv -i 50 -f PDF,HTML -c $COLORS
The genome-wide plots should now be present as genomeplot.pc_1.html
and genomeplot.pc_1.pdf
:
