1 | Usage - MoritzBlumer/winpca GitHub Wiki

1.1  |  Overview

WinPCA is structured into different methods that can be sequentially executed to run windowed PC analyses, process results and to create interactive plots. Methods are invoked by typing winpca {method}. A manual page for each method is available via winpca {method} -h.

winpca pca performs the windowed PC analyses on genotype or genotype likelihood data. When processing genotype likelihoods, core functions from PCAngsd are employed; otherwise scikit-allel is used for principal component analysis. winpca polarize attempts to harmonize the signs of the windows' PCs along a chromosome (by default, it is invoked by winpca pca, but it can also be applied independently). winpca flip can reflect the PC sign for an entire chromosome or for specific windows, and is useful to refine polarization manually. winpca chromplot takes existing results from a winpca run and (with plotly) creates interactive plots of principal components for a single chromosome, which can optionally be annotated with metadata (e.g. sampling location, sequencing depth, ...). winpca genomeplot similarly generates a compound plot for multiple input chromosomes.

See the sections below for a detailed explanation of the command line arguments for each method. The final section describes the WinPCA config file, which allows for some additional adjustments that can't be controlled directly through the command line.


[Figure: WinPCA flowchart]

Overview of WinPCA modules and workflows. The standard workflow for a single chromosome is indicated with solid black arrows. Optional operations (repolarization, flipping entire chromosomes or specific windows, supplying metadata for plotting) are shown as dotted black arrows. Separate runs for additional chromosomes as input for the ‘genomeplot’ module are indicated by solid gray arrows. Output formats other than HTML (e.g. PDF, PNG) may be specified by the user.



1.2  |   winpca pca

Perform windowed PCA on called genotypes (GT) or on genotype likelihoods (GL/PL).

positional arguments

PREFIX
       Prefix for all output files generated by pca. Can be a path including '/'.

VARIANT FILE
       Relative or absolute path to an optionally gzipped (.gz) input file containing variants. Variants may be hard-called genotypes (GT), genotype likelihoods (GL) or phred-scaled genotype likelihoods (PL). GT, GL or PL are supported for VCF or TSV input files, while BEAGLE files are expected to contain GL formatted likelihoods. Please refer to File Format Specifications for more details.

REGION
       Genomic region. Region must be specified in the format chrom:start-end, 'chrom' being the sequence ID in the variant file (e.g. the CHROM column in a VCF), 'start' the start position and 'end' the end position. To perform a WinPCA analysis for an entire chr1 of size 12,562,894 bp, specify 'chr1:1-12562894'. Alternatively, only part of a chromosome can be specified.
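The region format described above can be illustrated with a small parser. This is a minimal sketch, not WinPCA's own code: the function name parse_region and the validation rules (1-based start, end not smaller than start) are assumptions made for illustration.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end).

    Hypothetical helper; WinPCA's internal parsing may differ.
    """
    # rpartition splits at the last ':', so a ':' inside the sequence ID is tolerated
    chrom, _, span = region.rpartition(":")
    start_str, _, end_str = span.partition("-")
    if not chrom or not start_str or not end_str:
        raise ValueError(f"expected chrom:start-end, got {region!r}")
    start, end = int(start_str), int(end_str)
    # Assumption: coordinates are 1-based and the region is not empty
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return chrom, start, end


# The whole-chromosome example from the text
print(parse_region("chr1:1-12562894"))
```

For the example in the text, the helper returns the tuple ('chr1', 1, 12562894).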

optional arguments

-s/--samples
       Sample IDs to include in the analysis. IDs must be present in the specified variant file. Accepts either a comma-separated list (without white spaces) or a file with one sample ID per line. Provided sample IDs must be unique (no duplicate IDs).

-w/--window_size [ 1000000 ]
       Sliding window size in base pairs (bp).

-i/--increment [ 10000 ]
       Step size in base pairs (bp).

-m/--min_maf [ 0.01 ]
       Minor allele frequency threshold.

-p/--polarize [ auto ]
       Sign polarization strategy to be applied across all windows ('auto', 'guide_samples' or 'skip').

-g/--guide_samples
       One or more (-> comma-separated list) sample IDs to be used to guide PC sign polarization (applies only if 'guide_samples' is selected for -p/--polarize). The same sample can be specified multiple times, which gives it more weight.

-v/--var_format [ GT ]
       Variant format ('GT', 'GL' or 'PL'). GL/PL invoke PCAngsd for principal component analysis.

-t/--threads [ 2 ]
       Number of threads. Multithreading is only used by PCAngsd, i.e. when providing GL/PL variants.

--np
       "no pass filter": set this flag to disable the VCF PASS filter (overrides VCF_PASS_FILTER in modules/config.py).

--x
       Use SNP count (not bp) for window size (-w) and increment (-i). This automatically activates mean imputation and disables the MAF filter. Effectively this means that window size is variable but the number of SNPs is constant.
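To make the difference between bp-based windows (-w, -i) and SNP-count windows (--x) concrete, the following sketch enumerates both kinds. The function names and the exact boundary conventions (1-based inclusive coordinates, only full windows emitted) are assumptions for illustration and may not match how WinPCA places its windows.

```python
def bp_windows(start, end, window_size, increment):
    """Yield (window_start, window_end) for full windows of fixed bp size."""
    w_start = start
    # Assumption: only windows that fit entirely inside start..end are emitted
    while w_start + window_size - 1 <= end:
        yield w_start, w_start + window_size - 1
        w_start += increment


def snp_count_windows(positions, n_snps, step):
    """Yield (first_pos, last_pos) for windows holding a fixed number of SNPs."""
    for i in range(0, len(positions) - n_snps + 1, step):
        chunk = positions[i:i + n_snps]
        yield chunk[0], chunk[-1]


# bp mode: fixed physical size, variable SNP count
windows = list(bp_windows(1, 12_562_894, 1_000_000, 10_000))
print(len(windows), windows[0], windows[-1])

# SNP-count mode: fixed SNP count, variable physical size (made-up positions)
snps = [5, 40, 90, 300, 310, 900]
print(list(snp_count_windows(snps, n_snps=3, step=1)))
```

In the second call each window contains exactly three SNPs, but the spanned distance varies from window to window, which is the behaviour described for --x.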



1.3  |   winpca polarize

(Re)-polarize windowed PC data from a previous run. Overwrites input data.

positional arguments

PREFIX
       Prefix used for this run in winpca pca command.

optional arguments

-c/--principal_component
       Specify which principal component(s) to re-polarize. The default is to polarize all PCs specified in PCS (modules/config.py).

-p/--polarize [ auto ]
       Sign polarization strategy to be applied across all windows ('auto' or 'guide_samples').

-g/--guide_samples
       One or more (-> comma-separated list) sample IDs to be used to guide PC sign polarization (applies only if 'guide_samples' is selected for -p/--polarize). The same sample can be specified multiple times, which gives it more weight.
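The idea behind guide samples, and why listing a sample more than once increases its weight, can be sketched as follows. This is an illustrative assumption about how such a rule could work (flip a window whenever the summed PC value of the guide samples is negative); it is not taken from the WinPCA source, and polarize_by_guides is a hypothetical name.

```python
def polarize_by_guides(window_values, guide_samples):
    """Return the window's PC values, sign-flipped if the guides sum to < 0.

    window_values: dict mapping sample ID -> PC value for one window.
    guide_samples: list of sample IDs; repeated IDs are counted repeatedly.
    """
    # A sample listed twice contributes twice to the sum, i.e. has double weight
    guide_sum = sum(window_values[s] for s in guide_samples)
    sign = -1.0 if guide_sum < 0 else 1.0
    return {sample: value * sign for sample, value in window_values.items()}


window = {"A": -0.4, "B": 0.3, "C": 0.2}

# A outweighs B, the guide sum is negative, so the window is flipped
print(polarize_by_guides(window, ["A", "B"]))

# Listing B twice tips the balance, so the window is left as it is
print(polarize_by_guides(window, ["A", "B", "B"]))
```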

[Figure: auto polarization]

Adaptive autopolarization. Demonstration of the WinPCA default polarization method applied to chromosome 2L of the Anopheles dataset used in the tutorial.
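A generic way to harmonize signs along a chromosome is to compare each window with a handful of preceding, already polarized windows. The sketch below does this with a dot product against the per-sample mean of the previous windows. It is an assumed, simplified scheme for illustration only and should not be read as WinPCA's actual adaptive method; the parameter n_prev merely mirrors the role of N_PREV_WINDOWS in modules/config.py, and windows with missing data are not handled.

```python
def dot(a, b):
    """Dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def auto_polarize(windows, n_prev=5):
    """Harmonize PC signs across windows (illustrative scheme).

    windows: list of per-window PC value lists, same sample order in each.
    """
    out = []
    for values in windows:
        prev = out[-n_prev:]  # up to n_prev already polarized windows
        if prev:
            # Reference: per-sample mean over the previous windows
            ref = [sum(col) / len(prev) for col in zip(*prev)]
            # A negative dot product means the window points the "other way"
            if dot(values, ref) < 0:
                values = [-v for v in values]
        out.append(list(values))
    return out


raw = [
    [0.5, -0.5, 0.1],
    [-0.4, 0.6, -0.1],   # same structure as the first window, sign inverted
    [0.45, -0.55, 0.05],
]
for row in auto_polarize(raw):
    print(row)
```

In this toy input only the second window is flipped, after which all three windows share the same orientation.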



1.4  |   winpca flip

Flip/reflect windowed PC data from a previous run (multiply values by -1). Overwrites input data.

positional arguments

PREFIX
       Prefix used for this run in winpca pca command.

optional arguments

--r/--reflect
       Set flag to reflect the entire chromosome, i.e. flip all windows. --r/--reflect is applied independently from -w/--windows and both can be combined.

-w/--windows
       Comma-separated list of individual windows or regions to be flipped (e.g. 100000,150000,350000-550000). Alternatively, a file with one position/region per line can be supplied.

-c/--principal_component
       Specify which PC to flip. The default is to flip the first PC specified in PCS (modules/config.py).
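The window list format for -w/--windows and the interplay with --r/--reflect can be sketched as below. The helper names, the use of a single position per window as its key, and the reading that a window selected by both options is flipped twice (and therefore ends up unchanged) are assumptions made for illustration.

```python
def parse_windows(spec):
    """Turn '100000,150000,350000-550000' into a list of (start, end) tuples."""
    regions = []
    for item in spec.split(","):
        start, sep, end = item.partition("-")
        # A single position is treated as a region of length one
        regions.append((int(start), int(end) if sep else int(start)))
    return regions


def flip(pc_by_window, spec=None, reflect=False):
    """Multiply PC values by -1 for selected windows.

    pc_by_window: dict mapping a window position -> list of PC values.
    """
    regions = parse_windows(spec) if spec else []
    flipped = {}
    for pos, values in pc_by_window.items():
        in_list = any(start <= pos <= end for start, end in regions)
        # Assumption: reflect and the window list act independently,
        # so a window hit by both is flipped twice
        n_flips = int(reflect) + int(in_list)
        sign = -1 if n_flips % 2 else 1
        flipped[pos] = [sign * v for v in values]
    return flipped


data = {100000: [0.2, -0.1], 200000: [0.3, 0.1], 400000: [-0.5, 0.4]}
print(flip(data, "100000,350000-550000"))
print(flip(data, "100000,350000-550000", reflect=True))
```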



1.5  |   winpca chromplot

Plot a principal component or heterozygosity (along with per-window stats) for a specified input chromosome.

positional arguments

PREFIX
       Prefix used for this run in winpca pca command.

REGION
       Genomic region. See winpca pca for region format specifications.

optional arguments

-p/--plot_variable
       Specify which values to plot, e.g. "1" for PC 1 or "het" for SNP heterozygosity. The default is to plot the first PC specified in PCS (modules/config.py).

-m/--metadata
       Path to a metadata file (TSV) whose first column contains the sample IDs. Additional columns will be used to annotate data in the HTML plot.

-g/--groups
       Metadata column for color-grouping. Requires -m/--metadata.

-c/--colors
       HEX codes (do not include '#') to use for color groups specified by -g/--groups. Must include all values present in the specified metadata column and be formatted like: 'group_1:ff1100,group_2:0008ff,group_3:129c00' (these are example group names and HEX codes).

-i/--interval [ 5 ]
       If set, only plot values for every nth window (10 --> 10th). This option reduces plot file size and execution time.

-f/--format [ HTML ]
       Output plot file format ('HTML', 'PDF', 'SVG' or 'PNG'). Can also be a comma-separated list to produce the same plot in more than one format, e.g. 'HTML,PDF'.

--n/--numeric
       Set flag when specifying numeric data as group (-g/--groups). This will prompt a continuous color scale from smallest values (blue) to largest values (yellow).

--r/--reverse
       Reverse plotting order (e.g. when plotting numeric data with --n/--numeric).
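The color specification accepted by -c/--colors can be checked before plotting with a few lines of code. The sketch below parses the documented 'group:HEX' format and reports metadata groups that lack a color; parse_colors and missing_groups are hypothetical helpers, not part of WinPCA.

```python
import string


def parse_colors(spec):
    """Parse 'group_1:ff1100,group_2:0008ff' into {'group_1': '#ff1100', ...}."""
    colors = {}
    for pair in spec.split(","):
        group, _, hex_code = pair.rpartition(":")
        valid = len(hex_code) == 6 and all(c in string.hexdigits for c in hex_code)
        if not group or not valid:
            raise ValueError(f"expected group:RRGGBB, got {pair!r}")
        # The '#' is added here because the CLI expects codes without it
        colors[group] = "#" + hex_code
    return colors


def missing_groups(colors, metadata_values):
    """Return metadata groups that have no color assigned."""
    return sorted(set(metadata_values) - set(colors))


colors = parse_colors("group_1:ff1100,group_2:0008ff,group_3:129c00")
print(colors)

# Every value in the grouping column needs a color
print(missing_groups(colors, ["group_1", "group_3", "group_4"]))
```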



1.6  |   winpca genomeplot

Plot a principal component or heterozygosity for multiple specified input chromosomes.

positional arguments

RUN PREFIX
       Prefix that is shared by all chromosome runs to be included in the genome-wide plot.

RUN IDs
       Comma-separated list of run IDs to include. RUN PREFIX+RUN ID+.suffix (e.g. '.pc_1.tsv.gz') will be used to load input files from different runs. The order of the specified RUN IDs determines plotting order.
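How RUN PREFIX and RUN IDs combine into input file names can be sketched with plain string concatenation. The prefix 'out/run.' and the run IDs below are made-up examples, and run_files is a hypothetical helper; only the pattern RUN PREFIX+RUN ID+.suffix and the example suffix '.pc_1.tsv.gz' come from the description above.

```python
def run_files(run_prefix, run_ids, suffix=".pc_1.tsv.gz"):
    """Build one input path per run ID, preserving the given order."""
    # Plain concatenation: any separator must already be part of the prefix
    return [f"{run_prefix}{run_id}{suffix}" for run_id in run_ids.split(",")]


# The order of the IDs is the plotting order
for path in run_files("out/run.", "chr1,chr2,chr3"):
    print(path)
```

Because the parts are simply concatenated, a shared prefix such as 'out/run.' only works if each chromosome was run with PREFIX set to 'out/run.' followed by its run ID.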

optional arguments

-p/--plot_variable
       Specify which values to plot, e.g. "1" for PC 1 or "het" for SNP heterozygosity. The default is to plot the first PC specified in PCS (modules/config.py).

-m/--metadata
       Path to a metadata file (TSV) whose first column contains the sample IDs. Additional columns will be used to annotate data in the HTML plot.

-g/--groups
       Metadata column for color-grouping. Requires -m/--metadata.

-c/--colors
       HEX codes (do not include '#') to use for color groups specified by -g/--groups. Must include all values present in the specified metadata column and be formatted like: 'group_1:ff1100,group_2:0008ff,group_3:129c00' (these are example group names and HEX codes).

-i/--interval [ 5 ]
       If set, only plot values for every nth window (10 --> 10th). This option reduces plot file size and execution time.

-f/--format [ HTML ]
       Output plot file format ('HTML', 'PDF', 'SVG' or 'PNG'). Can also be a comma-separated list to produce the same plot in more than one format, e.g. 'HTML,PDF'.

--n/--numeric
       Set flag when specifying numeric data as group (-g/--groups). This will prompt a continuous color scale from smallest values (blue) to largest values (yellow).

--r/--reverse
       Reverse plotting order (e.g. when plotting numeric data with --n/--numeric).



1.7  |   config.py

[Changing the config is not necessary for most applications] modules/config.py inside the WinPCA installation directory contains default values as well as parameter settings that can't be specified directly through command line arguments. Please be careful when adjusting default values and config settings. The current (2025-05-26) config looks like this:

'''
Configuration.
'''


## DEFAULTS
#  (overridden by CLI arguments, see documentation)

# winpca pca
VAR_FMT = 'GT'             # variant type (one of 'GT', 'GL', 'PL')
MIN_MAF = 0.01             # minor allele frequency threshold per window
W_SIZE = 1000000           # window size in bp
W_STEP = 100000            # step size in bp
N_THREADS = 2              # # of threads – only affects PCAngsd

# winpca polarize
POL_MODE = 'auto'          # polarization mode ('auto' or 'guide_samples')

# chromplot + genomeplot
PLOT_FMT = 'HTML'          # output format (one of 'HTML', 'PDF', 'SVG', 'PNG')
PLOT_INTERVAL = 5          # plot only every nth value (5th if specifying 5)



## SETTINGS
#  (these can only be changed here, i.e. no CLI arguments)

# pca
PCS = [1, 2]               # PCs to output
PCANGSD_EM_EIG = 2         # sets PCAngsd '-eig' parameter (should be >= PCS)
PCANGSD_EM_ITER = 100      # max EM iterations to perform (0 --> ngsTools like)
GT_MIN_VAR_PER_W = 20      # min # of variants per window
GL_PL_MIN_VAR_PER_W = 100  # min # of variants per window
VCF_PASS_FILTER = True     # include only PASS sites (disable with --np)
SKIP_MONOMORPHIC = True    # skip invariant sites (uninformative for PCA)
GT_MEAN_IMPUTE = True      # mean-impute missing genotypes
PROC_TRAIL_TRUNC_W = True  # process truncated trailing window if present

# polarize
N_PREV_WINDOWS = 5         # # of previous windows to use for polarization

# chromplot
CHROMPLOT_W = 1200         # plot width in pixels
CHROMPLOT_H = 500          # plot height in pixels

# genomeplot
GENOMEPLOT_W = 12000       # plot width in pixels
GENOMEPLOT_H = 5000        # plot height in pixels

While the parameters in the DEFAULTS section are all command line options and explained above, the SETTINGS section contains additional parameter settings that determine the behaviour of the program, and they may be adjusted if appropriate.
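The relationship between the DEFAULTS section and the command line can be pictured as config constants that serve as argparse defaults. The following is a generic sketch of that pattern, not WinPCA's actual argument parser; the program name winpca-sketch is made up, and the two constants are local stand-ins for the config values shown above.

```python
import argparse

# Local stand-ins for two entries of the DEFAULTS section
W_SIZE = 1000000
MIN_MAF = 0.01


def build_parser():
    """Create a parser whose defaults come from the config constants."""
    parser = argparse.ArgumentParser(prog="winpca-sketch")
    parser.add_argument("-w", "--window_size", type=int, default=W_SIZE)
    parser.add_argument("-m", "--min_maf", type=float, default=MIN_MAF)
    return parser


# An explicit CLI value wins; anything not given falls back to the config value
args = build_parser().parse_args(["-w", "500000"])
print(args.window_size, args.min_maf)
```

Values in the SETTINGS section have no such command line counterpart, so under this pattern they could only be changed by editing the file itself.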

PCS
       In most cases, PC1 will load the most interpretable axes of variation in windowed analyses. While WinPCA internally captures PCs 1-10, by default, output files are produced for PC1 and PC2 only. This behavior can be changed and any combination of available PCs (1-10) can be specified. E.g. [2, 4, 9] would generate output .tsv.gz files for PC2, PC4 and PC9. Other modules are also aware of this setting. winpca polarize will by default polarize all specified PCS while winpca flip, winpca chromplot and winpca genomeplot will take the first specified one. PCS should not be changed during a single WinPCA analysis since modules subsequent to winpca pca rely on this setting to infer what data is available.

PCANGSD_EM_EIG
       Sets PCAngsd '-eig' parameter, which determines the number of eigenvectors used to model allele frequencies. Setting it to 0 disables iterative approach and the covariance matrix, similar to automatic determination of eigenvectors (see PCAngsd documentation).

PCANGSD_EM_ITER
       Determines the maximum number of EM iterations in the PCAngsd emPCA() function. Setting it to 0 means no iteration (see PCAngsd documentation).

GT_MIN_VAR_PER_W
       Defines the minimum number of variants per window required to perform a PCA. If fewer than the specified number of variants remain after internal filtering (e.g. the min_maf filter), no PCA is performed, and the window is represented as missing data in the output. For GT data, >20 variable sites often still provide meaningful results, even though more sites are preferred, and it is not recommended to lower the threshold further.

GL_PL_MIN_VAR_PER_W
       Like GT_MIN_VAR_PER_W, but a slightly more stringent threshold for PCAs on genotype likelihood input data (GL, PL).

VCF_PASS_FILTER
       This only affects VCF input. The default behaviour is to include only sites that are annotated with a PASS filter. Set VCF_PASS_FILTER to False to disable this behaviour. The --np ("no pass filter") flag of winpca pca overrides VCF_PASS_FILTER and sets it to False.

SKIP_MONOMORPHIC
       This only affects hard-called genotype (GT) input data. It is generally recommended to exclude monomorphic/invariant sites from PCA. Set SKIP_MONOMORPHIC to False to disable invariant sites being removed before PCA.

GT_MEAN_IMPUTE
       This only affects the handling of missing data with hard-called genotype (GT) input. The default behaviour is to impute missing calls with the mean of the non-missing genotypes for the respective site. If four samples A, B, C and D had GT calls 0/0, 0/1, 1/1 and ./. for a given site, these would encode homozygous ancestral, heterozygous, homozygous derived and missing, respectively. Internally (and in a TSV GT input file) they would be represented as 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived) and -1 (missing).

       Since PCA algorithms do not handle missing data well, any site/variant with at least one missing GT call (encoded as ./. or -1) is removed from a window if GT_MEAN_IMPUTE is set to False. This has the advantage that only high-quality variant calls are processed, and would be the preferred option in an ideal world. However, missing data is usually present even in high-quality datasets, and with a growing number of samples, ever more sites will have at least one sample with a missing call. Ultimately, this will often cause the majority of sites to be removed from the analysis, even if 99% of individuals have high-quality calls for these sites.

       For this reason, the default behaviour (GT_MEAN_IMPUTE = True) is to mean-impute missing genotype calls on the fly to avoid losing most of the information. Mean imputation is one of the simplest genotype imputation strategies and has limitations. In the above example, the mean of the non-missing calls, (0 + 1 + 2)/3 = 1, would be filled in for the missing call. This allows variants to be included despite missing genotype calls, but samples with high missingness may be misrepresented in the output. While mean imputation is a workaround and can produce acceptable results, we recommend either imputing high-missingness datasets with a more sophisticated approach (e.g. with Eagle2) prior to running WinPCA, or supplying genotype likelihoods (GL/PL, winpca pca option -v/--var_format) to circumvent the issue of missing variant calls entirely.

PROC_TRAIL_TRUNC_W
       Sometimes variants are left between the end of the last window and the end of the specified <REGION> (winpca pca), but there is not enough space (or not enough variants in --x mode) to fit another window. This setting determines whether such a "truncated trailing window" is processed or ignored.

N_PREV_WINDOWS
       This affects the auto polarization method to harmonize the signs of principal components (eigenvectors) across windows of a chromosome. N_PREV_WINDOWS determines the number of previous windows to consider for the polarization decision of the current window while moving from the left-most to the right-most window. It might be worth trying larger values in specific cases.

CHROMPLOT_W, CHROMPLOT_H, GENOMEPLOT_W , GENOMEPLOT_H
       Define the plot dimensions of the output (unit: pixels). For vectorized output (PDF, SVG) and HTML this can be used to alter the aspect ratio and to control relative font size.
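The worked example given for GT_MEAN_IMPUTE (samples A to D with genotypes 0, 1, 2 and a missing call) can be reproduced in a few lines. This is a minimal sketch of per-site mean imputation and of the alternative of dropping incomplete sites; the function names are hypothetical and the code is not taken from WinPCA.

```python
MISSING = -1  # encoding of a missing GT call, as described for GT_MEAN_IMPUTE


def mean_impute_site(genotypes):
    """Replace missing calls with the mean of the non-missing calls of a site."""
    called = [g for g in genotypes if g != MISSING]
    if not called:
        return None  # nothing to impute from
    mean = sum(called) / len(called)
    return [mean if g == MISSING else g for g in genotypes]


def prepare_sites(sites, impute=True):
    """Either impute missing calls or drop every site with a missing call."""
    if impute:
        imputed = [mean_impute_site(site) for site in sites]
        return [site for site in imputed if site is not None]
    return [site for site in sites if MISSING not in site]


sites = [
    [0, 1, 2, MISSING],  # samples A, B, C, D from the example
    [0, 0, 1, 2],        # a site without missing calls
]
print(prepare_sites(sites, impute=True))   # both sites kept, D filled with the mean
print(prepare_sites(sites, impute=False))  # the site with a missing call is dropped
```

With imputation enabled, sample D receives (0 + 1 + 2)/3 = 1 at the first site; with imputation disabled, that site is lost entirely, which illustrates the trade-off described for this setting.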
