3 | Tutorial - MoritzBlumer/winpca GitHub Wiki

Chromosomal inversions in Anopheles gambiae

Below is an example of a typical WinPCA run. We will be using a simplified and subsampled VCF of 256 Anopheles gambiae individuals, all data is derived from the publicly available phase 2 dataset of the Ag1000G project.


Setup

To get started, download the tutorial files.

wget https://github.com/MoritzBlumer/winpca_tutorial/releases/download/v1.0.0/winpca_tutorial.zip
unzip winpca_tutorial.zip -d winpca_tutorial && rm winpca_tutorial.zip
cd winpca_tutorial

There should be three files (type ls -1): the variant file (VCF), a metadata file and a file containing chromosome names and sizes:

anopheles.chrom_sizes.tsv
anopheles.metadata.tsv
anopheles.vcf.gz

Lets inspect the VCF file first by typing zcat anopheles.vcf.gz | less -S:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##contig=<ID=2L,length=49364325>
##contig=<ID=2R,length=61545105>
##contig=<ID=3L,length=41963435>
##contig=<ID=3R,length=53200684>
##contig=<ID=X,length=24393108>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	AB0085-C	AB0103-C	AB0104-C	AB0108-C
2L	160999	.	A	T	.	PASS	.	GT:PL	1/1:1126,87,0	1/1:1562,117,0	1/1:1705,132,0	1/1:621,48,0
2L	163050	.	T	G	.	PASS	.	GT:PL	1/1:1569,120,0	1/1:1386,108,0	1/1:1410,111,0	1/1:317,27,0
2L	165717	.	A	G	.	PASS	.	GT:PL	0/1:502,0,393	1/1:1124,99,0	1/1:1243,111,0	1/1:418,36,0
2L	166624	.	T	A	.	PASS	.	GT:PL	1/1:1272,99,0	1/1:1851,144,0	1/1:1487,120,0	1/1:462,39,0
2L	178181	.	C	T	.	PASS	.	GT:PL	1/1:703,54,0	1/1:1523,114,0	1/1:1640,123,0	1/1:358,30,0
2L	181138	.	T	C	.	PASS	.	GT:PL	1/1:1214,96,0	1/1:1869,144,0	1/1:1337,102,0	1/1:489,39,0
2L	182301	.	A	T	.	PASS	.	GT:PL	1/1:1129,96,0	1/1:1459,120,0	1/1:712,27,0	1/1:381,30,0
2L	185562	.	C	A	.	PASS	.	GT:PL	1/1:1011,84,0	1/1:718,69,0	1/1:1183,105,0	1/1:244,24,0
2L	186055	.	C	A	.	PASS	.	GT:PL	1/1:775,63,0	1/1:1316,108,0	1/1:1195,93,0	1/1:441,36,0
2L	186818	.	C	T	.	PASS	.	GT:PL	1/1:1314,105,0	1/1:1238,93,0	1/1:1378,105,0	1/1:414,33,0
[...]

Have a look at the VCF format specifications. From the header, we can learn about the reference chromosomes and their lengths contained in this file and about the variant encoding. Both genotype calls (GT) and phred-scaled genotype likelihoods (PL) are included. We will use hard-called genotypes in this tutorial, but will briefly describe how PLs could be used instead. The first ten data lines are shown above (and for space reasons only four genotype fields are displayed here). Only PASS sites that have passed all quality filters are present. The format field again tells us that for each variant both GT and PL are included. For example, at the first included position (2L:160999) sample AB0085-C has a homozygous call (1/1) for the alternative allele T. The phred-scaled corresponding genotype likelihoods are 1126,87,0.


Windowed PCA

The winpca pca subcommand serves to compute principal components in windows along the genome. It has to be executed separately per chromosome (or scaffold/contig). Let's first have a look at the help text by typing winpca pca -h:

usage: winpca pca [-h] [-s] [-w] [-i] [-m] [-p] [-g] [-v] [-t]
                  [--np] [--x]
                  <PREFIX> <VARIANT_FILE> <REGION>

positional arguments:
  <PREFIX>              prefix for this WinPCA analysis
  <VARIANT_FILE>        path to variant file: VCF(.gz), TSV(.gz) or BEAGLE(.gz)
  <REGION>              genomic region in format "chrom:start-end"

options:
  -h, --help            show this help message and exit
  -s, --samples     ​ ​ ​ ​ comma-separated list of samples OR file with one
                        sample ID per line
  -w, --window_size 
                        window size in base pairs [1000000]
  -i, --increment   ​ ​ ​ ​ step size in base pairs [10000]
  -m, --min_maf     ​ ​ ​ ​ minor allele frequency threshold [0.01]
  -p, --polarize    ​ ​ ​ ​ sign polarization strategy: auto, guide_samples or
                        skip [auto]
  -g, --guide_samples 
                        one or more (comma-separated) samples to guide PC
                        polarization (used only if "guide_samples" is selected for
                        -p/--polarize)
  -v, --var_format  ​ ​ ​ ​ variant format: GT, GL or PL (GL/PL invoke PCAngsd)
                        [GT]
  -t, --threads     ​ ​ ​ ​ number of threads (multi-threading is only
                        available with PCAngsd, i.e. GL/PL) [2]
  --np                  disable VCF PASS filter (overrides modules/config.py)
  --x                   use SNP count (not bp) for window size (-w) and increment
                        (-i); automatically activates mean imputation and disables
                        MAF filter

We must specify at least a prefix for the output files, the path to the variant file and the region. Additionally, we might want to include only a subset of samples that are in the variant file (-s), adjust the window size or step size (-w and -i), change the default minor allele frequency threshold (-m) or polarize using a guide sample (-p + -g). Let's for now stick with the default parameters and specify variant file, region (even if we want to include the whole chromosome) and a prefix for the output files (let's just use the chromosome name). For chromosome 2L of size 49,364,325 bp, this can be achieved by the following prompt:

winpca pca 2L anopheles.vcf.gz 2L:1-49364325

This will generate four additional files (ls -1); Have a look at the output file format specifications where they are explained in detail.

2L.hetp.tsv.gz
2L.pc_1.tsv.gz
2L.pc_2.tsv.gz
2L.stat.tsv.gz
anopheles.chrom_sizes.tsv
anopheles.metadata.tsv
anopheles.vcf.gz

Chromosomal plot

Next, we will have a look at the output by generating a chromosome-wide plot of PC1 values.Let's start with looking at the help menu for chromosome plots (winpca chromplot -h):

usage: winpca chromplot [-h] [-p] [-m] [-g] [-c] [-i] [-f] [--n] [--r]
                        <PREFIX> <REGION>

positional arguments:
  <PREFIX>            prefix for this WinPCA analysis
  <REGION>            genomic region in format "chrom:start-end"

options:
  -h, --help          show this help message and exit
  -p, --plot_var  ​ ​ ​ ​ specify what to plot, e.g. "1" for PC 1 or "het" for
                      SNP heterozygosity (default: first PC) [1]
  -m, --metadata  ​ ​ ​ ​ path to metadata TSV; first column must contain
                      sample IDs, all other columns are used to annotate data in
                      HTML plot
  -g, --groups    ​ ​ ​ ​ metadata column for color-grouping (requires
                      -m/--metadata)
  -c, --colors    ​ ​ ​ ​ HEX codes (drop "#"); check documentation for
                      formatting instructions (requires -g/--groups)
  -i, --interval  ​ ​ ​ ​ plot every nth window (e.g. "5" --> every 5th window)
                      [5]
  -f, --format    ​ ​ ​ ​ output plot format: HTML, PDF, SVG or PNG (can be a
                      comma-separated list)
  --n, --numeric      set flag to apply a continuous color scale (requires
                      numerical data for -g/--groups)
  --r, --reverse      set flag to reverse the plotting order (requires -g/--groups)

To color individuals by population in the output plot, we will use the supplied metadata file, let's first check it out by typing head anopheles.metadata.tsv:

sample_id       population      country location        sex     LATN    LONE    coverage
AB0103-C        BFgam   Burkina Faso    Bana    F       11.233  -4.472  31.85
AB0104-C        BFgam   Burkina Faso    Bana    F       11.233  -4.472  33.23
AB0211-C        BFgam   Burkina Faso    Bana    F       11.233  -4.472  35.06
AB0117-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  24.3
AB0118-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  34.57
AB0119-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  26.69
AB0126-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  22.76
AB0127-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  30.3
AB0128-C        BFgam   Burkina Faso    Bana    M       11.233  -4.472  30.44

We will provide the metadata file through -m and specify the 'population' column, for -g. Since this is for initial exploration of the results, we plot only every 10th window to reduce compute time and file size. Per default, PC1 is plotted and and the output format is HTML (this can be changed with options -p and -f). Let's create a plot for chromosome 2L, specifying the PREFIX '2L' that we used in the winpca pca run to help the chromplot module locate the input files. We furthermore specify the same REGION again.

winpca chromplot 2L 2L:1-49364325 -m anopheles.metadata.tsv -g population -i 10

After successful completion, an HTML file should have been created (2L.pc_1.html). Let's open it in a web browser and have a look: winpca_tutorial_chromplot_pc1

Each individual is represented by a line, and colors correspond with the four populations. The small panel on top shows the percentage of variance explained by each individual. We can see what looks like a polymorphic inversion at ~ 20-43 Mbp. This region also corresponds with an increased percentage of explained variance. Individuals homozygous for the inverted or non-inverted configuration sorting into the top and bottom bands, and heterozygous individuals forming the intermediate band. Try zooming in and out to inspect areas of interest or hide and show data groups by clicking on them in the legend.

Let's have a look at the rate of heterozygosity along the genome next, which has already been calculated. We only need to adjusting the plotting prompt by specifying 'het' for the -pargument:

winpca chromplot 2L 2L:1-49364325 -m anopheles.metadata.tsv -g population -i 10 -p het
winpca_tutorial_chromplot_hetp

This time, we can see two bands in the inversion region – the individuals heterozygous for the two inversion haplotypes (i.e. the ones from the central band in the PC1 plot) are forming the top band. Since they combine two divergent haplotypes in the inversion region they carry more heterozygous sites! You can spot check this for a few samples IDs using the hover display in both plots.


Batch processing

Let's now move on and run windowed PCAs and generate plots for all five chromosomes. We can do this in a loop by parsing chromosome names and sizes from the provided anopheles.chrom_sizes.tsv file. This time we will plot every 5th window (-i) for a slightly higher resolution. This will overwrite our previous 2L.pc_1.html file.

while IFS= read -r LINE ; do
  CHROM=$(echo "$LINE" | cut -f1)
  SIZE=$(echo "$LINE" | cut -f2)
  winpca pca ${CHROM} anopheles.vcf.gz ${CHROM}:1-${SIZE}
  winpca chromplot ${CHROM} ${CHROM}:1-${SIZE} -m anopheles.metadata.tsv -g population -i 5
done < <(cat anopheles.chrom_sizes.tsv)

Once all runs have completed without error, make sure that all expected output files are present (ls -1).


Flip windows/chromosomes manually

While browsing the new output files, you may realize that the auto-polarization wasn't 100% effective, and a few regions, e.g. 44450000-59900000 and the entire chromosome 3L should be flipped. When browsing at higher resolution, a few smaller regions will come up. These can be fixed using the winpca flipcommand. Let's look at the help menu first (-h):

usage: winpca flip [-h] [--r] [-w] [-c] <PREFIX>

positional arguments:
  <PREFIX>              prefix for this WinPCA analysis

options:
  -h, --help            show this help message and exit
  --r, --reflect        set flag to reflect the entire chromosome, i.e. to flip all windows (--r/--reflect is
                        applied independently from -w/--windows and they can be used together
  -w, --windows     ​ ​ ​ ​ comma-separated list of positions (e.g. 100000) and/or regions (e.g.
                        100000-250000) to flip OR file with one position/region per line
  -c, --components  ​ ​ ​ ​ principal components to flip (default: first PC) [1]

Again, we must provide the PREFIX to identify different runs, then we can use -w and --r to flip specific windows or entire chromosomes.

winpca flip 2R -w 44450000-59900000
winpca flip 3R --r
winpca flip X -w 1400000,4150000-4300000,5600000-8150000,20400000-21100000,21250000-23800000

Genome-wide plot

Having fixed the polarization, let's finally generate a PC1 plot of all chromosomes. This time, we manually specify the plotting order and plot color for the different populations. We need to supply a RUN_PREFIXthat is shared by all previous runs that we want to include. In our case, the run prefixes are the chromosome names and we can just specify './' to indicate that no shared prefix is used. and pass the a comma-separated list of chromosome names as the RUN_IDs. We furthermore specify HEX codes (without the '#') for all populations (-c) whereby the sequence of color specifications also determines plotting order of the groups. Finally, we also export a PDF version of the plot (-f).

# set custom colors and plotting order
COLORS='UGgam:5698f5,BFgam:fc8d05,GNgam:5db064,GHgam:c7342a'

# genomeplot
CHROMS='2L,2R,3L,3R,X'
winpca genomeplot ./ $CHROMS -g population -m anopheles.metadata.tsv -i 50 -f PDF,HTML -c $COLORS 

The genome-wide plots should now be present as genomeplot.pc_1.html and genomeplot.pc_1.pdf:

image
⚠️ **GitHub.com Fallback** ⚠️