Analysis of contemporary and historical population structure and admixture with 'dystruct' - barrettlab/2021-Genomics-bootcamp GitHub Wiki

Analysis of contemporary and historical population structure and admixture with 'dystruct'

Joseph, T. A., & Pe'er, I. (2019). Inference of Population Structure from Time-Series Genotype Data. American Journal of Human Genetics, 105(2), 317–333.

1. Convert the LD-thinned VCF file to EIGENSOFT format with PGDSpider (GUI application)

PGDSpider

Be sure to specify Unix end-of-line characters (LF) for the output file.

The format is one row per locus and one column per individual. Genotypes are coded 0, 1 or 2, and 9 marks missing data.

22222222222222222222222222222222222222112222222222222222222222222222222222222222222
99999999929999999992992222929999992922222922292122292222222292222929999222222292992
99202222222229292999922922222222229992992922222221222222222222222922299022222299299
99929919219929990999992212999999929992292292191112991122292191229919999192192099999
...
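
A quick check of the converted file can catch line-ending or conversion problems before dystruct is run. The sketch below is not part of dystruct or PGDSpider: 'check_geno' is a hypothetical helper, and it assumes the genotypes are the single characters 0, 1, 2 and 9 with no separators, as in the excerpt above.

```shell
# check_geno FILE: report the dimensions of the genotype matrix and count
# common problems. Hypothetical helper, not part of dystruct.
check_geno() {
    awk '
        /\r$/              { crlf++ }          # line still has a Windows (CRLF) ending
                           { sub(/\r$/, "") }  # strip it so the checks below ignore it
        NR == 1            { ncol = length($0) }
        length($0) != ncol { ragged++ }        # row width differs from the first row
        /[^0129]/          { badchar++ }       # character other than 0, 1, 2 or 9
        END {
            printf "loci=%d individuals=%d crlf=%d ragged=%d badchar=%d\n",
                   NR, ncol, crlf, ragged, badchar
        }
    ' "$1"
}
```

For example, 'check_geno striata_eigenstrat_dystruct.txt' prints one summary line. The three problem counts should all be 0, and the number of loci and individuals should match what you expect from the VCF.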

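2. Create the generation times file, which records when each individual was sampled, measured in generations. This is where you specify the year collected. It is best to bin the sampling times into a few time points rather than give every year its own! For an annual plant, the generation time is 1 year, so years convert directly to generations, which makes things a bit easier.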

1
1
1
1
1
...
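
As a rough illustration of the binning, the sketch below turns a list of collection years into binned generation times. Everything here is an assumption for illustration: 'years.txt' is a hypothetical file with one collection year per line (in the same order as the individuals in the genotype file), 'years_to_gentimes' is a made-up helper, and time is counted forward from the oldest sample. Check the dystruct documentation for the convention it actually expects before using the output.

```shell
# years_to_gentimes YEARS_FILE BIN_WIDTH GEN_TIME
#   YEARS_FILE: hypothetical file, one collection year per line
#   BIN_WIDTH:  bin size in years (e.g. 10 for decades)
#   GEN_TIME:   years per generation (1 for an annual plant)
# Assumption: generations are counted forward from the oldest sample.
years_to_gentimes() {
    awk -v w="$2" -v g="$3" '
        NF == 0 { next }                            # skip blank lines
        NR == FNR {                                 # pass 1: find the oldest year
            y = $1 + 0
            if (!seen++ || y < min) min = y
            next
        }
        {                                           # pass 2: bin, then convert
            bin = int(($1 - min) / w) * w           # years since oldest, rounded down to the bin
            print int(bin / g)                      # years -> generations
        }
    ' "$1" "$1"
}
```

For example, 'years_to_gentimes years.txt 10 1 > gentimes.txt' would place samples from 1900 and 1904 in the same bin (0) and a sample from 1912 in the next (10).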

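3. Edit the 'run.sh' file, specifying the genotype input file and the generation times file. Set --npops to the number of populations (K) and --nloci to the number of loci in the genotype file.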

#!/usr/bin/env bash
export OMP_NUM_THREADS=20

/usr/local/bin/dystruct/bin/dystruct --input striata_eigenstrat_dystruct.txt \
               --generation-times gentimes.txt \
               --output out \
               --npops 4 \
               --nloci 6589 \
               --seed 1145
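
Before launching a long run, it can help to confirm that the inputs agree with each other. The sketch below is a hypothetical pre-flight check, not something dystruct provides. It assumes the generation times file has one entry per individual and that --nloci should equal the number of rows in the genotype file.

```shell
# check_inputs GENO GENTIMES NLOCI
# Returns 0 if the row count matches NLOCI and the number of generation
# times matches the number of individuals (columns); otherwise prints the
# mismatch and returns 1. Hypothetical helper, not part of dystruct.
check_inputs() {
    geno=$1; gentimes=$2; nloci=$3
    rows=$(awk 'END { print NR }' "$geno")
    cols=$(awk 'NR == 1 { sub(/\r$/, ""); print length($0); exit }' "$geno")
    ntimes=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$gentimes")
    rc=0
    if [ "${rows:-0}" -ne "$nloci" ]; then
        echo "nloci is $nloci but the genotype file has $rows rows"
        rc=1
    fi
    if [ "${cols:-0}" -ne "$ntimes" ]; then
        echo "genotype file has ${cols:-0} individuals but $ntimes generation times"
        rc=1
    fi
    return $rc
}
```

With the file names used in 'run.sh', this would be called as 'check_inputs striata_eigenstrat_dystruct.txt gentimes.txt 6589'.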

4. Run the shell script, which calls dystruct/bin/dystruct

bash run.sh

It is necessary to run several replicates at the same K value, each with a different --seed, and choose the replicate with the best objective function (likelihood score).
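
One way to organize the replicates is a small loop around the same command used in 'run.sh'. This is only a sketch: 'run_replicates', the seed scheme and the output and log file names are made up for illustration, and it simply saves whatever dystruct prints to a log file so the objective can be looked up afterwards.

```shell
# Path to the dystruct binary, as in run.sh; override by setting DYSTRUCT.
DYSTRUCT=${DYSTRUCT:-/usr/local/bin/dystruct/bin/dystruct}

# run_replicates K NREPS: run NREPS replicates at K populations, each with
# its own seed, output prefix and log file (all names are illustrative).
run_replicates() {
    k=$1; nreps=$2
    rep=1
    while [ "$rep" -le "$nreps" ]; do
        seed=$((1145 + rep))        # arbitrary choice; any distinct seeds will do
        "$DYSTRUCT" --input striata_eigenstrat_dystruct.txt \
            --generation-times gentimes.txt \
            --output "out_K${k}_rep${rep}" \
            --npops "$k" \
            --nloci 6589 \
            --seed "$seed" > "dystruct_K${k}_rep${rep}.log" 2>&1 ||
            { echo "replicate $rep at K=$k failed" >&2; return 1; }
        rep=$((rep + 1))
    done
}
```

For example, 'run_replicates 4 5' would run five replicates at K = 4 and leave one log file per replicate.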

This should be repeated for several values of K (e.g. 1-10). You can then use Evanno's delta K method to choose the best K value, just like you would in STRUCTURE.
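
Once the scores are collected, delta K can be computed with a few lines of awk. The sketch below assumes a hypothetical file 'scores.tsv' that you compile yourself, with two whitespace-separated columns: K and the objective (likelihood score) of one replicate, one line per replicate. It uses a common formulation of Evanno's statistic: the absolute second difference of the mean score across K, divided by the standard deviation of the replicate scores at K. Delta K is undefined for the smallest and largest K tested, and for any K with fewer than two replicates or zero variance, so those are skipped.

```shell
# delta_k SCORES_FILE: print "K <tab> deltaK" for each interior K.
# SCORES_FILE is a hypothetical table: column 1 = K, column 2 = score.
delta_k() {
    awk '
        NF < 2 { next }                                  # skip blank or short lines
        NR == FNR {                                      # pass 1: sums and counts per K
            k = $1 + 0; n[k]++; sum[k] += $2
            if (!seen++ || k < kmin) kmin = k
            if (seen == 1 || k > kmax) kmax = k
            next
        }
        {                                                # pass 2: squared deviations
            k = $1 + 0; d = $2 - sum[k] / n[k]; ssd[k] += d * d
        }
        END {
            for (k = kmin + 1; k < kmax; k++) {
                if (!((k - 1) in n) || !(k in n) || !((k + 1) in n)) continue
                if (n[k] < 2 || ssd[k] == 0) continue    # SD undefined or zero
                sd = sqrt(ssd[k] / (n[k] - 1))
                d2 = sum[k + 1] / n[k + 1] - 2 * sum[k] / n[k] + sum[k - 1] / n[k - 1]
                if (d2 < 0) d2 = -d2
                printf "%d\t%.3f\n", k, d2 / sd
            }
        }
    ' "$1" "$1"
}
```

Running 'delta_k scores.tsv' prints one line per K, and the K with the largest delta K is the usual choice under Evanno's method.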