vcf2structure - christianparobek/cambodiaWGS GitHub Wiki
Unfortunately, our VCF file must be converted to STRUCTURE format for both STRUCTURE and adegenet analysis.
Fortunately, this incredible software called PGDSpider will do this, and MANY other file format conversions. This software eliminates the need for genetics and bioinformatics PhD programs. PGDSpider requires Java Runtime Environment (JRE) 6 or newer and has GUI and CLI modes.
For this conversion, I used the GUI mode because it walks you through making the required .spid file.
Although PGDSpider is awesome, it's not perfect. It converted our VCF file to STRUCTURE format, but it did not incorporate population information, instead naming all populations "1". To fix this problem, I made an awk script:
> gawk '{OFS = "\t"}
{if ($1 ~ /OM/) $2="1"}
{if ($1 ~ /BB/) $2="2"}
{if ($1 ~ /KP/) $2="3"}
{print}' good69.pass.str | sed -e 's/ [ ]*/\t/g' > 81k.str
# set the population column (field 2) by sample name: OM -> 1, BB -> 2, KP -> 3
# some awk implementations cap the number of fields per line, so use gawk
# the sed command converts each run of spaces into a single tab
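A quick sanity check on the relabeling is to tally how many rows ended up in each population. The sketch below assumes the output is tab-delimited with the population code in column 2, as the gawk script above produces; `count_pops` is a hypothetical helper name, not part of PGDSpider or STRUCTURE.

```shell
# Hypothetical helper: tally rows per population code (column 2)
# in a tab-delimited STRUCTURE file.
# Usage: count_pops FILE
count_pops() {
    awk -F'\t' '{ n[$2]++ } END { for (p in n) print p "\t" n[p] }' "$1" | sort
}

# Example: count_pops 81k.str
```

The tallies should match the number of OM, BB, and KP rows you expect. Note that a sample whose name matches none of the three patterns would silently stay in population 1, so an inflated count for "1" is worth investigating.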
I also double-checked a few SNP values from the STRUCTURE file against the corresponding calls in the VCF file, and PGDSpider appears to be getting the calls right.
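For this kind of spot check, it helps to pull a single call out of the STRUCTURE file by sample name and SNP position. This is only a sketch under a few assumptions: the file is tab-delimited, columns 1 and 2 hold the sample name and population, and the SNP columns follow in the same order as the VCF records. `str_call` is a hypothetical helper name.

```shell
# Hypothetical helper: print the call(s) for one sample at one SNP.
# Assumes columns 1-2 are sample name and population, so SNP k is column k+2.
# Usage: str_call FILE SAMPLE_ID SNP_INDEX   (SNP_INDEX is 1-based)
str_call() {
    awk -F'\t' -v id="$2" -v snp="$3" '$1 == id { print $(snp + 2) }' "$1"
}

# Example: str_call 81k.str SOME_SAMPLE_ID 5
```

If the file stores two rows per individual, this prints one line per row. How alleles are coded in the output depends on the conversion settings, so translating the printed value back to the VCF genotype still has to be done by eye.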
Next, subsample the SNPs with this simple bash script:
> string=`shuf -i 3-81837 -n 10000 | tr '\n' ','`
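# select 10k random column numbers, without repeats, in the range 3-XXXXX
# (columns 1 and 2 hold the sample name and population, so SNPs start at column 3)
# return a comma-separated list, rather than a \n-separated list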
string1=${string::-1}
# remove the trailing comma after the last sampled number (this substring syntax needs bash 4.2 or newer)
cut -f 1,2,$string1 81k.str > 10k.str
# grab columns 1 and 2 plus the 10,000 sampled columns
# cut always outputs fields in file order, so the sampled SNPs conveniently stay in their original order
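The steps above can also be wrapped in one small function, which makes it easier to redo the subsampling with a different file or sample size. This is a sketch, not the script used for the analysis: `sample_columns` is a hypothetical name, and it swaps `tr` plus the comma-trimming step for `paste -sd,`, which joins the numbers without leaving a trailing comma.

```shell
# Hypothetical helper: keep columns 1-2 plus N randomly chosen data columns.
# Usage: sample_columns FILE LAST_COL N
sample_columns() {
    local cols
    # shuf picks N distinct column numbers from 3..LAST_COL;
    # paste -sd, joins them with commas (no trailing comma to strip)
    cols=$(shuf -i "3-$2" -n "$3" | paste -sd, -)
    # cut emits the selected fields in file order
    cut -f "1,2,$cols" "$1"
}

# Example: sample_columns 81k.str 81837 10000 > 10k.str
```

Afterwards, `awk -F'\t' '{print NF}' 10k.str | sort -u` should print a single value, 10002 (two label columns plus 10,000 SNPs), if every row was cut cleanly.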