Deprecated 17. Calling RADSeq SNPs with Stacks - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Calling variants using reduced representation sequencing is often done via RADSeq. For this exercise, you will work through a data set we used for examining some gopher populations in Texas. For any analysis such as this, the basic pipeline is as follows:
Step 1. Process the raw sequence data using the process_radtags function in Stacks. This step identifies the indices (sequence patterns in the reads that identify each sample or population), checks for the restriction enzyme cut site, and handles the adapters used. It also demultiplexes the data and writes the filtered reads to a directory of your choice.
Fortunately for us, and for most everyone these days, this first step is usually performed by the sequencing facility, but if you ever need to do it yourself, you can find the instructions at this site.
Step 2. Map the reads to a reference genome, if you have one.
I've already had you do this multiple times. So, for the sake of streamlining I'm going to just give you these files. I indexed the genome assembly for one member of this species and mapped the reads using bowtie2. Be warned, if you have lots of samples and lots of data per sample, this process can take days.
Step 3. Analyze the data using the two main tools in Stacks, gstacks and populations. gstacks will call SNPs for each sample, and populations will generate population-level statistics.
Step 4. Filter the resulting SNP data using your predetermined criteria before moving on to downstream analyses and comparisons. We've already done this kind of filtering as part of a previous exercise, so we'll skip it.
Step 5. While not absolutely necessary, we will also examine the data a little more closely by using a software package called Structure. Structure examines genetic data using a Bayesian framework to identify evidence of distinct allele distributions. We will also use some related packages to visualize the results.
THE SETUP
Install stacks with conda (https://anaconda.org/bioconda). A command along the lines of 'conda install -c bioconda stacks' should work.
You'll need one top directory and the subdirectories listed below. You know how to make them.
/lustre/scratch/[eraider]/gge2022/gopher_rad
/lustre/scratch/[eraider]/gge2022/gopher_rad/bin
/lustre/scratch/[eraider]/gge2022/gopher_rad/info
/lustre/scratch/[eraider]/gge2022/gopher_rad/sorted_bams
/lustre/scratch/[eraider]/gge2022/gopher_rad/gstacks
/lustre/scratch/[eraider]/gge2022/gopher_rad/populations
/lustre/scratch/[eraider]/gge2022/gopher_rad/structure/harvester
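If you want a shortcut, the whole tree above can be built with one mkdir -p loop. In this sketch, BASE is a stand-in for your own /lustre/scratch/[eraider]/gge2022 path:

```shell
# BASE stands in for /lustre/scratch/[eraider]/gge2022 -- substitute your own path.
BASE="${BASE:-$PWD/gge2022}"

# mkdir -p creates intermediate directories as needed, so one loop builds the
# entire gopher_rad tree, including structure/harvester two levels down.
for d in bin info sorted_bams gstacks populations structure/harvester; do
    mkdir -p "$BASE/gopher_rad/$d"
done
```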
You should also copy a bin directory with some files you will need.
cp /lustre/scratch/daray/gge2022/bin/rad/* /lustre/scratch/[eraider]/gge2022/gopher_rad/bin
RUNNING STACKS
For the steps below, read and answer the questions before actually running the associated commands. That's important because you need to make a decision on how to set up your working environment and whether or not you want to run the commands using an interactive session or with a submission script. Either way, time should not be an issue if you set yourself up correctly.
This command runs gstacks and should be run from the gopher_rad directory: gstacks -I sorted_bams/ -M bin/bursarius_popmap.txt -O gstacks -t 36 --details
Starting here, submit your responses to Assignment 17 - Stacks.
- Using the 'gstacks' documentation available at the stacks website, explain what each part of this command does. Also take some time to explain what the gstacks module actually does.
Once that's done, run the populations module. FYI: this takes a significant amount of time. I don't remember exactly how long, but it was over an hour. You should probably run it as a submission script if your connection is not stable.
populations -P gstacks --popmap bin/bursarius_popmap.txt -O populations -r .75 -t 36 --vcf --structure --hwe --fstats --fasta-samples --genepop --fasta-loci --ordered-export --phylip --smooth --fasta-samples-raw --write-single-snp
- Using the 'populations' documentation available at the stacks website, explain what each part of this command does. Also take some time to explain what the populations module actually does.
- Use head to get all data for Locus 1 in the populations.samples.fa file. How many individuals are polymorphic for that locus? What are the identities of those individuals? What is the polymorphism? In other words, how are the alleles different from one another? Are the individuals homozygous or heterozygous for that polymorphism?
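If Locus 1 weren't at the top of the file, head alone wouldn't be enough. A small awk filter can pull every record for a given locus instead. This is a sketch on a toy FASTA, assuming Stacks-style headers that begin '>CLocus_<id>_Sample_<id>_...'; check your own file with head first:

```shell
# Toy stand-in for populations/populations.samples.fa with two loci.
printf '>CLocus_1_Sample_1_Locus_1_Allele_0\nTTACG\n>CLocus_2_Sample_1_Locus_2_Allele_0\nGGGGG\n' > toy.fa

# At each header line, awk decides whether to keep the record; the sequence
# lines that follow inherit the decision made for their header.
awk '/^>/{keep = ($0 ~ /^>CLocus_1_/)} keep' toy.fa
```

Only the CLocus_1 header and its sequence should print.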
There are a bunch of files that describe the data and how they fit together. Structure is used to make sense of those data by determining how many distinct genetic groups are likely to exist. The details of such a population genetic analysis are beyond the scope of this course, but a brief description is necessary. Structure will examine the data under various scenarios called K, where K is the presumed number of genetic subdivisions that you suggest. Typically, you suggest K values ranging anywhere from 1 to a value that makes some sort of sense in the context of what you already know. In this case and for this data set, we were interested in evaluating whether the subspecies designations currently applied make sense in a genetic context. We sampled from four previously named subspecies and another presumed species.
- GBAM - Geomys bursarius ammophilus
- GBBR - G. b. brazensis
- GBDU - G. b. dutcheri
- GBMA - G. b. major
- GBSA - G. breviceps sagittalis
So, for this exercise we'll test whether that number is supported by the genetic data, whether some of those subspecies should be combined under a single name, or whether any subspecies should be further subdivided into additional named subspecies. We'll do this by testing K values from 2-10.
Structure is a software package that does this. Install 'structure' from bioconda. While you're at it, also install a helper script called 'structureharvester'.
Unfortunately, the native .structure output we asked for from populations isn't quite what structure actually looks for, so we need to modify it a little. cd to your 'structure' directory and enter the following commands exactly and you'll be ready to move on.
tail -50 ../populations/populations.structure | cut -d$'\t' -f 1-100002 >pop_mod100000.str
cut -f 1 pop_mod100000.str >list.txt
cut -f 2- pop_mod100000.str | sed 's|0|-9|g' >data.txt
paste list.txt data.txt > pop_mod100000.str
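To see what those four commands accomplish, here's a toy version with 3 genotype columns instead of 100,002. Structure codes missing data as -9, and the genotype codes in this file are single digits, so 's|0|-9|g' safely recodes missing calls; the sample-name column is split off beforehand so that a 0 inside a sample name isn't mangled. The toy_ file names here are illustrative only:

```shell
# Toy stand-in for populations.structure: one sample row, three genotype
# columns, with 0 marking a missing genotype.
printf 'sample01\t1\t0\t4\n' > toy.str

cut -f 1 toy.str > toy_list.txt                    # sample names, left untouched
cut -f 2- toy.str | sed 's|0|-9|g' > toy_data.txt  # recode missing 0 as -9
paste toy_list.txt toy_data.txt                    # recombined row: sample01 1 -9 4
```

Note that the 0 in "sample01" survives intact, which is the whole point of the split-and-paste dance.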
Now prepare to run 45 analyses: one for each value of K from 2 to 10, with 5 replicates of each (9 x 5 = 45). This series of commands will generate all of the necessary directories and files for you to use. It will also submit all of those jobs to the queue.
for r in $(seq 1 5); do for k in $(seq 2 10); do mkdir k${k}.${r}; done; done
for r in $(seq 1 5); do for k in $(seq 2 10); do cp pop_mod100000.str k${k}.${r}; done; done
for r in $(seq 1 5); do for k in $(seq 2 10); do sed "s|<k>|${k}|g; s|<r>|$r|g" ../bin/mainparams >k${k}.${r}/mainparams; sed "s|<k>|${k}|g; s|<r>|$r|g" ../bin/k.r.sh >k${k}.${r}/k${k}.${r}.sh; cp ../bin/extraparams k${k}.${r}/extraparams; done; done
for r in $(seq 1 5); do for k in $(seq 2 10); do cd k${k}.${r}; qsub k${k}.${r}.sh; cd ..; done; done
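The third loop above relies on simple sed templating: placeholder tokens <k> and <r> in the template files are filled in per job. Here's a toy version; the template line is invented for illustration (MAXPOPS is the Structure parameter that holds K):

```shell
# A one-line toy template with the same <k> and <r> placeholders used by
# bin/mainparams and bin/k.r.sh.
printf 'MAXPOPS <k>  // replicate <r>\n' > toy.template

# Fill in the placeholders for one (k, r) combination, just as the loop does
# for every combination.
k=3; r=2
sed "s|<k>|${k}|g; s|<r>|${r}|g" toy.template    # -> MAXPOPS 3  // replicate 2
```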
Each of these analyses will take several hours to complete. Once they're all done, we will need to evaluate the output to find out which value of K fits our data best. Structure is a Bayesian analysis. In short, it assumes a given scenario (a value of K) is true, evaluates your data under the assumptions of that scenario, and reports the likelihood of your data; the K with the highest likelihood is the one we choose. To get this value we'll use 'structureharvester'. It does as the name describes and harvests all of the output and generates a summary.
cd /lustre/scratch/[eraider]/gge2022/gopher_rad/structure
structureHarvester.py --dir=harvester/ --out=harvesterout --clumpp
You will see a new directory called 'harvesterout' and in that directory you'll find concatenations of all of the relevant output from each of the 45 structure runs. There will be two files for each value of k. There will also be a 'summary.txt' file. That's what we're interested in right now. Open it.
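Conceptually, choosing the best K just means taking the row of the summary with the highest mean LnP(Data). As a sketch, using a made-up two-column table of K and mean LnP (the real summary.txt has more columns, so read it rather than piping it blindly):

```shell
# Toy table: K in column 1, mean LnP(Data) in column 2. Log likelihoods are
# negative, so "highest" means least negative.
printf '2\t-4500.1\n3\t-4200.7\n4\t-4350.9\n' > toy_lnp.txt

# General-numeric sort on column 2 (ascending), then keep the K from the
# last, i.e. best, row.
sort -k2,2g toy_lnp.txt | tail -1 | cut -f 1    # -> 3
```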
- What k has the highest likelihood (LnP(Data))? That is the k that best fits your data. This suggests that there are _____ genetically distinct populations in your data.
Finally, let's take a look at a visualization of the results. For the best k, we will run a program called distruct.py and that program will generate a pdf of the results. Distruct doesn't have a simple installation. You'll just use mine. There are just a couple of additional files that are necessary. Copy them from me. Note: This gets complicated. Follow the instructions very carefully.
cp -r /lustre/scratch/daray/gge2022/bin/distruct /lustre/scratch/[eraider]/gge2022/gopher_rad
There are three files. The two 'gopher' files name the populations from their numerical designations assigned by structure and set the color scheme for up to 10 genetic groups. The 'drawparams' file sets the parameters for 'distruct' to create a figure. You should be able to see what each line means if you examine it. Another package, CLUMPP, is needed to summarize all of the data from the five runs for use by distruct.
Enter your distruct directory.
cd /lustre/scratch/[eraider]/gge2022/gopher_rad/distruct
Copy the files for the appropriate value of k into the directory.
cp ../structure/harvesterout/<the appropriate indfile> clumpp_0
cp ../structure/harvesterout/<the appropriate popfile> clumpp_1
Use 'CLUMPP' to summarize the data for 'distruct'.
First you need to alter the paramfiles, which store the parameters for the analysis. I've set up most of each file for you, but it's missing the value of k that you recently determined. Use sed to replace the placeholder with the appropriate integer. Make sure you're working in an interactive session; one processor is enough. First, you have to CLUMPP the individual results. Afterward, you will CLUMPP the population-level results.
Enter the individuals working directory.
cd /lustre/scratch/[eraider]/gge2022/gopher_rad/distruct/clumpp_0
Correct the values for k, the datatype for this run of CLUMPP, and indicate that you're CLUMPPing individuals in this run.
sed "s/<k>/WHATEVER K WE DECIDED/g" paramfile.template | sed "s/DATATYPE X/DATATYPE 0/g" | sed "s/C X/C 25/g" | sed "s/outfile/indoutfile/g" >paramfile
Then run CLUMPP.
/lustre/work/daray/software/CLUMPP_Linux64.1.1.2/CLUMPP paramfile
This will take a while, maybe 10-20 minutes. The most important output file is the .indoutfile.
Now for the population-level results. Enter the working directory
cd /lustre/scratch/[eraider]/gge2022/gopher_rad/distruct/clumpp_1
Correct the values for k, the datatype for this run of CLUMPP, and indicate that you're CLUMPPing populations (n=5) in this run.
sed "s/<k>/WHATEVER K WE DECIDED/g" paramfile.template | sed "s/DATATYPE X/DATATYPE 1/g" | sed "s/C X/C 5/g" | sed "s/outfile/popoutfile/g" >paramfile
Then run CLUMPP again in this directory. It will run much faster.
/lustre/work/daray/software/CLUMPP_Linux64.1.1.2/CLUMPP paramfile
Last step! Run 'distruct'. Just like structure and CLUMPP, there is a parameters file for editing, drawparams.template. As with CLUMPP, I've set most of it up for you. You just need to tell it what value of k to use before running 'distruct'.
Copy the necessary files from your CLUMPP runs to the main 'distruct' directory.
cd /lustre/scratch/[eraider]/gge2022/gopher_rad/distruct
cp clumpp_0/*.indoutfile .
cp clumpp_1/*.popoutfile .
sed 's|<k>|YOUR VALUE OF k|g' drawparams.template >drawparams
Run 'distruct'.
/lustre/work/daray/software/distruct1.1/distructLinux1.1
Then convert the .ps file to a .pdf.
gs -q -dSAFER -dNOPAUSE -dBATCH -sOutputFile=gopher.pdf -sDEVICE=pdfwrite -c .setpdfwrite "-c <</Orientation 3>> setpagedevice" -f gopher.ps
- How would you interpret this graph?
Note: This is a VERY simplistic version of a structure analysis. It's designed to show you just the bare basics of the process. If you are ever doing this on your own for your own research, there are additional steps to ensure that your results are statistically well supported. But that's beyond the scope of this course. Here, I'm just trying to show you the basic process.