gmyc_exercise - Pas-Kapli/tutorials GitHub Wiki

Intorduction

The Generalized Mixed Yule Coalescent (GMYC) model is a very popular methods for single locus species delimitation that was initially introduced in 2006 (Pons et al., 2006). The GMYC method (Fujisawa and Barraclough, 2013) uses a speciation (Yule, 1925) and a neutral coalescent model (Hudson, 1990). It strives to maximize the likelihood score by separating/classifying the branches of an ultrametric tree (in units of absolute or relative ages) into two processes; within and between species.

Input: Binary (fully resolved), ultrametric (all tips are equally distant from the root) tree

Software: R package "splits" and web-service

Necessary software for this tutorial:

install "splits" :

launch R

$ R 

install dependencies:

>install.packages("ape", dependencies=T)
>install.packages("MASS", dependencies=T) 
>install.packages("paran", dependencies=T)

or

> install.packages(c("ape", "paran", "rncl"))

install splits:

> install.packages("splits", repos="http://R-Forge.R-project.org")

We will also need:

BEASTv1.8.3

Figtree

Tracer

mafft

RAxML

Exercise: Species delimitation with the single-threshold GMYC

Working directory: ~/workshop_exercises/distance_methods/branchiomma/gmyc

Note: If you didn't create this directory during the linux tutorial create it now using mkdir

1.1. Estimate alignment

$ mafft BR_cob_57ind.fasta > BR_cob_57ind_mafft.fasta

BR_cob_57ind.fasta

BR_cob_57ind_mafft.fasta

1.2. Remove identical sequences:

$ raxmlHPC-PTHREADS-SSE3 -f c -s BR_cob_57ind_mafft.fasta -m GTRGAMMA -n read_alignment

A reduced file is automatically created: BR_cob_57ind_mafft.fasta.reduced in phylip format

Check the difference:

$ head -n 1 BR_cob_57ind_mafft.fasta.reduced

$ grep ">" BR_cob_57ind_mafft.fasta | wc -l

Q: How many sequences were in the fasta file and how many are there in the reduced phylip file?

Convert phylip to fasta:

$ phylip_to_fasta.py BR_cob_57ind_mafft.fasta.reduced

phylip_to_fasta.py

The output file will be named BR_cob_57ind_mafft.fasta.reduced.fasta, rename it into BR_cob_57ind.reduced.fasta:

$ mv BR_cob_57ind_mafft.fasta.reduced.fasta BR_cob_57ind.reduced.fasta

1.4. Infer ultrametric tree using Beast v.1.8.2:

There are many methods to convert a phylogenetic tree (i.e., branch lengths represent expected number of substitutions per site) to ultrametric (i.e., branch lengths represent time) such as PATHd8 (Britton et al., 2007), r8s (Sanderson, 2003) and methods that estimate the ultrametric tree from the alignment such as Beast (Drummond & Rambaut, 2007). Empirical data suggest that GMYC provides the best results when combined with Beast ultrametric trees (Tang et al., 2014).

1.4.1 Prepare Beast file with Beauti:

Start Beauti

$ ~/phylogeny/BEASTv1.8.2/bin/beauti

File -> Import Data and select the fasta file BR_cob_57ind.reduced.fasta

Tab "Sites": Set the substitution model to HKY + Gamma (Site Heterogeneity Model) in the ideal case you should run a model selection (e.g. with jModelTest) analysis to decide which of the available models is the most fitting for your dataset

Tab "Clocks": For the exercise we will first use the "Clock model: Strict"

The "Lognormal relaxed clock" assumes that the evolutionary rate may vary across the tree according to a log-normal distribution. This option is suitable when we expect that many species are involved in the phylogeny that may evolve at different rates. However, if the dataset includes only few and closely related species the strict clock might be a good option too. This model is simpler and assumes that the substitution rate is constant across the entire tree.

Tab "Trees": For the exercise we will use the "Speciation: Yule process"

The tree prior expresses our assumption on the branching process through time for a group of organisms. The two main options is whether to use a speciation or a coalescent model. The speciation models (pure birth and birth-death models) are more approapriate for datasets involving many species while the coalescent model are intended for datasets involving one or perhaps few very closely related species.

The GMYC method uses the branching times to identify the speciation and the coalescent processes. Therefore, the clock and the tree prior are both very important parameters in getting accurate delimitation results.

Tab "MCMC": Set the chain length to 10000000, and change the file name to "BR_cob_57ind.reduced_Strict"

Once everything is ready click the button on the down right corner "Generate Beast File". The pop-up window notifies us which priors are set to their default values, click "Continue" and save the BR_cob_57ind.reduced_Strict.xml file to a new folder named Strict

Check the effect of the clock prior by performing one more run under the relaxed log-normal clock prior. Save the xml in a new folder called LN with the name BR_cob_57ind.reduced_LN.xml

1.4.2 Run Beast

$ ~/phylogeny/BEASTv1.8.2/bin/beast -beagle_off BR_cob_57ind.reduced_Strict.xml

omit the -beagle_off option if the beagle library is available

1.4.3 Check the performance of each run with Tracer:

$ tracer BR_cob_57ind.reduced_Strict.log

BR_cob_57ind.reduced_Strict.log

BR_cob_57ind.reduced_LN.log

In a good run the ESS values will be high (> 200) and the trace plot for all the parameters should show that a certain plateau was reached. When this is not happening we could try running the analysis longer, sample more frequently, make sure that are priors are appropriate and that there is not something wrong with our data.

Posterior trees under the strict clock: BR_cob_57ind.reduced_Strict.trees.gz and under the relaxed log-normal: BR_cob_57ind.reduced_LN.trees.gz

Extract the tree files:

$ gzip -d BR_cob_57ind.reduced_Strict.trees.gz

1.4.4 Calculate the Maximum Clade Credibility Tree with Treeannotator

$ ~/phylogeny/BEASTv1.8.2/bin/treeannotator -heights median -burninTrees 1000 BR_cob_57ind.reduced_Strict.trees BR_cob_57ind.reduced_Strict.tre

Check the output tree with figtree:

$ figtree BR_cob_57ind.reduced_Strict.tre

Annotated trees: BR_cob_57ind.reduced_Strict.tre and BR_cob_57ind.reduced_LN.tre

1.5. Estimate the number of species with the GMYC method:

$ R
> library(splits)

load tree

> tr_strict <- read.nexus("BR_cob_57ind.reduced_Strict.tre")

plot tree

> plot(tr_strict)

estimate the number of species:

> res_strict <- gmyc(tr_strict)

check the results:

> summary(res_strict)

Result of GMYC species delimitation

	method:	single
	likelihood of null model:	185.4816
	maximum likelihood of GMYC model:	202.3221
	likelihood ratio:	33.68108
	result of LR test:	4.855644e-08***

	number of ML clusters:	9
	confidence interval:	7-9

	number of ML entities:	15
	confidence interval:	12-16

	threshold time:	-0.01305178

Check the species assignment:

> spec.list(res_strict)
   GMYC_spec sample_name
1          1      BR_057
2          1      BR_055
3          2      BR_107
4          2      BR_006
5          2      BR_016
6          2      BR_088
7          2      BR_108
8          3      BR_054
9          3      BR_065
10         3      BR_059
11         4      BR_075
12         4      BR_074
13         5      BR_080
14         5      BR_062
...      ...      ...
39        14      BR_048
40        15      BR_076

Visualize the results with plots:

> plot(res_strict)

Save the plots as a pdf file:

> pdf("gmyc_strict.pdf")
> plot(res_strict)
Hit <Return> to see next plot: 
Hit <Return> to see next plot: 
Hit <Return> to see next plot: 
> dev.off()

Perform the GMYC species delimiation with the BR_cob_57ind.reduced_LN.tre and the Carabus_mafft_Strict_HKY.tre

⚠️ **GitHub.com Fallback** ⚠️