Tutorial #1: Generating data to test Molecular Clocks inference - thekswenson/Zombi_wiki GitHub Wiki

In this tutorial, we show how Zombi can be used to generate data to test the reliability of Molecular Clock analyses. You can use the parameters files found in the folder Tutorial/Tutorial1.

Please bear in mind that this is just a small tutorial on Zombi and by no means a thorough analysis of Molecular Clock inference!

The goal of this tutorial is simulating some data that allows us to test the efficacity of Molecular Clocks to predict the real speciation dates of a dataset. We will generate some sequences that violate the molecular clock assumption of constant substitution rate. We will see the impact also of adding a calibration point, by sampling one of the dead lineages generated by Zombi

Simulating the Species Tree

To begin, let us compute a Species Tree using

python3 Zombi.py T ./Tutorials/Tutorial1/SpeciesTreeParameters.tsv MC_test

This generates a Complete Tree, i.e. a tree with dead lineages and extant clades too. You can inspect the file "SpeciesTreeParameters" and check a few things. We changed the STOPPING_RULE to 1, to make the total time 100 (time units), from the initial genome to the leaves.

You can see this tree (in Newick format) in any TreeViewer (I personally use SeaView, which I strongly recommend). This is what I have:

Complete Tree

The Newick tree can be found in here

If you paste that in SeaView and tick on "Br support" you can find the name of the inner branches too (I will refer to them later in this same tutorial)

You can see that many lineages do not make until the present time. These are extinct lineages.

Another thing to notice is that the total branch length of this tree from the root to the leaves is < 100. Why is it? The reason is that part of the simulation takes place before the first speciation event (and SeaView does not recognize the branch length that comes before the root). If you want to get the total age of this tree from the leaves to the roots, we can inspect the file T/Events.tsv. There we see that the first speciation event occurs at 7.93. The total age of the Crown group in this example is then 100 - 7.93 = 92.07

Simulating the (very simple) Genome

Then, we will simulate some sequences evolving along the tree. In a Molecular Clock analysis we want to use good markers of the Species Tree and for that reason, we are going to simulate a genome in which no events are taking place. Transfer, duplication and losses alter the topology of the gene trees making them different from the species tree. We will simulate a single gene tree reproducing the same topology of the Species Tree

python3 Zombi.py G ./Tutorials/Tutorial1/GenomeParameters.tsv MC_test

You can see that we changed a few things in the file GenomeParameters.tsv

We set the event rates to 0 in all cases, meaning that no events occur in those genomes. Then, we set the genome size to 1. A single gene is simulated.

Now we can inspect the folder G/Gene_trees/

We find two files, corresponding to the complete version of the gene tree and the pruned version

The file 1_completetree.nwk contains a gene tree that matches perfectly the topology (and branch length) of the species tree

The file 1_prunedtree.nwk contains the same tree but the extant lineages have been pruned

Simulating the Sequence. Changing the Scaling parameter

Finally, we want to simulate the sequence evolving under the gene tree. We will simulate protein sequences.

The most important parameter we want to study is the SCALING parameter. By default is 1. Since our gene total branch length from the Initial Genome (the starting point of the stem) to the leaves is 100, this means that, on average, each position in the sequence will have 100 substitutions.

This is clearly too high (if this were a real example with this substitution rate, we couldn't even group those sequences as homologous in the first place). For this reason, we change the scaling parameter to a smaller quantity.

Let us say that we want to have a tree with a total branch length of 0.2. This means that from the to the initial genome to the leaves, each position in the sequence has experience on average 0.2 substitutions.

We will set the SCALING to 0.2 / 100 = 0.002

In the file Tutorial1/SequenceParameters.tsv the SCALING parameter is already set to that value. I also changed the default WAG to LG, and the default length of the simulated sequence to 500.

Using the Su mode

We could run now the Sequence basic mode (S), but this will make things too easy. If we did this, we would generate a sequence evolving according to the strict molecular clock assumption (meaning that the substitution rate is constant through time and through time, one assumption that we have seen is violated most of the time)

Instead, let us use the advanced mode Su. To use this mode, you need first to run the script RateCustomizer

python RateCustomizer.py S Tutorials/Tutorial1/SequenceParameters.tsv MC_test

This creates a new folder. We go to

/MC_Test/CustomRates

And we open the file (just to have a look) GT_Subsitution_rates

We see the only gene family present in our experiment and a scaling parameter specific for that family. We leave it the way it is. We could control for different families evolving at different paces by changing the substitution rate multiplier value

Then, we open the file ST_Subsitution_rates

In this file, we have a complete list of all the branches of the Species Tree along with a multiplier that was generated randomly. What does this mean? It means that the effective substitution rate taking place along a given branch of the species tree is modified according to this value. The values there have been generating using a normal distribution with a mean of 0.7 and a standard deviation of 0.2 (you can modify this too at the bottom of the file SequenceParameters.tsv)

For example:

The branch n47 has a length of 34.943. Due to the scaling, we would expect that in that branch of the Species Tree, the gene family accumulates ~0.07 substitutions per site. However, in the file ST_Substitution_rates we see that the branch n47 has an associated rate multiplier of 0.34, making the effective substitution rate in that branch ~0.02 (34.943 * 0.002 * 0.34).

With a bit of scripting (or patience, opening the file in a text editor), you can change the values here and try very different multipliers. You could try for example to use very large substitution rate multipliers to certain specific branches of the Species Tree and see if this misleads the analysis.

In our case, we are going to change the rate substitution multipliers of some branches at the bottom of the tree. We set the branches n1, n4, n8, n10, n12 and n18 to 10. If you don't want to do it yourself, you could replace your file with the file in here

Finally, time to simulate the sequences using:

python ../ZOMBI/Zombi.py Su SequenceParameters.tsv MC_test

Please notice that the mode to use is Su and not S! Otherwise, Zombi will ignore the custom rates that we just created.

Bonus. Testing the Molecular Clock.

I will use PhyloBayes because it is the software I a most familiar with. It is pretty straightforward to use and it includes many nice features. Check PhyloBayes.

We have already generated some dataset we can use to test how good are Molecular Clock analysis. The rest of the tutorial is an example analysis. You don't have to reproduce the steps I took, only understand what I do!

Some boring details (that you can skip)

I created a folder where I put the alignment (that I converted to Phylip format to make it compatible with PhyloBayes, using SeaView) and the Extant Tree. To clarify, the alignment is the file 1_pruned.fasta, containing the sequence of the extant clades exclusively.

Before running the analysis I had to perform some routinary and boring bioinformatic tricks to make the things run. First one was making the names in the alignment be the same that the names that we have in the species tree (I did this using some tricks in Vim, removing the characters that go after the underscore). Second, I had to remove the inner nodes of the Extant Tree, otherwise, PhyloBayes does not accept the tree (I wrote a script to do this)

Then, I launch PhyloBayes. I use the UGAM model, and I will say that I have a rather good idea of where the root of my tree is. I set a prior for the root of 100, 30. I used the CAT model with the Poisson exchangeabilities, but I could have used LG too (if you are still reading you probably know what I am talking about!). The command in PhyloBayes looks like:

pb -ugam -rp 100 30 -d Aln.phy -T MyTree.nwk chain1

I leave it running for a while...

And then I check the trace. After 1000 iteration the chain seems to have reached a plateau

We read the information of the chain by running

readdiv -x 1000 20 chain1

Which creates a consensus chronogram and a bunch of other files

Comparing Real dates with Inferred dates

With a bit of Python scripting and making use of the ETE3 library (a great library for the phylogeneticist toolbox), I parsed the results from Zombi and the results from PhyloBayes

InferrenceWithoutCalibrations

Given that we are using simulated data, we do know the real age of the nodes (I wrote Million of Years as the time unit, but it could be any other time unit actually). We see that without any calibration and only the uninformative prior for the age of the root, PhyloBayes has a hard time guessing the right age of the speciation nodes.

Adding a calibration

What happens if we add a single calibration? Is this enough to estimate correct rates even in the presence of quite a few clades that are evolving substantially faster than usual? To answer that, I am going to assume that there is a fossil for the species n130. In our tree, the species n130 lives from time 53.58 and goes extinct without descendants at time 76.85 (you can see that very easily in the file T/Events.tsv).

Let us say that this species leaves a fossil exactly halfway through its existence (65.215). We retrieve that fossil at the present time and we are able to date it (using radioactive decay or any other absolute dating technique). We see that the fossil is 100 - 65.215 = 34.785 million years old (or any other unit of time you find useful!)

If we are able to place the fossil in the phylogeny, we will see that the most common ancestor of that lineage that is present in the extant phylogeny is the lineage n22. We can say then, that the minimum constraint for the age of that node (node n22) is 34.785.

We write that constraint in a file and now we launch PhyloBayes with the command:

pb -ugam -rp 100 30 -d Aln.phy -T MyTree.nwk -cal Calib.txt chain_calib1

Now the results look like this

Calib

We see that the correct ages fall always in the 95% Confidence Interval estimate by PhyloBayes (though almost always in the lower side!)

Where to go from here?

This tutorial was only a toy example of the analyses that can be performed with Zombi. A more thorough study of Molecular Clock would imply generating different datasets, using a different range of parameters and a more clear definition of the goal. Zombi makes these sort of analyses very easy to perform.