Generating Genomes - thekswenson/Zombi_wiki GitHub Wiki

Basic Mode (G)

To generate a genome it is first necessary to simulate a Species Tree using Zombi. If you want to use your own species tree, please refer to the Ti mode. Then, you can run the basic mode by typing:

python3 Zombi.py G ./Parameters/GenomeParameters.tsv ./Output_folder

To simulate genomes, Zombi starts with an ancestral genome at the origin of time, with a given number of genes. In the current version, every gene family present in this genome has a single copy.

A genome is an ordered collection of genes. So if we begin with a genome that has 4 genes, what we see is something like

Position Gene_family Orientation ID
0 1 + 1
1 2 - 1
2 3 + 1
3 4 + 1

The meaning of this is:

  • Position: The position in the genome. The genome is circular, so in this example position 3 is adjacent to positions 2 and 0
  • Gene_family: The identifier of the gene family
  • Orientation: The orientation of that gene in the genome
  • ID: The identifier of the gene (one ID for each node of the gene tree -- see the tip below for details)
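As a rough illustration of the table above, a genome can be modelled as an ordered list whose index is the position, with adjacency wrapping around because the genome is circular. The `Gene` tuple and the `neighbours` helper are hypothetical names used only for this sketch; they are not part of Zombi.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """One entry of the genome table (hypothetical container, not Zombi's own class)."""
    family: int       # Gene_family
    orientation: str  # "+" or "-"
    gene_id: int      # ID


# The 4-gene example genome from the table; the list index is the Position.
genome = [Gene(1, "+", 1), Gene(2, "-", 1), Gene(3, "+", 1), Gene(4, "+", 1)]


def neighbours(genome, position):
    """Return the two positions adjacent to `position` on a circular genome."""
    n = len(genome)
    return ((position - 1) % n, (position + 1) % n)


print(neighbours(genome, 3))  # (2, 0): the last position wraps around to the first
```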

Genomes evolve undergoing a series of events:

  • D: Tandem Duplications. A segment of the genome is duplicated. The new copy is inserted next to the old one.
  • U: Duplications. A segment of the genome is duplicated. The new copy is inserted anywhere in the genome.
  • L: Losses. A segment of the genome is lost.
  • T: Transfers. A segment of the genome is transferred to a contemporary species. The segment is inserted at a random position. Transfers can be replacement transfers, where a segment of genes is replaced in the target genome. Transfers "leave" from a donor chromosome and "arrive" at a receptor chromosome, so they are denoted in files as LT and AT.
  • P: Transpositions. A segment of the genome changes its position within the genome.
  • I: Inversions. The order of the genes in a segment is inverted, along with their transcriptional orientations.
  • O: Originations. A new gene family appears, being inserted at a random position.

The rates are genome-wise. For instance, a duplication rate of 3 means 3 duplication events per genome per unit of time.
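To make the meaning of genome-wise rates concrete, here is a minimal Gillespie-style sketch: the waiting time to the next event in one genome is drawn from an exponential distribution whose rate is the sum of all event rates, and the event type is chosen in proportion to its rate. This is an assumption about how such rates are typically simulated, not a description of Zombi's internal code; `next_event` and the example rates are hypothetical.

```python
import random


def next_event(rates, rng):
    """Draw the waiting time and the type of the next event for one genome.

    `rates` maps an event code (e.g. "D") to its genome-wise rate,
    in events per genome per unit of time.
    """
    total = sum(rates.values())
    waiting_time = rng.expovariate(total)  # exponential with mean 1 / total
    codes = list(rates)
    event = rng.choices(codes, weights=[rates[c] for c in codes])[0]
    return waiting_time, event


rng = random.Random(0)
rates = {"D": 3.0, "L": 2.0, "T": 1.0}  # illustrative values only
print(next_event(rates, rng))
```

With a duplication rate of 3 and no other events, the average waiting time between duplications in a genome would be 1/3 of a time unit.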

There is also an additional rate for each event (except for originations) called the extension_rate, which controls how many contiguous genes are affected simultaneously by an event. By default, it corresponds to the p parameter of a geometric distribution (you can also use a fixed number or a uniform distribution).
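The effect of the extension rate can be sketched by sampling segment lengths from a geometric distribution with parameter p. The sketch assumes the distribution counts from 1 (so every event affects at least one gene) and has mean 1/p; `sample_extension` is a hypothetical helper, not a Zombi function.

```python
import math
import random


def sample_extension(p, rng):
    """Sample a segment length >= 1 from a geometric distribution with parameter p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p == 1:
        return 1  # every event affects exactly one gene
    u = rng.random()  # uniform in [0, 1)
    # Inverse-transform sampling: P(length = k) = (1 - p) ** (k - 1) * p
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1


rng = random.Random(42)
lengths = [sample_extension(0.5, rng) for _ in range(10)]
print(lengths)
```

Under this assumption, a value of p close to 1 yields events that mostly affect a single gene, while smaller values yield longer segments on average.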

Once the full evolution of the genomes has been simulated, Zombi also prints the gene trees associated with the different gene families, all the events taking place in each gene family, the events taking place in each branch and the genomes of each node in the species tree.

There are two other events that do not depend intrinsically on the genomes but on the species tree that is used to simulate genome evolution:

  • S: Speciation. When a genome arrives at a speciation node, the genome is divided and continues to evolve in both descendant branches
  • E: Extinction. When a genome arrives at an extinction event, the genome stops evolving

An example of how genomes evolve can be seen in the next figure:

[Figure: example of genome evolution along the species tree]

In this figure, we can see the Initial genome (IG), the Root genome (R), the ancestral genomes (one for each inner node of the Species Tree) and the genomes in the surviving leaves. Different events modify the genome composition. The genes affected are represented next to the letter indicating the event. For example, there is a loss event in the branch leading to n6 affecting the green gene.

[!TIP] Advanced details regarding gene identifiers: skip this if you are reading for the first time.

Events that introduce nodes in the topology of the gene tree (Duplications, Transfers, Losses, Speciations and Extinctions) change the identifier of the gene. For example, say that at the root we have a gene whose identifier is 1. If the genome undergoes a speciation event, each descendant branch inherits its own copy of the gene from that same gene family: one whose ID is 2 and another whose ID is 3. When a gene is transferred, both the copy remaining in the donor genome and the copy in the recipient genome receive new identifiers. This makes it easy to track the events that have given rise to different tree topologies. Inversions and transpositions do not change the tree topology and therefore do not change the identifiers of the affected genes.

Output

G/Genomes: A folder with one file per node of the species tree. Each file contains information about the genome composition.

G/Gene_families: A folder with one file per gene family. Each file contains information about the events taking place in that gene family. There are 3 fields.

  • 1. Time: The time at which the event takes place
  • 2. Event: The type of event that takes place at a given time (S, E, D, T, L, I, P, O and F. F stands for Final, meaning that the gene survived until the end of the run)
  • 3. Nodes: Some more information about the kind of event:

  • S, D and T: 6 fields separated by semicolons. This can be better understood by looking at the picture:

[Figure: the six semicolon-separated fields of an S, D or T event]

  • L, I, P, O and F: 2 fields separated by semicolons. First, the species tree branch where the event takes place and second, the identifier of the gene affected
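The three-field layout described above can be read with a few lines of Python. This is only a sketch: it assumes the fields are tab-separated, the sample line is invented for illustration, and `parse_family_event` is a hypothetical helper. The six fields of S, D and T events are kept as a plain list because their meaning is given by the figure.

```python
def parse_family_event(line):
    """Split one line of a Gene_families file into time, event code and node fields."""
    time, event, nodes = line.rstrip("\n").split("\t")  # assumed tab-separated
    fields = nodes.split(";")
    record = {"time": float(time), "event": event, "nodes": fields}
    if event in {"L", "I", "P", "O", "F"}:
        # Two fields: species tree branch, then identifier of the affected gene
        branch, gene_id = fields
        record["branch"] = branch
        record["gene_id"] = gene_id
    return record


# Invented example line: a loss of gene 3 in branch n2 at time 0.53
print(parse_family_event("0.53\tL\tn2;3"))
```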

G/Gene_trees: A folder containing the gene trees corresponding to the evolution of the different families and the gene trees pruned so that only surviving genes are represented.

There are two types of trees:

  • _completetree.nwk: A tree showing the complete evolution of that gene family
  • _prunedtree.nwk: A tree in which the genes that have not survived until the present time have been removed. Normally you want to use this tree!

It is also possible to output the reconciled trees in RecPhyloXML format (see the RECONCILED_TREES parameter).

G/Events_per_branch: (Not output by default) A folder with one file per branch of the species tree. Each file contains information about the events taking place in that branch. The codes are similar to those explained above, but not identical. There are two main differences (for the sake of clarity). The first one is that transfers are divided into:

  • LT: Leaving Transfers. Transfers that leave this branch
  • AT: Arriving Transfers. Transfers that arrive at this branch.

The second difference is that the affected gene is identified as:

GeneFamily_GeneIdentifier

So, for example, if we open the file n2_branchevents and find the event L affecting 4_3, this means that the gene whose identifier is 3, belonging to family 4, was lost in that branch at the time given in the first column.
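A label of the form GeneFamily_GeneIdentifier can be split back into its two parts as follows. `split_gene_label` is a hypothetical helper, and the sketch assumes that neither part contains an underscore.

```python
def split_gene_label(label):
    """Split 'GeneFamily_GeneIdentifier' into (family, gene identifier) strings."""
    family, gene_id = label.split("_")  # raises ValueError if the label is malformed
    return family, gene_id


print(split_gene_label("4_3"))  # ('4', '3'): gene 3 of family 4
```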

Please also note that an event affecting several genes is recorded as several events sharing the same time in the first column.

G/Geneorder_events_per_branch: (Not output by default) A folder with one file per branch of the species tree. Each file contains information about the gene-order events taking place in that branch. There are 4 fields.

  • 1. Time: The time at which the event takes place
  • 2. Event: The type of event (D, U, LT, AT, L, I, P, O), where LT is a "leaving transfer" in a donor chromosome and AT is an "arriving transfer" in the receptor chromosome. LT is always followed by an AT. An arriving transfer is a "replacement transfer" (see REPLACEMENT_TRANSFER) if it is preceded by a loss L of those genes, at the same time.
  • 3. Breakpoints: The gene-order positions involved in the event, in 0-based indexing. If the genome has genes +1 +2 +3 +4 and an inversion event I happens with breakpoints 1 and 2, then the resulting genome is +1 -3 -2 +4.
  • 4. Chromosome: The chromosome name where the event occurs.
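The inversion example given for the Breakpoints field can be reproduced with a short sketch. Genes are represented as signed integers, the breakpoints are treated as inclusive 0-based positions, and segments that wrap around the circular genome are not handled; `invert` is a hypothetical helper, not a Zombi function.

```python
def invert(genome, start, end):
    """Invert the genes at 0-based positions start..end (inclusive).

    The order of the segment is reversed and each orientation is flipped.
    """
    segment = [-gene for gene in reversed(genome[start:end + 1])]
    return genome[:start] + segment + genome[end + 1:]


print(invert([1, 2, 3, 4], 1, 2))  # [1, -3, -2, 4]
```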

G/Profiles: (Not output by default) This folder contains a file called Copy_number_profiles.tsv, with the nodes of the species tree in the columns and the gene families in the rows. Each entry gives the number of copies that a gene family has in a node of the species tree.
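A table laid out this way can be loaded into a nested dictionary with the standard csv module. The sketch assumes a header row whose first cell labels the family column and whose remaining cells are node names; the sample content is invented for illustration, and `read_profiles` is a hypothetical helper.

```python
import csv
import io


def read_profiles(handle):
    """Read a copy-number table into {family: {node: copies}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    nodes = header[1:]  # assumed: first header cell labels the family column
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        family = row[0]
        profiles[family] = {node: int(count) for node, count in zip(nodes, row[1:])}
    return profiles


# Invented sample with two gene families and two nodes
sample = io.StringIO("GENE_FAMILY\tn1\tn2\n1\t1\t2\n2\t0\t1\n")
profiles = read_profiles(sample)
print(profiles["1"]["n2"])  # 2
```

For a real run, the same function could be called on an open file handle for Copy_number_profiles.tsv, after checking that the actual header matches the assumption above.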

Parameters

DUPLICATION, TRANSFER, LOSS, INVERSION, TRANSPOSITION, ORIGINATION

The rate of each type of event (events per genome per unit of time).

DUPLICATION_EXTENSION, TRANSFER_EXTENSION, LOSS_EXTENSION, INVERSION_EXTENSION, TRANSPOSITION_EXTENSION

The extension rates, which control how many contiguous genes are affected simultaneously by an event.

REPLACEMENT_TRANSFER

A number between 0 and 1 controlling the probability of replacement transfers (they only happen if there is a homologous position in the recipient genome)

ASSORTATIVE_TRANSFER

If True, the recipient lineage is chosen with a probability proportional to exp(-alpha * d), where d is the normalized phylogenetic distance between the donor and the candidate recipient. This option can be used to make transfers more likely between closely related lineages. The higher the alpha parameter, the more noticeable this effect will be.
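The weighting described above can be sketched as follows. The lineage names and distances are invented, and `recipient_weights` and `choose_recipient` are hypothetical helpers that only illustrate the formula; they do not reproduce Zombi's code.

```python
import math
import random


def recipient_weights(distances, alpha):
    """Unnormalized weight exp(-alpha * d) for each candidate recipient lineage."""
    return {lineage: math.exp(-alpha * d) for lineage, d in distances.items()}


def choose_recipient(distances, alpha, rng):
    """Pick a recipient with probability proportional to its weight."""
    weights = recipient_weights(distances, alpha)
    lineages = list(weights)
    return rng.choices(lineages, weights=[weights[name] for name in lineages])[0]


# Invented normalized distances from the donor to three contemporary lineages
distances = {"n3": 0.1, "n5": 0.5, "n8": 1.0}
print(recipient_weights(distances, alpha=10))
print(choose_recipient(distances, alpha=10, rng=random.Random(0)))
```

With alpha set to 0 all weights are equal, so every contemporary lineage is equally likely to receive the transfer.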

ALPHA

The alpha value used when ASSORTATIVE_TRANSFER is True (see the previous parameter)

INITIAL_GENOME_SIZE

Number of gene families present in the initial genome

MIN_GENOME_SIZE

The minimal size for a given genome. Smaller genomes will not be affected by loss events

GENEORDER_EVENTS_PER_BRANCH

0 or 1, indicating whether to output the gene-order events per branch

EVENTS_PER_BRANCH

0 or 1, indicating whether to output the events per branch

PROFILES

0 or 1, indicating whether to output the copy number profiles

GENE_TREES

0 or 1, indicating whether to output the gene trees

RECONCILED_TREES

0 or 1, indicating whether to output the reconciled trees in RecPhyloXML format