Bipartition method for ranked synthesis - OpenTreeOfLife/treemachine GitHub Wiki

Here in lies a description of a simple algorithm for constructing synthetic trees based on ranks using bipartitions.

##Extractions of bipartitions For each rooted tree for analysis, extract bipartitions.

For the trees

((a,b),c);
((a,d),c);
((a,c),d);
((a,c),b);

We get (ignore the first part, that is just output from my scripts). The second column is the biparition id. I will use that below

test.tre 0  a,b c
test.tre 1  a,d c
test.tre 2  a,c d
test.tre 3  a,c b

##Bipartition comparisons For each of the bipartitions, conduct an all by all comparison recording the conflicts between each bipartition.

For the above set, we get these conflicts (they are symmetrical)

a,b c conflicts with  a,c b
a,d c conflicts with  a,d c

##Resolve conflicts Start with a set of all biparitions. We will remove conflicting biparitions in order of rank. Assume we have a ranking of the sources and we have a hash of the bipartition (id) and the set of bipartitions to which this conflicts. While this hash has members, proceed to the next conflicting set in order of rank. Remove from the treeset all of the conflicting biparitions. Remove from the hash the conflicting set (both the focal bipartition and those that were found in conflict). When no more sets remain, all bipartitions in the treeset will be compatible and can construct a tree.

##Specific issues to treemachine and nexson processing While this procedure is not specific to treemachine, there are aspects of constructing synthesis in opentree that warrant comment.

  • Mapping to deepest node: Because we have trees with tips that start at different parts of the taxonomy, we map to the deepest node (so if you have kangaroo sister to homo sapiens, we will map the kangaroo to marsupials and homo to placentals). For the above procedure, we would output the entire taxonomy mrca of marsupials instead of either the ott id for marsupials or the one for kangaroos. Because the kangaroo one is not informative, it may make no difference. This should be checked.
  • Nexson processing will need to be done in treemachine because of the taxonomy processing that occurs there and the trimming of tips that are not mapped, etc.
  • Once a tree or subtree is produced (this only needs to be done in areas with overlapping taxa so plants by themselves, animals, etc.) this can then be loaded as is into treemachine. The only additional step in treemachine will be the placement of taxonomy.
  • An additional step will be required to store the necessary information of supported trees so the web application can give the supported edges.

##Complete aside In the output, I abbreviate consider the bipartitions with 0,1,and 2s. So 1100 would be a bipartition with the first two taxa as part of a clade. So for example, with name order a,b,c,d, 1100 would be created from any of these trees (((a,b),c),d) or ((a,b),(c,d)) or ((a,b),c,d) and 1010 would be with (a,c) in a clade together. I will use 2 to mean that the taxon was not present in the tree. So 1210 would be ((a,c),d) and the taxon b was not present in that tree.

##Examples ###First one Tree set is

0 ((a,b),c);
1 ((a,d),c);
2 ((a,c),d);
3 ((a,c),b);

Ranks are

[0, 2, 1, 3]

Conflicts are

{0: set([3]), 1: set([2]), 2: set([1]), 3: set([0])}

Compatibles are (though not necessary to calculate separately)

{0: set([1, 2]), 1: set([0, 3]), 2: set([0, 3]), 3: set([1, 2])}

Resolving conflicts by ranks looks like

rank 0 test.tre_0 in conflict
(set(['a', 'b']), set(['c']))
	removing (set(['a', 'c']), set(['b']))
rank 1 test.tre_2 in conflict
(set(['a', 'c']), set(['d']))
	removing (set(['a', 'd']), set(['c']))

Final bipartitions

1102 (set(['a', 'b']), set(['c']))
1210 (set(['a', 'c']), set(['d']))

Consensus tree

(d,(c,(a,b)))

###Second Tree set is

0 ((a,b),c);
1 (((a,c),d),e);
2 (((e,f),g),v,a);

Ranks are

[0, 2, 1]

Conflicts are

{}

final bipartitions

10122222 (set(['a', 'b']), set(['c']))
11201222 (set(['a', 'c', 'd']), set(['e']))
11200222 (set(['a', 'c']), set(['e', 'd']))
22212110 (set(['e', 'g', 'f']), set(['v']))
22212010 (set(['e', 'f']), set(['g', 'v']))

Consensus tree

(v,(d,(c,(a,b))),(g,(e,f)))

###Third Trees set is

0 (((a,b),c),d);
1 ((d,e,f),(g,h));
2 ((d,g),(a,f));

Ranks

[2, 0, 1]

Bipartition set (so the ids make sense)

0 test3.tre_0 11120222
1 test3.tre_0 10120222
2 test3.tre_1 22211010
3 test3.tre_1 22200101
4 test3.tre_2 02221102
5 test3.tre_2 12220012

Conflicts

{2: set([4]), 4: set([2])}

Resolving conflicts

rank 0 test3.tre_2 in conflict
(set(['d', 'g']), set(['a', 'f']))
	removing (set(['e', 'd', 'f']), set(['h', 'g']))

final bipartitions

11120222 (set(['a', 'c', 'b']), set(['d']))
10120222 (set(['a', 'b']), set(['c', 'd']))
22200101 (set(['h', 'g']), set(['e', 'd', 'f']))
02221102 (set(['d', 'g']), set(['a', 'f']))
12220012 (set(['a', 'f']), set(['d', 'g']))

Consensus tree (my script seems to make a knee in there)

(e,d,(h,(g)),(f,(c,(a,b))))