Increases in error that led to substantial improvements - NicMcPhee/MICS-2014-GraphDB-EC GitHub Wiki

I think that one of the interesting doors that the graph DB work opens is the ability to analyze when evolution "crosses valleys", i.e., where steps where fitness gets worse (error increases) leads to steps where fitness then gets better (error decreases).

The query

I wrote a query that searches through the DB and finds such events on the root ancestry line of a specified individual (although I think there's a minor problem – see below):

start w=node(996270)
match (s)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF]->(t)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*0..]->(w)
where s.penalizedFitness < t.penalizedFitness
match pDown = (q)-[downs:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*]->(s)
where q.penalizedFitness > s.penalizedFitness
      AND all(rel in downs where endNode(rel).penalizedFitness < q.penalizedFitness)
match pUp = (p)-[ups:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*]->(q)
where p.penalizedFitness < q.penalizedFitness
      AND p.penalizedFitness > s.penalizedFitness
      AND all(rel in ups where startNode(rel).penalizedFitness <= q.penalizedFitness
      AND endNode(rel).penalizedFitness >= p.penalizedFitness)
match (n)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF]->(p)
where n.penalizedFitness > p.penalizedFitness

with distinct nodes(pUp) + nodes(pDown) as allNodes, p.penalizedFitness as startFitness, s.penalizedFitness as endFitness, p, q
with allNodes, startFitness, endFitness, p, q
order by (startFitness - endFitness)/(startFitness * length(allNodes)) DESC

return (startFitness - endFitness)/(startFitness * length(allNodes)), extract(x in allNodes | { id:id(x), pfit: x.penalizedFitness }) limit 10;

The overall goal is that the path from p->q->s should go up from p to q, and then down again from q to s. More specifically, this query finds a path of the form (assuming all edges are root path edges)

n->p-*->q-*->s->t-*->w

where

w is the end node (the "winner" for the 10K run in this case).
the t-*->w bit is just to ensure that the whole path is on the the root ancestry line from w
the fitness of q is greater (worse) than the fitness of every node from q->s. Since s<t, this means that the path from q to s is overall decreasing/improving (although it might locally go up and then back down again).
The fitness of q is greater (worse) than the fitness of every node from p->q, ie., that path is overall increasing/getting worse (although it again might locally go down and back up again).

I think the query is nearly correct, but it allows for flat spots on either end of the p->q->s region (which I don't really want), and at least once I think it went up again at the end. I think the solution is to add two more single edges on either side of the paths to allow us to be more precise about the relationships at the ends of the desired paths.

I also think there may be an issue at the ends. I'm not sure we can include t=w or n in generation 1, for example, unless fitness improves from generation 1 to 2, or gets worse from generation 99 to 100.

All the WHERE and math at the end sorts the paths by

(startFitness - endFitness)/(startFitness * length(allNodes))

which is the average fitness improvement per generation as a proportion of the fitness at the start of the path.

Example results

These plot show instances from the 10K run where the fitness got worse before getting (substantially) better. There are in fact the 7 sections of the root parent path that went up before going down and have the best per-generation improvement relative to the fitness at the start of the path.

It would be really interesting to compare the trees at the start of these paths to the trees at the end of these paths and see what happened there. It would probably be easiest to do this for the shorter paths first, and since the most of these are just sections of the longer paths, understanding the short ones would probably help us ultimately understand the longer ones.

Note! The generations on the x-axes on these graphs are all off by one (they're one too low). I used floor in the script above to get the generation based on the node ID, when I should have used ceil. So, for example, the first plot actually shows changes from generations 5 to 8, not 4 to 7 as the labels would have you believe.

Analyzing trees

Generations 5 - 8

The tree for generation 5 is

[/, /, -, x, -, /, 
-1.881440281867981, 
+, x, 2.946767807006836, x, -, -, 1.788446307182312, x, -4.794104099273682, -, /, -4.824312925338745, +, -3.4913867712020874, x, x]

and the tree for generation 6 is

[/, /, -, x, -, /, 
+, x, x, 
+, x, 2.946767807006836, x, 
-, -, 1.788446307182312, x, -4.794104099273682, 
-, /, -4.824312925338745, +, -3.4913867712020874, x, x]

The change is that the constant -1.881440281867981 was replaced with the expression (+ x x). This made the fitness worse but (I'm guessing) creating an opportunity to modify that non-constant addition expression in important ways. If that's true that suggests that the context semantics at that location in the initial tree were interesting in some way, allowing us to tune the behavior of the function without totally destroying it.

That "analysis" might have been complete nonsense, as the next change had nothing to do with that part of the tree. The tree for generation 7 is

[/, /, -, x, -, /, 
+, 
x, x, +, x, 2.946767807006836, x, 
+, +, /, x, x, /, x, x, x, 
-, /, -4.824312925338745, +, -3.4913867712020874, x, x]

and the change was the subtree [ -, -, 1.788446307182312, x, -4.794104099273682] in generation 6 got replaced by the subtree [+, +, /, x, x, /, x, x, x] in generation 7. Note that much of the new subtree is actually a constant constructed from two instances of [/ x x] and simplifies to 2 + x, while the old subtree was a constant minus x: 6.582550406455994 - x.

The tree for generation 8 is

[/, /, -, x, -, /,                                                                                                   
/, -, x, x, -,                                                                                                       
x, x, +, x, 2.946767807006836, x, +, +, /, x, x, /, x, x, x, -, /, -4.824312925338745, +, -3.4913867712020874, x, x]

This crossover is really strange and I think very interesting and perhaps important. If you just "diff" the trees, it says that the operator + in generation 7 is replaced with /, -, x, x, - in generation 8. But that's impossible with subtree XO or mutation, so the only way that could happen is if we did XO that happened to replace a large chunk of the tree with the same thing, only changing a little bit at the top. This (a) seems rare/unlikely and (b) is effectively impossible with just mutation, because the odds of generating exactly the right tree are pretty close to nil.