Cypher query sheet - NicMcPhee/MICS-2014-GraphDB-EC GitHub Wiki

We should collect together queries that we've found useful (or found didn't work).

Indexing

Creating an index on generations

create index on :Individual(generation);

Creating an index on fitness

Unfortunately this doesn't seem to generate an index, and I'm not sure why. Maybe we can only index on discrete values? That's a bummer, because it would be nice to be able to index on fitnesses.

create index on :Individual(penalizedFitness);

What's really weird is that it looks like we can generate an index on non-penalized fitness, as this works:

create index on :Individual(fitness);

OK, some more digging suggests that it had automagically created an index on fitness. When I run

:schema ls

which lists all the available indices, it lists fitness. So it either created it "for free" when I started up Neo4J, or my attempt at creating it worked and didn't return something useful. Weird.

Winners and their ancestry

Finding the "winning" fitness

This finds the best value of penalizedFitness over the entire run. You could use fitness to get the unpenalized fitness.

start n=node(*) 
return min(n.penalizedFitness);

Find a winning individual

start n=node(*) 
return n 
order by n.penalizedFitness 
limit 1;

Finding all winning individuals

This finds all the individuals that have the best (minimal) (penalized)fitness.

match (n:Individual) with min(n.penalizedFitness) as mf match (m:Individual {penalizedFitness: mf}) return m;

Find the root ancestry branch from an individual

This finds the root ancestry path from individual 99000 to the initial generation.

start n=node(99000) 
match ps = (n)<-[r:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*]-(p) 
return distinct ps;

Get the fitnesses on (non)root ancestry branch from an Individual

This gives the sequence of all the fitnesses along the root ancestry branch from individual 99000 to the initial generation. Replacing ROOT_XOOF with NONROOT_XOOF does the same, but along the non-root ancestry branch.

start n=node(99000) 
match ps = (n)<-[:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*]-(p) where p.generation=1 
return extract(x in nodes(ps) | x.fitness);

Find all the copying events in the winning sequence

This finds the root parent path from node 99000 to the starting generation and then isolates (using filter) the edges that were either elitism or reproduction.

start n=node(99000) 
match ps = (n)<-[r:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*]-(p) 
return filter(x in relationships(ps) where type(x) = "ELITISM" OR type(x) = "PARENTOF");

Finding maximal paths of copying on winning sequence

This monster finds all the maximal sub-sequences of the root path from node 99000 that consist of entirely elitism or reproduction. The big MATCH clause finds:

A path from n to an anonymous node (say x) consisting of just root parent steps (elitism, reproduction, mutation, and root parent XO)
A single step from x to p consisting of either mutation or root parent XO (i.e., not elitism or reproduction).
A path from p to q of just elitism and reproduction – this is the path we're after.
A single step from q to some anonymous node (say y) consisting of either mutation or root parent XO (i.e., not elitism or reproduction).

The two single steps are there to "bookend" the desired path, ensuring that we have a maximal path of elitism or reproduction.

start n=node(99000) 
match (n)<-[:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*]-()<-[:MUTANTOF|ROOT_XOOF]-(p)<-[r:ELITISM|PARENTOF*]-(q)<-[:MUTANTOF|ROOT_XOOF]-() 
with p, q, length(r) as l, (p.generation+q.generation)/2.0 as g 
order by g 
return distinct p.fitness, l, q.generation, p.generation;

This generates a table such as

p.fitness           l   q.generation	p.generation
39.99795368655481	1	3	4
35.25790411796866	1	6	7
39.99795368655481	2	8	10
39.99795368655481	1	12	13
31.849136929437357	1	21	22
19.933053716734456	1	30	31
17.5540300257814	5	34	39
16.795738702936962	2	41	43
15.710449230613747	1	48	49
15.485009361166156	1	50	51
9.811316962843701	2	54	56
8.968491179434844	2	61	63
6.777071487628525	4	65	69
6.795367268011395	1	76	77
6.617312802151567	1	78	79
6.133201810958635	7	80	87
6.160256037512394	4	88	92
6.146584251064052	1	93	94

The third row from the bottom tells us that there is a chain of 8 generations (from 80 to 87) where every individual had fitness 6.133 and that the individual in generation 80 was the ancestor of each of the other 7 via some combination of elitism and reproduction, i.e., those 7 were all exact copies of the generation 80 individual.

My hypothesis is that the the longest sequences will tend to be towards the end of a run, but I haven't really tested that. That's sorta kinda true if you squint in this example, but even that's kinda fuzzy, and I suspect the behavior might be quite different across multiple runs.

Finding the common shared ancestor

This is quite fast and illustrates how we probably want to be using collections.

Finds all ancestors in generation 52 (100-48) of any individual in the final generation.

match (n:Individual {generation: 100}) 
match ps = (p)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*..48]->(n)
with collect(ps) as paths, max(length(ps)) as maxLength
with filter(path in paths where length(path)=maxLength) as longest
return distinct [ pt in longest | nodes(pt)[0] ];

Finds the set of all common ancestor nodes

match (n:Individual {generation: 100}) 
match ps = (p)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*]->(n)
with collect(ps) as paths, max(length(ps)) as maxLength
with filter(path in paths where length(path)=maxLength) as longest
return reduce(s = nodes(longest[0]), pt in longest | filter(nd in nodes(pt) where nd in s));

Finds the most recent common ancestor.

match (n:Individual {generation: 100}) 
match ps = (p)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*..100]->(n)
with collect(ps) as paths, max(length(ps)) as maxLength
with filter(path in paths where length(path)=maxLength) as longest
with reduce(s = nodes(longest[0]), pt in longest | filter(nd in nodes(pt) where nd in s)) as pre
return last(pre);

This is an alternative to the previous query that I think will work (but I haven't tested it yet) and might be slightly more efficient? - Nic

match (n:Individual {generation: 100}) 
match (p:Individual {generation: 1}) 
match ps = (p)-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*]->(n) 
with collect(ps) as longest 
with reduce(s = nodes(longest[0]), pt in longest | filter(nd in nodes(pt) where nd in s)) as pre 
return last(pre);

Impact of root parents vs. non-root parents

How often are root parents more fit than non-root parents?

This finds all the nodes on the root parent path to a winner, and then for each Individual on that chain that is the result of XO, computes the difference between the root parent's fitness and the non-root parent's fitness, returning the generation and this difference.

start n=node(99000) 
match (n)<-[:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*]-(p)
match (p)<-[:ROOT_XOOF]-(r)
match (p)<-[:NONROOT_XOOF]-(s)
with p, r, s, s.penalizedFitness - r.penalizedFitness as diff
return p.generation, diff order by diff;

This returns something like:

2	-828.2125358185428
17	-6.442315176347453
24	-4.866373746564939
14	-3.44373890228956
15	-1.1434684157875452
12	-0.6294014981797531
74	-0.18160621232085106
11	-0.09999999999999432
5	0.240000000000002
33	0.42316293097401925
29	1.1976680139671814
76	1.5771595918735866
73	2.0053961671315577
93	2.2963387877962216
47	2.3330829124719656
32	2.367694487543506
78	2.451013321800727
95	2.536112724033611
65	3.0982596926559953
64	3.290127066211806
26	3.7418479997854064
70	4.360405872449269
8	4.560049568586152
46	4.592193210704526
57	5.039008891081306
20	5.163206708353698
52	5.29135735087857
50	5.481737474259397
44	5.714852689575078
45	5.794053970599624
30	5.991421851057115
59	6.179128106675442
75	6.804410617916949
48	7.088313174438644
25	7.135902914044095
40	7.155726723126623
18	7.314053319398269
60	7.327222529409134
53	7.81161863532895
34	8.084012483892337
19	8.308816757117448
58	8.759843646057597
54	10.066608209943368
61	10.30189177816499
16	11.393364815677668
27	14.458555832046073
80	17.898070226432402
21	18.817497028470385
71	23.05633432861173
3	23.109040805748045
88	40.58281084808121
72	70.9177598905392
6	77.99318948920876
41	80.2531035412967

It's clear that it's almost always the case that the root parent was more fit than the non-root parent, and the exceptions tended to be early. (It also turns out that the XO point for those early exception tended to be high in the tree; it's not hard to extend this query to include that info.)

This variant of actually graphs that path, with the XO events included:

start n=node(99000) 
match ps = (n)<-[:ELITISM|PARENTOF|ROOT_XOOF|MUTANTOF*..25]-(p)
match (p)<-[:ROOT_XOOF]-(r)
match (p)<-[:NONROOT_XOOF]-(s)
with ps, p, r, s, s.penalizedFitness - r.penalizedFitness as diff
return ps, p, r, s, diff order by diff;

Who has descendants?

Finding initial individuals with root descendants at end

match (n:Individual {generation: 100})                      
match ps = ()-[:ELITISM|PARENTOF|MUTANTOF|ROOT_XOOF*99]->(n)
with collect(distinct nodes(ps)[0]) as starts               
return starts;

This works nicely, even on the 10K runs (finishing in under 2 secs). The 100 on the first line and the 99 on the second line need to be the number of the last generation and one less than that (i.e., the number of edges on a path from the first generation to the last).