Error checking in KeyBase - AtlasOfLivingAustralia/ala-keys-ui GitHub Wiki

There are various types of errors that may occur in key files that people try to upload into KeyBase. As these errors may prevent the key from working and in the severest case might break the application, they need to be detected before a key is uploaded.

In what follows, the key to upload is assumed to be a delimited text file with three columns: the first column is the couplet number or from-node, the second the lead-text and the third the to-node. The to-node can be either another couplet in the key or a keyed-out item. The keyed-out item can be either a taxon or a sub-key. In SDD (Structured Descriptive Data), the other format we would like KeyBase to be able to deal with, the from-node would be the Parent, the lead-text the Statement and the to-node either the Taxon (always, in KeyBase exports at the moment) or the SubKey element.
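As an illustration of the three-column layout, here is a minimal Python sketch that reads such a file into a list of leads. The `Lead` type, the `parse_key` function and the tab delimiter are assumptions made for this example; they are not KeyBase's actual upload code.

```python
import csv
import io
from typing import NamedTuple


class Lead(NamedTuple):
    from_node: str  # couplet number
    lead_text: str  # the statement of the lead
    to_node: str    # next couplet, taxon or sub-key


def parse_key(text: str, delimiter: str = "\t") -> list[Lead]:
    """Read the first three columns of every non-blank row into a Lead."""
    leads = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        cells = [cell.strip() for cell in row[:3]]
        cells += [""] * (3 - len(cells))  # pad short rows; they are checked later
        leads.append(Lead(*cells))
    return leads


# Made-up sample key, for illustration only.
sample = (
    "1\tLeaves opposite\t2\n"
    "1\tLeaves alternate\tAcacia\n"
    "2\tFlowers white\tEucalyptus\n"
    "2\tFlowers red\tCorymbia\n"
)
leads = parse_key(sample)
```

Short rows are padded rather than rejected here, so that a separate check can report them with a proper message.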

In KeyBase, keys are treated as directed rooted trees, with the nodes as the vertices and the leads as the edges, so a lead (edge) connects one node (vertex) to another. There are two types of nodes: terminal nodes ('leaves' in graph terms), which have a lead leading to them but no leads going from them to another node, and internal nodes, which have both a lead leading to them and two or more leads going from them. The root node is an internal node; the lead leading to it is the root. This terminology is used below to describe the errors.
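The terminology can be made concrete with a small sketch. It assumes that leads are `(from_node, lead_text, to_node)` tuples and that the root node is the from-node of the first lead in the file; both are assumptions for illustration, as is the function name `classify_nodes`.

```python
def classify_nodes(leads):
    """Split the nodes of a key into root, internal nodes and terminal nodes."""
    from_nodes = [f for f, _, _ in leads]
    to_nodes = [t for _, _, t in leads]
    root = from_nodes[0]                 # assumption: the first couplet is the root node
    internal = set(from_nodes)           # nodes with leads going from them
    terminal = set(to_nodes) - internal  # nodes with no leads going from them
    return root, internal, terminal


leads = [
    ("1", "Leaves opposite", "2"),
    ("1", "Leaves alternate", "Acacia"),
    ("2", "Flowers white", "Eucalyptus"),
    ("2", "Flowers red", "Corymbia"),
]
root, internal, terminal = classify_nodes(leads)
```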

The errors have been divided into two classes: (fatal) errors and warnings. Fatal errors will break the key and may cause problems in the application, for example an infinite loop when uploading the key. If KeyBase detects a fatal error in a key, it will not let you upload the key. Errors for which KeyBase issues warnings will not break a key. Warnings will also be issued for things that may not be errors at all, such as polytomies and phrase names (which KeyBase won't recognise as taxon names, and hence not as terminal nodes, and for which it will issue a 'possible dead end' warning).

KeyBase checks for the following types of errors in each lead:

###Too few columns###

From-node, lead-text and to-node are all required, and the absence of the from-node or to-node in particular will certainly break a key, so any empty cell in the first three columns will result in a fatal error.
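A minimal sketch of this check in Python, assuming each row has already been split into a list of cells. The function name `rows_with_too_few_columns` is made up for this example.

```python
def rows_with_too_few_columns(rows):
    """Return the 1-based numbers of rows with a missing or empty cell
    in any of the first three columns."""
    bad = []
    for line_no, row in enumerate(rows, start=1):
        cells = [cell.strip() for cell in row[:3]]
        if len(cells) < 3 or "" in cells:
            bad.append(line_no)
    return bad


rows = [
    ["1", "Leaves opposite", "2"],
    ["1", "", "Acacia"],       # empty lead-text
    ["2", "Flowers white"],    # to-node missing altogether
]
bad_rows = rows_with_too_few_columns(rows)
```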

###Singleton leads###

I don't think that "couplets" with single leads will break a key, but they have no place in a key and are always unintentional, so KeyBase will issue a 'fatal error' message anyway.

###Polytomies###

Polytomies are "couplets" with more than two alternatives (leads). They are not exactly best practice, but not necessarily wrong, so only a warning message is issued.
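Singleton leads and polytomies can both be found by counting the leads per from-node. The sketch below assumes leads are `(from_node, lead_text, to_node)` tuples; `check_couplet_sizes` is a hypothetical name, not a KeyBase function.

```python
from collections import Counter


def check_couplet_sizes(leads):
    """Return (singletons, polytomies): from-nodes with exactly one lead
    (an error) and from-nodes with more than two leads (a warning)."""
    counts = Counter(f for f, _, _ in leads)
    singletons = sorted(node for node, n in counts.items() if n == 1)
    polytomies = sorted(node for node, n in counts.items() if n > 2)
    return singletons, polytomies


leads = [
    ("1", "a", "2"), ("1", "b", "A"),
    ("2", "c", "B"),                                     # singleton
    ("3", "d", "C"), ("3", "e", "D"), ("3", "f", "E"),   # polytomy
]
singletons, polytomies = check_couplet_sizes(leads)
```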

###Orphan nodes###

Orphan nodes are from-nodes that are not among the to-nodes (and are not the root node). These nodes and all further nodes they lead to cannot be keyed out. Because KeyBase traverses the key when uploading it, these parts of the key cannot be uploaded either; hence orphan nodes result in an error.
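The definition translates almost directly into code. This sketch assumes that the root node is the from-node of the first lead and that leads are `(from_node, lead_text, to_node)` tuples; `find_orphans` is a name invented for the example.

```python
def find_orphans(leads):
    """Return from-nodes (other than the root) that no lead points to."""
    root = leads[0][0]  # assumption: the first couplet is the root node
    to_nodes = {t for _, _, t in leads}
    orphans = []
    for from_node, _, _ in leads:
        if from_node != root and from_node not in to_nodes and from_node not in orphans:
            orphans.append(from_node)
    return orphans


leads = [
    ("1", "a", "2"), ("1", "b", "A"),
    ("2", "c", "B"), ("2", "d", "C"),
    ("5", "e", "D"), ("5", "f", "E"),  # nothing leads to couplet 5
]
orphans = find_orphans(leads)
```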

###Loops###

Loops are cycles in the graph that arise when a lead leads to a node that is already on the path that has been traversed (starting with the root node, which is the first node on the path). Loops are the worst kind of error, as they throw the upload script into an infinite loop. They are also the hardest to detect, as KeyBase has to remember the path that has been traversed in order to know that the path is looping back on itself.

A special case, where the key loops back to the root node (or rather the root), arises when a key to subordinate taxa keys out the taxon (to which the subordinate taxa are subordinate) itself. This doesn't cause a problem for the upload script, and KeyBase doesn't currently test for it, but it does cause problems later in the application when you try to make a bread crumb trail or a hierarchy of trees. It can happen (quite easily) when the wrong key is uploaded. We have had it in the Flora of Victoria project, when the Key to the genera of Cunoniaceae actually contained the Key to the families of Dicotyledons (of which Cunoniaceae is one). This broke the Flora of Victoria project because most taxa are "dicots" and Cunoniaceae comes before Dicotyledons (when ordered alphabetically), causing an infinite loop when trying to create the bread crumb trail (which was not quite as easy to troubleshoot as it sounds). A similar scenario will break local filters and (maybe) the tree hierarchy as well, so this test should be added to KeyBase.
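Both situations can be sketched in Python. The first function is a depth-first traversal that remembers the nodes on the current path and reports any lead pointing back into that path; it is written iteratively so that deep keys do not hit Python's recursion limit. The second function is one possible form of the test proposed above. The function names, the tuple layout of the leads and the choice of the first from-node as root are all assumptions for illustration, not KeyBase's implementation.

```python
def find_loops(leads):
    """Return (from_node, to_node) pairs for leads that point back to a
    node on the path currently being traversed."""
    children = {}
    for f, _, t in leads:
        children.setdefault(f, []).append(t)

    root = leads[0][0]  # assumption: the first couplet is the root node
    on_path = {root}    # nodes between the root and the current node
    done = set()        # nodes whose subtree has been explored completely
    loops = []
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        if child is None:            # all leads of this node handled
            stack.pop()
            on_path.discard(node)
            done.add(node)
        elif child in on_path:       # lead points back into the path
            loops.append((node, child))
        elif child not in done:      # skip subtrees already explored
            on_path.add(child)
            stack.append((child, iter(children.get(child, []))))
    return loops


def keys_out_own_taxon(leads, key_taxon):
    """True if the taxon the key is for is among its keyed-out items."""
    wanted = key_taxon.strip().lower()
    return any(t.strip().lower() == wanted for _, _, t in leads)


looping = [
    ("1", "a", "2"), ("1", "b", "X"),
    ("2", "c", "3"), ("2", "d", "Y"),
    ("3", "e", "1"), ("3", "f", "Z"),  # couplet 3 leads back to couplet 1
]
print(find_loops(looping))  # [('3', '1')]
```

Keeping a separate `done` set means that a node reached a second time through a different path (a reticulation) is neither reported as a loop nor traversed again.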

###Reticulations###

Reticulations are cycles in the graph that are not loops. They occur when multiple leads lead to the same (internal) node, typically when multiple trees have been merged into one. Reticulations (that are not loops) are not necessarily a problem, so they only produce a warning, but KeyBase deals with reticulations by repeating the subtree after the reticulation (thereby turning the graph into a tree again). Keys with many reticulations may therefore become very large and may lead to scripts running out of time or memory (that has only happened for a single key so far, and that was a shocker). Keys with ironed-out reticulations also become harder to maintain, as every edit has to be made multiple times.
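A simple way to spot reticulation candidates is to count the leads coming into each internal node. The sketch below assumes `(from_node, lead_text, to_node)` tuples and uses a made-up function name; note that it does not tell reticulations apart from loops, so it would have to run after (or alongside) a loop check.

```python
from collections import Counter


def find_reticulations(leads):
    """Return internal nodes that more than one lead points to."""
    internal = {f for f, _, _ in leads}
    incoming = Counter(t for _, _, t in leads if t in internal)
    return sorted(node for node, n in incoming.items() if n > 1)


leads = [
    ("1", "a", "2"), ("1", "b", "3"),
    ("2", "c", "4"), ("2", "d", "A"),
    ("3", "e", "4"), ("3", "f", "B"),  # couplets 2 and 3 both lead to 4
    ("4", "g", "C"), ("4", "h", "D"),
]
reticulations = find_reticulations(leads)
```

Terminal nodes are left out of the count on purpose: the same taxon keying out in several places is normal in a key.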

Probably better than turning reticulations from errors into warnings is to have KeyBase deal with reticulations in a different way. SDD has a Reticulation element, treating reticulations essentially as sub-keys (for which it has the Subkey element), the difference being that a sub-key points to (the root node of) a different key and a reticulation to an internal node in (mostly) the same tree. We have to see if we can implement something similar in KeyBase (always use Subkey, I reckon).

###Dead ends and possible dead ends###

Dead ends are to-nodes that are neither terminal nodes nor internal nodes. KeyBase does two tests. First, if the to-node is among the from-nodes, it is an internal node. If (the value of) a to-node that fails the internal-node test is numerical, KeyBase considers it a 'dead end', which is an error. Second, if the to-node contains text, KeyBase uses a regular expression to test whether it may be a taxon name or the name of a sub-key (starting with 'Group'), and hence a terminal node. If the to-node fails this test, it is considered a 'possible dead end'. Possible dead ends only generate a warning, as they may be bona fide end nodes that have not been recognised as such, either because they are junk (incl. 'phrase') names or because my regular expressions are not that good.
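The decision sequence can be sketched as follows. The regular expression here is a deliberately simple stand-in written for this example, not the one KeyBase uses, and `classify_to_node` is a hypothetical name; "numerical" is taken to mean "consists of digits only".

```python
import re

# Hypothetical pattern, NOT KeyBase's actual regular expression: either a
# sub-key name starting with "Group", or a capitalised genus-like word
# optionally followed by lower-case epithets and rank abbreviations.
TERMINAL_RE = re.compile(r"^(Group\b.*|[A-Z][a-z]+( [a-z.\-]+)*)$")


def classify_to_node(to_node, from_nodes):
    """Classify a to-node as internal, terminal, dead end or possible dead end."""
    if to_node in from_nodes:
        return "internal"
    if to_node.isdigit():
        return "dead end"           # fatal error: couplet does not exist
    if TERMINAL_RE.match(to_node):
        return "terminal"
    return "possible dead end"      # warning only


from_nodes = {"1", "2"}
for value in ["2", "7", "Acacia dealbata", "Group 3",
              "Acacia sp. Bendoc (A.B. Smith 123)"]:
    print(value, "->", classify_to_node(value, from_nodes))
```

With this stand-in pattern the made-up phrase name in the last example fails the terminal-node test and ends up as a 'possible dead end', which is the behaviour described above for phrase names.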

###Will not key out###

To-nodes that will not key out are the result of orphan nodes somewhere along the path between them and the root of the key.
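One way to list these to-nodes is to collect everything that can be reached from the root and report the rest. This is a sketch under the same assumptions as before (leads as `(from_node, lead_text, to_node)` tuples, root node taken from the first lead); the function name is invented for the example.

```python
def to_nodes_that_will_not_key_out(leads):
    """Return to-nodes that cannot be reached from the root node."""
    children = {}
    for f, _, t in leads:
        children.setdefault(f, []).append(t)

    root = leads[0][0]  # assumption: the first couplet is the root node
    reachable = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in reachable:  # also guards against loops
                reachable.add(child)
                stack.append(child)
    return sorted({t for _, _, t in leads} - reachable)


leads = [
    ("1", "a", "2"), ("1", "b", "A"),
    ("2", "c", "B"), ("2", "d", "C"),
    ("5", "e", "D"), ("5", "f", "6"),  # couplet 5 is an orphan node
    ("6", "g", "E"), ("6", "h", "F"),
]
print(to_nodes_that_will_not_key_out(leads))  # ['6', 'D', 'E', 'F']
```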