Index building re design - AtlasOfLivingAustralia/bie-index GitHub Wiki

Motivation

At the moment, the bie-index imports a bunch of DwCA files and builds a taxonomic index by collecting the results together. Separately, the ala-name-matching system builds a similar index. The DwCAs supplied should have no overlap and need to have taxonIDs resolved, since otherwise you get a bunch of separate taxonomic trees with different identifiers for the same name.

This gets done in the Talend OS jobs that collect and process the data at the moment. It's all a bit over-complicated, since Talend OS isn't suited to dealing with hierarchies of information, as the CAAB processing attests. Matching is done by attempting to match names/authors and is fraught with spelling problems, abbreviations and other sources of lunacy.

It also means that more detailed information about a taxon that could be supplied by lower-priority information sources, such as standard vernacular names or more accurate placement in the taxonomic tree, gets discarded.

It would be nice to have a process that:

Produces an index suitable for both the bie-index and ala-name-matching so that we can build the thing once and have it consistent.
Accepts a collection of complete DwC-As from different sources and resolves them.
Collates all the information available from all the sources and annotates a single taxon with things like additional spellings, author variants, vernacular names, identifiers from other sources, establishment means, etc.
Produces an accurate homonym index that reflects the data available.

In addition, with the introduction of New Zealand name lists, there's a regional component that needs to be addressed. Something excluded or misapplied in Australia might be perfectly legitimate in NZ. If there is more detailed distribution information, then handling state-based misapplications might also be possible.

Given distribution information, we should also display whether something is Aus-only, NZ-only, both, neither, PNG-only, etc.

Algorithm

This is more a set of principles at the moment. To be fleshed out.

Taxa are identified by canonical name and their position in the Linnaean hierarchy. Basically, if kingdom, phylum, class, etc. match and something has the same name, then they're the same thing.
- Authors can be ignored for the most part. This will need to be revisited if there's been a revision and a taxon has had a minor movement up or down the hierarchy not visible in the major ranks.
- Multiple name representations (eg. whether there's a subgenus included or some sort of rank marker) accumulate at the taxon position. So something might have multiple name entries containing things like variations in author spelling etc.
- Multiple identifiers, multiple vernacular names, multiple references, etc. also all accumulate on the taxon.
- A source may end up providing more accurate placement of a taxon.
Once all the accumulation is done, a single preferred authority is chosen to supply the name and taxonID for each taxon.
Different sources are not equal. There's usually a preferred source for a region of the taxonomic tree. For example, AusFungi may take precedence over APNI for fungi-related matters.
Synonyms are then mapped onto the taxon objects.
- It may be the case that sources are in disagreement about whether something is an accepted taxon or a synonym. Resolution based on authority order may be required. This may create the situation where a parent taxon has been removed from the taxonomic tree, which will need some sort of further resolution.
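The principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bie-index or ala-name-matching implementation: the record layout, the `MAJOR_RANKS` list, the `PRIORITY` table, the helper names and the taxonIDs are all hypothetical. It shows taxa being keyed on canonical name plus position in the major ranks, information from every source accumulating on the taxon, and a preferred authority being picked per region of the tree.

```python
from dataclasses import dataclass, field

# Hypothetical set of major ranks used to build the identity key.
MAJOR_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical source priorities per kingdom; "*" is the fallback ordering.
PRIORITY = {"fungi": ["AusFungi", "APNI"], "*": ["APNI", "AusFungi"]}


@dataclass
class Taxon:
    key: tuple
    names: dict = field(default_factory=dict)        # source -> [(taxonID, full name)]
    identifiers: set = field(default_factory=set)    # every taxonID seen for this taxon
    vernacular: set = field(default_factory=set)     # every vernacular name seen


def taxon_key(record):
    """Canonical name plus position in the major ranks; authors are ignored."""
    ranks = tuple(record.get(rank, "").lower() for rank in MAJOR_RANKS)
    return ranks + (record["canonicalName"].lower(),)


def accumulate(records):
    """Collect names, identifiers and vernacular names from all sources per taxon."""
    taxa = {}
    for rec in records:
        key = taxon_key(rec)
        taxon = taxa.setdefault(key, Taxon(key))
        taxon.names.setdefault(rec["source"], []).append(
            (rec["taxonID"], rec["scientificName"])
        )
        taxon.identifiers.add(rec["taxonID"])
        taxon.vernacular.update(rec.get("vernacularNames", []))
    return taxa


def choose_preferred(taxon):
    """Return (taxonID, name) from the highest-priority source that has the taxon."""
    kingdom = taxon.key[0]
    for source in PRIORITY.get(kingdom, PRIORITY["*"]):
        if source in taxon.names:
            return taxon.names[source][0]
    raise LookupError("no prioritised source supplies this taxon")


# Made-up records describing the same taxon with different author spellings.
CLASSIFICATION = {
    "kingdom": "Fungi", "phylum": "Basidiomycota", "class": "Agaricomycetes",
    "order": "Agaricales", "family": "Amanitaceae", "genus": "Amanita",
}
RECORDS = [
    dict(CLASSIFICATION, source="APNI", taxonID="apni:1",
         canonicalName="Amanita muscaria",
         scientificName="Amanita muscaria (L.) Lam."),
    dict(CLASSIFICATION, source="AusFungi", taxonID="ausfungi:9",
         canonicalName="Amanita muscaria",
         scientificName="Amanita muscaria (L. : Fr.) Lam.",
         vernacularNames=["Fly Agaric"]),
]

taxa = accumulate(RECORDS)
```

Here both records collapse onto one taxon, both identifiers and the vernacular name are retained, and `choose_preferred` selects the AusFungi entry because the taxon sits under Fungi. Disagreements about accepted versus synonym status, and minor moves between the major ranks, are not handled by this sketch.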

Comments

It would be good to investigate how GBIF handles this issue and try to keep our approach similar, where possible. NdR.
- Dave has pointed me towards https://github.com/gbif/checklistbank DP
I'm wondering how 2 similar but slightly different taxonomies would be represented? Would both exist in their imported structure but then simply have relationship links to the other taxonomies at the species (or lower) level or would the preferred taxonomy be in there and the lesser taxonomy only have the end branches (species, etc.) linked to the ends of the preferred taxonomy? In other words could I pull out the lesser taxonomy, including all higher taxa from the system? Or could I only get the version associated with the preferred taxonomy? NdR.
- My thoughts were that you would get the version associated with the preferred taxonomy, with the concepts in the preferred taxonomy annotated with additional information about what other sources have to say about them. Preferred is a complicated concept, however, since eg. AusFungi seems to have a much better idea of where fungi should go than APNI, where some fungi have crept in, even though APNI is supposed to be the be-all and end-all. DP
The two indexes currently have different use cases:
- Name matching index: Here's a classification, give me the best single match
- BIE: Here's a string, search across anything in the ALA (taxa, layers, regions, place names, data set names, collections, wordpress pages) and allow faceting on the results. The differences are at the built index level and the API. That's not to say they couldn't be merged, but the effort required for merging may not be worth it. DM
The name matching index supports a nested set structure, which the BIE index does not. This allows for searching across the occurrence records via any taxonomic rank. Maintaining/building the nested set is a pain as it means anything that is added to the classification necessitates a full rebuild of the index, and a reprocess in the Biocache. DM
My thoughts were that the process of resolution would necessarily create something that looks like the name matching index, so we might as well use it. If I re-process the source data at all, I rebuild the bie-index from scratch, since stuff may have shifted around significantly during the Talend OS resolution. DP
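For readers unfamiliar with the nested set structure mentioned in the comments, here is a generic sketch of the technique in Python. It is not taken from ala-name-matching; the function names and the tiny example tree are assumptions made for illustration. Each node gets a (left, right) pair from a depth-first walk, so all descendants of a node are exactly those whose pair lies strictly inside the node's pair.

```python
def build_nested_set(children, root):
    """Assign (left, right) numbers to every node by depth-first traversal."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def descendants(bounds, node):
    """Everything below `node`: pairs strictly inside the node's own pair."""
    left, right = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)


# Hypothetical miniature classification: parent -> list of children.
CHILDREN = {
    "Fungi": ["Basidiomycota", "Ascomycota"],
    "Basidiomycota": ["Amanita"],
    "Amanita": ["Amanita muscaria"],
}

bounds = build_nested_set(CHILDREN, "Fungi")
```

With this tree, `Ascomycota` is numbered (8, 9). Adding a second species under `Amanita` and renumbering moves it to (10, 11), even though nothing about `Ascomycota` itself changed. That renumbering of unrelated nodes is why an addition to the classification forces a full rebuild.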