Index building re design - AtlasOfLivingAustralia/bie-index Wiki

Motivation

At the moment, the bie-index imports a bunch of DwCA files and builds a taxonomic index by collecting the results together. Separately, the ala-name-matching system builds a similar index. The supplied DwCAs should not overlap, and their taxonIDs need to be resolved against each other; otherwise you get a bunch of separate taxonomic trees with different identifiers for the same name.

At the moment, this resolution gets done in the Talend OS jobs that collect and process the data. It's all a bit over-complicated, since Talend OS isn't suited to dealing with hierarchies of information, as the CAAB processing attests. Matching is done by attempting to match names/authors and is fraught with spelling problems, abbreviations and other sources of lunacy.

It also means that more detailed information about a taxon that could be supplied by lower-priority information sources, such as standard vernacular names or more accurate placement in the taxonomic tree, gets discarded.

It would be nice to have a process that:

In addition, with the introduction of New Zealand name lists, there's a regional component that needs to be addressed. Something excluded or misapplied in Australia might be perfectly legitimate in NZ. If there is more detailed distribution information, then handling state-based misapplications might also be possible.

With distribution information available, the index should also display whether something is Aus-only, NZ-only, both, neither, PNG-only, etc.
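As a rough illustration of the distribution display described above, here is a minimal Python sketch. The region codes (`AU`, `NZ`, `PNG`), the `PRIMARY` tuple and the `distribution_label` function are all hypothetical; this page does not define a region vocabulary or where the distribution data would come from.

```python
# Hypothetical region codes; the design does not yet define a vocabulary.
PRIMARY = ("AU", "NZ")


def distribution_label(regions):
    """Summarise the regions a taxon occurs in as a short display label."""
    present = {r.upper() for r in regions}
    if len(present) == 1:
        # A single region, e.g. "AU-only" or "PNG-only"
        return next(iter(present)) + "-only"
    if present == set(PRIMARY):
        return "both"
    if not present & set(PRIMARY):
        # Covers the empty case and taxa found only outside AU/NZ
        return "neither"
    # Mixed case, e.g. AU plus PNG: list the regions explicitly
    return ", ".join(sorted(present))


print(distribution_label({"AU", "NZ"}))
print(distribution_label({"PNG"}))
```

How mixed cases (for example, Australia plus PNG) should be labelled is an open question; the sketch simply lists the regions.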

Algorithm

This is more a set of principles at the moment. To be fleshed out.

Comments