Name Matching Algorithm Re-design - AtlasOfLivingAustralia/ala-name-matching GitHub Wiki

Motivation

The name matching algorithm is a rather complex collection of hard-coded rules attempting to handle the various lunacies that get thrown at it.

Rather than attempt to expand it and add to its complexity, a re-think is, possibly, in order.

The available information for name matching consists of one or more of the following:

- A scientific name, possibly partial or aggregate
- An author
- A vernacular name
- Higher-order taxonomic information, such as kingdom, phylum, class, etc.
- A rank
- Location information
- Date information
- A taxonID, scientificNameID or taxonConceptID, potentially from a different namespace. These provide hints as to previous matches.
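Since any subset of these items may be present, a query object with all-optional fields is one way to carry them. The following is a minimal sketch only: `NameQuery` and its methods are hypothetical names, not part of the existing library, and keying the values by Darwin Core style term names is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical container for whatever a caller knows about a name; every term is optional. */
final class NameQuery {
    private final Map<String, String> terms = new LinkedHashMap<>();

    /** Records a term such as "scientificName" or "kingdom"; null or blank values are ignored. */
    NameQuery with(String term, String value) {
        if (value != null && !value.trim().isEmpty()) {
            terms.put(term, value.trim());
        }
        return this;
    }

    /** Returns the supplied value for a term, or empty if the caller did not provide it. */
    Optional<String> get(String term) {
        return Optional.ofNullable(terms.get(term));
    }

    /** True when at least one piece of information was supplied. */
    boolean isUsable() {
        return !terms.isEmpty();
    }
}
```

A caller would chain whatever it has, for example `new NameQuery().with("scientificName", name).with("kingdom", kingdom)`, and the matcher can then ask for each term without caring which ones are missing.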

- The matched name represents the lowest-ranked taxon that is most compatible with all the information available
  - Or, if you prefer, the lowest-ranked taxon that does not conflict with the information available
  - Exactly how we handle contradictory information is yet to be developed
- Use the entirety of the information available at all times. This includes higher-order taxonomic information.
- Allow spatial information to be used for things like excluded and misapplied names, as well as sanity checking
- Allow date information to be used to match names that existed at the time. This includes things like resolving parent-child synonyms where the original species has been moved to be a subspecies.
- Allow homonym resolution
- Allow old IDs to be mapped onto new IDs
- Allow synonym resolution. This includes resolution of annoying pro-parte synonyms to the least upper bound.
- Handle spelling/orthographic variations gracefully, including switches between Latin genders
- Handle rancid garbage such as aff., cf., sp. and voucher names
- Handle author abbreviations
- Things that don't fit into the general flow of the algorithm should be rules-based and driven by an engine, rather than hard-coded
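To make one of these requirements concrete: a pro-parte synonym points at several accepted taxa, and resolving it to the least upper bound means finding the lowest taxon that sits above all of them. The sketch below is an illustration under stated assumptions, not the library's implementation. `LeastUpperBound` is a hypothetical name, and the taxonomy is assumed to be available as an acyclic child-to-parent map of taxon identifiers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: lowest common ancestor of the accepted taxa of a pro-parte synonym. */
final class LeastUpperBound {
    private LeastUpperBound() {}

    /** Path from the root down to the taxon; assumes the child -> parent map has no cycles. */
    static List<String> lineage(String taxon, Map<String, String> parentOf) {
        List<String> path = new ArrayList<>();
        for (String t = taxon; t != null; t = parentOf.get(t)) {
            path.add(t);
        }
        Collections.reverse(path);
        return path;
    }

    /** Lowest taxon that is an ancestor of, or equal to, every accepted taxon; null if none. */
    static String of(List<String> accepted, Map<String, String> parentOf) {
        if (accepted.isEmpty()) {
            return null;
        }
        List<String> common = lineage(accepted.get(0), parentOf);
        for (String taxon : accepted.subList(1, accepted.size())) {
            List<String> other = lineage(taxon, parentOf);
            int n = 0;
            // Keep only the shared prefix of the two root-to-taxon paths.
            while (n < common.size() && n < other.size() && common.get(n).equals(other.get(n))) {
                n++;
            }
            common = common.subList(0, n);
        }
        return common.isEmpty() ? null : common.get(common.size() - 1);
    }
}
```

With this shape, two accepted species in the same genus resolve to that genus, while accepted species in different genera of one family resolve to the family.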

The source name index should be assembled from multiple sources and accumulate information.
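As a rough illustration of what accumulating could look like, the sketch below merges records for the same name as each source is loaded. `AccumulatingIndex` is a hypothetical class, and the merge policy (the first source to supply a field wins, later sources only fill gaps) is purely an assumption; the real precedence rules between sources are not decided here.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical index that accumulates fields per scientific name across several sources. */
final class AccumulatingIndex {
    private final Map<String, Map<String, String>> entries = new LinkedHashMap<>();

    /** Adds one source record; fields already known are kept, unseen fields are added. */
    void add(String scientificName, Map<String, String> fields) {
        Map<String, String> entry =
                entries.computeIfAbsent(scientificName, k -> new LinkedHashMap<>());
        // Assumed policy: earlier sources take precedence over later ones.
        fields.forEach(entry::putIfAbsent);
    }

    /** Returns everything accumulated for the name, or an empty map if it is unknown. */
    Map<String, String> lookup(String scientificName) {
        return entries.getOrDefault(scientificName, Collections.emptyMap());
    }
}
```

Loading sources in priority order then gives each entry the most trusted value for every field, plus anything extra that only a lower-priority source knows.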

This is pretty much required to be a vanilla Java library, so that it can be embedded in anything that needs name matching.
A new API is probably in order, allowing more information to be supplied. The old API needs to be kept for backwards compatibility.
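One way to keep the old API while adding a richer one is to have the new interface extend the old and forward the old call. This is a sketch only: `LegacyNameMatcher` and `NameMatcher` are hypothetical stand-ins, and the real existing API has different names and signatures that are not reproduced here.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical stand-in for the existing API: a scientific name in, a taxon identifier out. */
interface LegacyNameMatcher {
    String match(String scientificName);
}

/** Hypothetical richer API: any combination of known terms in, a taxon identifier out. */
interface NameMatcher extends LegacyNameMatcher {
    String match(Map<String, String> terms);

    /** The old entry point is kept and simply forwards to the new one. */
    @Override
    default String match(String scientificName) {
        return match(Collections.singletonMap("scientificName", scientificName));
    }
}
```

Existing callers that only know about the old interface keep compiling and running, while new callers can pass author, rank, location and the rest through the map-based method.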