Name Matching - AtlasOfLivingAustralia/profile-hub GitHub Wiki

Name matching is a complicated process. It is triggered under three scenarios:

  1. When creating a new profile
  2. When editing the name of an existing profile
  3. During a bulk import of an existing dataset

The first two scenarios behave in exactly the same manner. The third differs because it is a fully automatic process with no human involvement.

There are two sources of name matching:

  1. The ALA Name Index
  2. The National Species List (NSL)

Matching ALA Names

Names are matched against the ALA Name Index for two purposes:

  1. The ALA Name Index caters for complex matching rules which are not currently supported by the NSL.
  2. Matching to an ALA name allows the profile to access information from other ALA systems, such as images and occurrence information.

The ultimate source of names in the ALA Name Index is the NSL. Therefore, barring synchronisation delays, there should be no discrepancies between ALA and NSL's APC name lists.

Matching NSL Names

Names are matched against the NSL to allow access to the protolog for the name (the first recorded usage of the name), and the APNI name concepts (which the user can select from on the Profile Edit page).

Names are matched to the NSL using the acceptable-name service.

Name matching when creating/editing a profile

  1. The user enters the name, with or without the authority (e.g. "Acacia dealbata" or "Acacia dealbata Link")
    1. The user clicks the Check Name button
  2. The system attempts to match the name using the ALA Name Matching API
    1. This API follows complex matching rules, including following synonyms, looking at different levels of the taxonomic hierarchy, and so on. The details of this process are not covered here.
    2. If there is a single exact match, then the system will present the user with options to
      1. Proceed, whereby the profile will be matched to the name, and the profile name and the matched name will be the same; or
      2. Manually select a matching name from the ALA, whereby the profile name will be different to the matched name
    3. If there is a single non-exact match, then the system will present the user with options to
      1. Use the matched name instead, which will change the profile name to the matched name; or
      2. Proceed with the name as they entered it, whereby the profile name will be different to the matched name; or
      3. Manually select a matching name, whereby the profile name will be different to the matched name
    4. If there is no matching name, or more than one matching name, then the system will present the user with options to
      1. Proceed with the name as they entered it, whereby the profile will NOT be matched to any name; or
      2. Manually select a matching name, whereby the profile name will be different to the matched name
  3. Once the first step is complete, the system will attempt to match the name against the NSL using the acceptable-name service
    1. If there is a single match, the profile will be automatically matched to that NSL name
      1. If the previous step, for any reason, did not identify the name authority, and the matched NSL name includes the authority, then the profile will use the authority from the NSL. This should rarely occur.
      2. The protolog of the NSL name will be retrieved from the NSL and will be displayed under the name on the profile pages when the profile loads.
    2. If there is no match, or multiple matches, then the profile will not be matched to any NSL name
  4. In cases where there is no matched name, the system will also present the user with the ability to specify the taxonomic hierarchy for the profile. For each rank within the hierarchy, the user can enter a name or select an existing name or profile. The editor can define as much of the tree as they desire, up to the point where they select a recognised name/profile: at that point, the new profile's hierarchy will merge into the known hierarchy. It is not possible to change the hierarchy of a known name. For example, an editor could create the profile for a new subspecies and define the parent as the known species name, or they could create a profile for a new species in a new genus and join the existing hierarchy by selecting a known family as the parent of the genus.
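
The decision rules in steps 2 and 3 above can be sketched in a few lines of Python. This is only an illustration, not the profile-hub implementation: the `Match` class, the option labels, and both function names are hypothetical, and "exact match" is assumed here to mean simple string equality with the entered name.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Match:
    """Hypothetical shape of one result from a name-matching service."""
    name: str
    authority: Optional[str] = None


def ala_options(entered_name: str, matches: List[Match]) -> List[str]:
    """Return the choices offered to the editor after the ALA match (step 2)."""
    if len(matches) != 1:
        # No match, or more than one match: the profile may stay unmatched.
        return ["proceed_unmatched", "select_manually"]
    if matches[0].name == entered_name:
        # Single exact match (assumed to mean string equality).
        return ["proceed", "select_manually"]
    # Single non-exact match.
    return ["use_matched_name", "proceed_as_entered", "select_manually"]


def nsl_match(
    nsl_matches: List[Match], authority: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Step 3: link to an NSL name only when there is exactly one match.

    Returns (matched NSL name or None, authority to use for the profile).
    """
    if len(nsl_matches) != 1:
        return None, authority
    match = nsl_matches[0]
    # Use the NSL authority only if the earlier step did not identify one.
    return match.name, authority or match.authority


print(ala_options("Acacia dealbata", [Match("Acacia dealbata", "Link")]))
# ['proceed', 'select_manually']
print(nsl_match([Match("Acacia dealbata", "Link")], None))
# ('Acacia dealbata', 'Link')
```

In both steps only a single match is acted on automatically; zero or multiple matches always fall back to the editor's choice (ALA) or to no match at all (NSL).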

The user is also able to remove the matched name on the edit profile screen. This is useful for cases where the system automatically matched to the wrong name during a bulk import (see below for more information on bulk imports).

Choosing among multiple taxonomic concepts

If the name was matched to an NSL name, the user can select the appropriate nomenclature/concept from a list of available concepts (sourced from the NSL's apni-concepts service).

A future change will introduce support for multiple taxonomic trees within the NSL. The Profiles application will then be changed to allow a collection administrator to select the NSL tree to use for their collection - at that point, profile editors will not need to select the taxonomic concept for their profile, as it will be derived from the matched name and the NSL tree.

Name matching during a bulk import

  1. For each profile in the data set, the system will match the name using the ALA Name Matching API.
    1. If there is a single match, regardless of whether it was an exact match or not, then the system will match the profile to that name
    2. If there was no match, or multiple matches, then the system will not match the name
  2. If the source data includes an NSL name identifier, then the system will match the profile to that NSL name
  3. If the source data does not include an NSL name identifier, then the system will attempt to match the name against the NSL's Simple Name Export using the logic outlined in the flow chart below.
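
As a rough sketch, the three import steps above could be combined as follows. The function and parameter names are hypothetical, and the two lookups are passed in as plain callables because the actual ALA and NSL calls are not described here; in particular, `nsl_export_lookup` merely stands in for the Simple Name Export logic and does not reproduce it.

```python
from typing import Callable, Dict, List, Optional


def import_match(
    name: str,
    nsl_id: Optional[str],
    ala_lookup: Callable[[str], List[str]],
    nsl_export_lookup: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Match one imported profile with no human involvement."""
    ala_matches = ala_lookup(name)
    # A single ALA match is accepted whether or not it is exact;
    # zero or multiple matches leave the profile unmatched.
    ala_name = ala_matches[0] if len(ala_matches) == 1 else None
    # An NSL identifier supplied in the source data is used as-is;
    # the name-based lookup only runs when none was supplied.
    if nsl_id is None:
        nsl_id = nsl_export_lookup(name)
    return {"name": name, "ala_match": ala_name, "nsl_id": nsl_id}


# Stub lookups for illustration only.
result = import_match(
    "Acacia dealbata",
    None,
    ala_lookup=lambda n: ["Acacia dealbata"],
    nsl_export_lookup=lambda n: "example-nsl-id",
)
print(result)
```

Unlike the interactive flow, a single non-exact ALA match is accepted silently here, which is one reason an imported profile can end up matched to the wrong name.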

NSL Name matching process during the bulk data import

NSL Name matching flow chart

Choosing among multiple taxonomic concepts

There are several options for the selection of the nomenclature/concept during the import process:

  1. The source data can specify the NSL identifier of the nomenclature to use
  2. The import process can indicate that the system should attempt to match the profile to an appropriate nomenclature using one of the following rules (assuming the name has matched to the NSL) - the customer will need to nominate an option when providing the data:
    1. APC or Latest - if there is an APC concept, use it; otherwise, use the most recent concept
    2. Latest - use the most recent concept
    3. Containing Text - try to find the most recent concept where the name contains certain text (e.g. find the concept with "Flora of NSW" or "Flora of New South Wales")
    4. NSL Search - if the source data includes a concept reference, attempt to find that reference in the NSL using the find-concept service
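
The first three rules above can be illustrated with a small sketch. The `Concept` class and its fields are hypothetical, and "most recent" is assumed to mean the highest publication year. The NSL Search rule is omitted because it depends on the find-concept service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    """Hypothetical shape of one NSL concept for a matched name."""
    name: str
    year: int
    is_apc: bool = False


def latest(concepts: List[Concept]) -> Optional[Concept]:
    """Latest: the most recent concept, or None if there are none."""
    return max(concepts, key=lambda c: c.year, default=None)


def apc_or_latest(concepts: List[Concept]) -> Optional[Concept]:
    """APC or Latest: prefer an APC concept, else the most recent one."""
    apc = [c for c in concepts if c.is_apc]
    return latest(apc) if apc else latest(concepts)


def containing_text(concepts: List[Concept], phrases: List[str]) -> Optional[Concept]:
    """Containing Text: most recent concept whose name contains any phrase."""
    hits = [
        c for c in concepts
        if any(p.lower() in c.name.lower() for p in phrases)
    ]
    return latest(hits)


concepts = [
    Concept("Flora of NSW", 1990),
    Concept("APC", 2005, is_apc=True),
    Concept("Flora of Victoria", 2012),
]
print(latest(concepts).name)         # Flora of Victoria
print(apc_or_latest(concepts).name)  # APC
print(containing_text(concepts, ["Flora of NSW", "Flora of New South Wales"]).name)
# Flora of NSW
```

Each rule returns `None` when nothing qualifies, so the caller can leave the profile without a selected concept rather than guess.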

It is important to note that any automated matching process during the bulk import will result in some level of inaccuracy.

Mismatched Names Report

The user interface provides a "Mismatched Names Report" for a collection (available to editors and administrators). This report will list any profile where:

  1. There is no matched name
  2. The matched name is different to the profile name
  3. There is no matched NSL name
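
The three report conditions amount to a simple filter over a collection's profiles. The sketch below assumes a hypothetical `Profile` record with `name`, `matched_name` and `nsl_name_id` fields; the real data model may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Profile:
    """Hypothetical minimal profile record."""
    name: str
    matched_name: Optional[str] = None
    nsl_name_id: Optional[str] = None


def mismatch_reasons(profile: Profile) -> List[str]:
    """Return every report condition that applies to one profile."""
    reasons = []
    if profile.matched_name is None:
        reasons.append("no matched name")
    elif profile.matched_name != profile.name:
        reasons.append("matched name differs from profile name")
    if profile.nsl_name_id is None:
        reasons.append("no matched NSL name")
    return reasons


def mismatched_names_report(profiles: List[Profile]) -> List[Tuple[str, List[str]]]:
    """List (profile name, reasons) for profiles meeting any condition."""
    report = []
    for profile in profiles:
        reasons = mismatch_reasons(profile)
        if reasons:
            report.append((profile.name, reasons))
    return report


profiles = [
    Profile("Acacia dealbata", "Acacia dealbata", "example-nsl-id"),
    Profile("Acacia dealbta", "Acacia dealbata", "example-nsl-id"),
    Profile("New species"),
]
for name, reasons in mismatched_names_report(profiles):
    print(name, reasons)
```

A profile can appear for more than one reason: an unmatched profile with no NSL name is reported with both.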