Name indexer

Name indexing happens during the "processing" phase of ingestion. If a customised/national checklist is required, a new name index can be created. Details of the required structure are here.

Using the playbook nameindexer-standalone.yml

Assuming you have an instance of the ALA portal running, when you need to use your own name index or update the existing one, you should run the indexing separately so it doesn't risk breaking the running service.

So, as a tutorial for building a customised name index, we start from a vanilla Vagrant Ubuntu box.

$ cd ala-install/vagrant/ubuntu
$ vagrant status

At this point, if you have a vagrant box running, consider destroying it and starting a fresh one.

$ vagrant destroy
$ vagrant up

Ensure the Vagrant box is up and running and that you can log in:

$ vagrant ssh

Now, run the playbook that creates the instance used to build name indexes:

$ cd ../../ansible
$ ansible-playbook -i inventories/vagrant nameindexer-standalone.yml --private-key ~/.vagrant.d/insecure_private_key -u root
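Once the playbook finishes, a quick sanity check from inside the VM is to ask the indexer for its usage (a minimal check, assuming the playbook completed without errors):

$ vagrant ssh
$ nameindexer -help

If the usage options are printed, the standalone name indexer instance is ready.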

Where the name sources are stored:

$ cd /data/lucene/sources

Observe the target folder before we do name indexing:

$ ls /data/lucene

It should have only 'sources' as a directory.

scp your zipped custom DwC-A into the directory /data/lucene/sources and expand the zip.
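For example, assuming a hypothetical archive called my-checklist-dwca.zip and the default Vagrant networking (adjust the address, user and paths to your setup):

$ scp my-checklist-dwca.zip vagrant@<vm-address>:/tmp/
$ vagrant ssh
$ sudo mv /tmp/my-checklist-dwca.zip /data/lucene/sources/
$ cd /data/lucene/sources
$ sudo unzip my-checklist-dwca.zip -d my-checklist-dwca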

Now, to create a name index, for instance:

$ sudo nameindexer -dwca /data/lucene/sources/dwca-col-mammals

(As of 23 Jul 2014, the default owner of /data is root, so you need sudo to run nameindexer.)

Notice that here 'dwca-col-mammals' is a directory extracted from dwca-col-mammals.zip. If you are working on your own checklist in DwC-A format, make sure you extract it before firing the nameindexer. Run $ nameindexer -help for more information.
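As a quick sanity check before running the indexer, list the extracted directory; a checklist DwC-A typically contains a meta.xml descriptor (and usually eml.xml) alongside the core taxon data file:

$ ls /data/lucene/sources/dwca-col-mammals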

Nameindex test search

Test search to see if the name index has been built successfully:

$ sudo nameindexer -testSearch "Macropus rufus"

And you should see:

Search for name
ID: 6863103
GUID: urn:lsid:catalogueoflife.org:taxon:d9f7aefa-29c1-102b-9a4a-00304854f820:col20120124
Classification: "(Desmarest, 1822)",Animalia,Chordata,Mammalia,Diprotodontia,Macropodidae,Macropus
Scientific name: Macropus rufus
Authorship: (Desmarest, 1822)
Rank: SPECIES
Synonym: null
Match type: exactMatch

That means you've got a working index.

Using your new nameindexer

After regenerating the name index, in your /data/lucene directory, you'll see two new directories:

namematching
nmload-tmp

To use the new name index, zip these two directories, then unzip them on your production site to replace the existing ones in the VMs that use the name index (that is, bie-index, biocache-cli, specieslists, sandbox, biocache-service).
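A minimal sketch of that copy, assuming SSH access to a hypothetical production host with the same /data/lucene layout:

$ cd /data/lucene
$ tar czf new-nameindex.tgz namematching nmload-tmp
$ scp new-nameindex.tgz user@production-host:/tmp/
# on the production host, ideally with the dependent services stopped:
$ cd /data/lucene
$ sudo tar xzf /tmp/new-nameindex.tgz

Repeat the extraction on each VM that uses the name index.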

Reindexing your data with the new name index

The next step would be to update the occurrence solr index to use this new name index.

You can test the biocache by processing a single record using biocache process-single <Record UUID>.

Later you can re-process your data in the biocache with biocache process -dr [druid] and re-index it to see the changes with biocache index -dr [druid]. In practice, though, you will probably need a full reindex of your data.
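As a rough worked sequence (dr123 is a placeholder data resource id; substitute your own):

$ biocache process-single <record-uuid>   # spot-check a single record
$ biocache process -dr dr123              # re-process one data resource
$ biocache index -dr dr123                # re-index it so the changes become visible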

Supported checklist formats

@Todo : Text to add (maybe images as well)

About homonyms

By default, the name indexing looks for /data/lucene/sources/IRMNG_DWC_HOMONYMS to analyse homonyms. If you have alternative homonyms to detect against, run nameindexer with the -irmng flag and point it to your own extracted homonym DwC-A.
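For example, assuming a hypothetical extracted homonym archive at /data/lucene/sources/my-irmng-homonyms, the invocation might look like:

$ sudo nameindexer -dwca /data/lucene/sources/dwca-col-mammals -irmng /data/lucene/sources/my-irmng-homonyms

Check $ nameindexer -help for the exact flag syntax of your installed version.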

To avoid obvious homonym indexing errors, you can provide taxonomy hints in the Collectory hub when editing the metadata of a data resource. The URL would be /collectory/dataResource/edit/[druid]?page=%2Fshared%2FeditTaxonomyHints (replace [druid] with your own).

About using the GBIF Backbone

If you want to use the GBIF Backbone Taxonomy, take into account that, right now, the ALA nameindex uses a mandatory scientificName without scientificNameAuthorship plus an optional nameComplete, unlike the GBIF backbone, which uses a full scientificName following Darwin Core. Currently (Nov 2019), if you want to use the GBIF Backbone Taxonomy and avoid the duplication of authors in BIE (and other related issues), you can, for instance, adapt the backbone using this utility to remove the author from the scientificName.

Or just download a GBIF backbone nameindex compatible with ALA

You can just download a GBIF nameindex compatible with ALA without following all the previous steps. To do so:

  • Put this in your inventories:

custom_namematching_url = https://datos.gbif.es/others/nameindex-gbif-backbone-nov-2021-lucene-6.tgz
nameindex_to_use = custom

This is now the default value in newly generated or updated inventories; otherwise, the ALA nameindex is used.

  • Use this tar in the first step of the bie-index admin tool (DwCA Import): the one generated by the previous utility (or the one configured directly by the la-toolkit and ala-install). You will save hours, or even days, of processing time.

How to use ansible to update your custom nameindex and namedata

When you set up a custom nameindex, it should be configured in the VMs of the different services (biocache-service, biocache-store, species-lists, sds, ...), and you also need to configure the full source namedata in bie-index. Both parts should match. In ala-install you can run grep "nameindex" ansible/*yml | grep -v nameindexer to see where it is used.

This is too much work to do by hand, so it is better to do it with Ansible. For instance, when using the GBIF taxonomy we configure it like this. You can override these variables with custom URLs, dates, and checksums if you are not using the GBIF Taxonomy. Later, if you are using generated inventories, you can run something like ./ansiblew --alainstall=YOUR_ALA_INSTALL_UP_TO_DATE all --tags namedata,nameindex -n to update the VMs that use the nameindex or namedata. With one command you can update all the affected parts and components.
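As a sketch, under the assumption that your inventory follows the conventions shown above (only custom_namematching_url and nameindex_to_use appear on this page; a real inventory may use additional variables for namedata URLs, dates and checksums):

# inventory excerpt pointing the services at a custom nameindex artefact (hypothetical URL)
custom_namematching_url = https://example.org/my-nameindex-lucene-6.tgz
nameindex_to_use = custom

# then push the change to every VM that uses the nameindex or namedata
$ ./ansiblew --alainstall=YOUR_ALA_INSTALL_UP_TO_DATE all --tags namedata,nameindex -n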

The zips/tars should have a structure similar to that of the default ALA nameindex/namedata artefacts produced by the generator:

$ tar tvf nameindex-gbif-backbone-nov-2021-lucene-6.tgz | head
drwxr-xr-x root/root         0 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/
drwxr-xr-x root/root         0 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/
-rw-r--r-- root/root     24300 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1_Lucene50_0.pos
-rw-r--r-- root/root       136 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/segments_2
-rw-r--r-- root/root       158 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1.nvm
-rw-r--r-- root/root         0 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/write.lock
-rw-r--r-- root/root      3081 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1_Lucene50_0.tip
-rw-r--r-- root/root       886 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1.fnm
-rw-r--r-- root/root    220785 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1_Lucene50_0.tim
-rw-r--r-- root/root       524 2022-01-15 02:51 namematching-gbif-2021-nov-lucene-6/irmng/_1.si

$ unzip -l gbif-backbone-2021-11-26.zip 
Archive:  gbif-backbone-2021-11-26.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
663679626  2021-12-03 23:48   Description.tsv
487984178  2021-12-03 23:49   Distribution.tsv
390616016  2021-12-03 23:49   Multimedia.tsv
1142364020  2021-12-03 23:49   Reference.tsv
1923275799  2021-12-28 13:41   Taxon.tsv
  8722097  2021-12-03 23:50   TypesAndSpecimen.tsv
116939561  2022-01-14 22:30   VernacularName.tsv
        0  2021-12-09 12:03   dataset/
    12743  2021-12-09 12:02   eml.xml
     6591  2021-12-03 23:50   meta.xml
---------                     -------
4733600631                     10 files

This way the nameindex is configured on all the servers whose services need it (biocache-service, biocache-store, ...) and the namedata in bie-index. Later, you can import the namedata into bie-index to get the pages for your species, and so on. The rest of the components will use the nameindex to index occurrences, to search, to match species in species-lists, and so on.

More information

For more name indexing information, read and follow the Guide to Getting Names into ALA.

This bie-index wiki page about the full reindex tasks is outdated but still quite informative about the whole reindex process.