A Guide to Getting Names into the ALA - AtlasOfLivingAustralia/documentation Wiki
- How It Works
- What You Will Need
- Installing and Using a New Names Index
How It Works
The ALA is basically a great big data cube. The ALA uses names as a way of indexing this cube so that users can structure data according to taxonomy (e.g.: I only want records from the family Cassidini). This guide shows you how to set up the Biodiversity Information Explorer (BIE) and name matching indexes with your own taxonomies.
The biocache holds occurrence records – this animal/plant was seen here, at this time, by this person: the what, where, when and who. The information in the biocache is indexed by a solr index, which allows people to search the biocache for things they are interested in. The name matching index contains a taxonomy suitable for processing. The supplied information in every occurrence record in the biocache is matched against the name matching index, and the occurrence record is annotated with things like the matched name, higher taxonomy, quality of match, etc. The link between the name matching index and the biocache is the guid, which gives a unique identifier for each species, suitable for indexing.
The BIE holds organising information – this species, this dataset, this locality, this region, this webpage. A person can search the BIE as a first entry point into the ALA and get back a number of references to things that might be of interest. In particular, the BIE holds species and taxonomy information. It also holds references to, more or less, anything that can be used to search the ALA.
The collectory holds metadata about the datasets, data providers, collections and institutions that provide data to the biocache. The metadata particularly holds a description, URLs, contact information and licensing and copyright information. As well as datasets, it can also hold metadata about things like webpages, lists of species, etc.
The basic use of the ALA is that a user goes to the BIE and types in a name. The BIE will search for the name and give the user a set of options. The user can then click on the link that is closest to what they want. If that link is a species (or genus, etc.) page then the user can ask to be shown all the records in the biocache which match that taxon.
What You Will Need
- To begin with, an installed instance of the ALA, with the collectory. See the LA Quick Start Guide.
- Ensure that the following directories exist:
- The ala-name-matching library and programs from https://nexus.ala.org.au
- What you want for the current installation is https://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/2.4.7/ala-name-matching-2.4.7-distribution.zip
- You can unzip this distribution anywhere you want to build the name matching index. This can be your personal computer; you just need a Java 8 installation.
- See https://github.com/AtlasOfLivingAustralia/ala-name-matching for more information
- You will also need the IRMNG DwCA, for homonym detection. You can get this from http://www.irmng.org/export/ Unzip this into /data/lucene/sources/IRMNG_DWC_HOMONYMS on the machine where you plan to run ala-name-matching.
- Talend Open Studio from https://www.talend.com This is useful for building the taxonomy Darwin Core Archive described below.
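The list of directories has been lost from this page; here is a hedged sketch that creates the working directories referred to elsewhere in this guide. `BASE` defaults to a local demo path so the commands can be tried safely; on a real ALA server it would be `/data`:

```shell
# Sketch only: create the working directories used later in this guide.
# The exact list is missing from this page; these paths are taken from
# the commands that appear below.
BASE="${BASE:-./data}"
mkdir -p "$BASE/lucene/namematching" \
         "$BASE/lucene/sources/IRMNG_DWC_HOMONYMS" \
         "$BASE/bie/import"
```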
Installing and Using a New Names Index
In the examples, we are building an archive for
sibbr. You can use any name that suits you.
Step 1: Build the Darwin Core Archive
You can do this any way you want to. What you need as an output is a Darwin Core Archive (
DwCA) containing information covered in the Taxon profile of Darwin Core. The result needs to follow the conventions described in https://github.com/AtlasOfLivingAustralia/bie-index/blob/master/doc/nameology/index.md
At a minimum, though, you will need a taxon.csv taxonomy file, a meta.xml description and an eml.xml metadata description. The DwCA needs to be structured to have a parentNameUsageID (for accepted taxa), an acceptedNameUsageID (for synonyms) and a taxonomicStatus following the conventions listed above. You can add other information as you see fit.
The GBIF Darwin Core Archive Assistant can help you decide on the terms and structure of the archive.
The ALA uses Talend to pull together the various data sources, transform them into Darwin Core following the nameology conventions and build a DwCA. You don't have to do this; you can use anything that achieves the correct result.
If you have multiple, overlapping taxonomies, things get more complicated. You will need to use the Large Taxon Collider, described at https://github.com/AtlasOfLivingAustralia/ala-name-matching/blob/master/doc/large-taxon-collider.md
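As a concrete sketch, a minimal DwCA with the three files described above could be laid out like this. The taxa, file names and column order are purely illustrative; check them against the nameology conventions linked above:

```shell
mkdir -p dwca

# Minimal taxonomy file: accepted taxa carry a parentNameUsageID,
# synonyms carry an acceptedNameUsageID (the example taxa are invented).
cat > dwca/taxon.csv <<'EOF'
taxonID,parentNameUsageID,acceptedNameUsageID,scientificName,taxonRank,taxonomicStatus
t1,,,Asteraceae,family,accepted
t2,t1,,Brachyscome,genus,accepted
t3,,t2,Brachycome,genus,synonym
EOF

# meta.xml maps each column to a Darwin Core term
cat > dwca/meta.xml <<'EOF'
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" encoding="UTF-8"
        fieldsTerminatedBy="," linesTerminatedBy="\n" ignoreHeaderLines="1">
    <files><location>taxon.csv</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
  </core>
</archive>
EOF

# Bare-bones eml.xml metadata stub
cat > dwca/eml.xml <<'EOF'
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example-taxonomy" system="http://gbif.org">
  <dataset><title>Example taxonomy</title></dataset>
</eml:eml>
EOF
```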
Step 2: Build the Name Matching Index
Where you have unzipped the name matching distribution, run the command:
java -jar ala-name-matching-2.4.7.jar -all -dwca /path/to/DwCA
/path/to/DwCA is the path to the directory where the unzipped
DwCA is. If you want to see all the possible options, run
java -jar ala-name-matching-2.4.7.jar -h
The resulting name index will be found in /data/lucene/namematching. Any previous name matching index will be renamed.
For copying around, it's usually best to zip up the namematching directory. Say
zip -r namematching.zip namematching
You can also use the nameindexer role to perform that task.
Also: on a VM with nameindexer installed, which includes a default DwCA from the Catalogue of Life, you can create your own name index with vernacular names as follows:
- rename /data/lucene/sources/col_vernacular.txt (so nameindexer can't find this default file)
- put your DwCA in a sub-folder and include at least these:
- (eml.xml does not appear to be required.)
- meta.xml with column-mappings for your species file and vernacular file
- Species file (csv/txt. Header not required. See Step 1 above for required fields.)
- Vernacular file (csv/txt. Header not required.)
nameindexer -all -dwca /path/to/your/dwca
Note that you should NOT use the -common switch to include/process your vernacular file via your meta.xml file; this method overrides the default behaviour for including a vernacular file. If you see errors like this:
2020-01-01 12:00:00,000 INFO : [DwcaNameIndexer] - Issue on line 10000 1234567
This is likely the result of trying to use the -common switch with your own vernacular file whose columns do not match the default column-mapping expected by nameindexer.
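For reference, the vernacular column mapping in meta.xml might use the GBIF VernacularName extension along these lines. The file name and column indices here are assumptions for illustration, not the exact mapping nameindexer expects:

```xml
<!-- Hypothetical extension entry inside the <archive> element of meta.xml -->
<extension rowType="http://rs.gbif.org/terms/1.0/VernacularName"
           encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n">
  <files><location>vernacular.csv</location></files>
  <coreid index="0"/>
  <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
  <field index="2" term="http://purl.org/dc/terms/language"/>
</extension>
```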
Step 3: Copy the Data to the Server
Do not have any occurrence records being imported or processed while you are doing the next steps.
Copy the DwCA and the namematching directory to the server.
Put the contents of the DwCA into /data/bie/import/sibbr and change ownership with chown -R tomcat7.tomcat7 /data/bie/import/sibbr. It is important that you change ownership, otherwise the BIE may have trouble importing the archive.
Put the contents of the namematching directory into /data/lucene/namematching. If you are changing name matching indexes often, it is good practice to datestamp the directories (eg namematching-20180921) and use a symbolic link to point /data/lucene/namematching at the current one.
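The datestamp-and-symlink practice can be sketched as follows. `LUCENE` defaults to a local demo path here; on the server it would be /data/lucene:

```shell
# Sketch: keep datestamped index directories and point a stable
# symlink at the current one.
LUCENE="${LUCENE:-./lucene}"
STAMP="$(date +%Y%m%d)"
mkdir -p "$LUCENE/namematching-$STAMP"
# -n replaces an existing symlink rather than descending into it
ln -sfn "namematching-$STAMP" "$LUCENE/namematching"
```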
Step 4: Import into the BIE
Before doing this, have a look at the bie-index-config.yml file, which contains the list of steps the import process goes through. You can adjust these steps to suit what you have.
You may also want to modify the contents of
image-lists.json in the same directory. These are documented at https://github.com/AtlasOfLivingAustralia/bie-index
Once you are happy, go to your server at http://localhost/bie-index/admin and choose the "Import All" option. Click on the button and watch the log expand as it steps through the elements in the import sequence. This sequence will first pull in all the data providers, data resources, collections, institutions, etc. from the collectory, then import all the taxa from /data/bie/import, then denormalise the taxonomy and link synonyms, then load conservation status information, then scan for unique human-readable links to species, then scan for images and finally load an estimate of the occurrences for each taxon into the index.
Step 5: Configure your Species Subgroups
You should probably also [[configure your species subgroups]] to match the new name index hierarchy.
Step 6: Reprocess and Reindex the biocache
Since we have a new name matching index, the entire biocache needs to be reprocessed to match the supplied names against the new index.
Once you have reprocessed the biocache, you need to re-index it.
The result will be a new biocache index.
Step 7: Swap Cores
The BIE serves data from a solr core called bie. It imports data into a core called bie-offline. Once the import is complete, the cores need to be swapped, so that what was bie-offline becomes bie, and what was bie becomes bie-offline, ready for the next load. To swap cores, go to http://localhost:8983/solr, choose "Core Admin" and swap the two cores.
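If you prefer the command line, the same swap can be issued through Solr's CoreAdmin API. A sketch, assuming the default host and port mentioned above:

```shell
# Build the CoreAdmin SWAP request for the bie cores; run it with curl.
SOLR="${SOLR:-http://localhost:8983/solr}"
SWAP_URL="$SOLR/admin/cores?action=SWAP&core=bie&other=bie-offline"
echo "$SWAP_URL"
# curl "$SWAP_URL"
```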
The new biocache index first needs to be imported into solr. Again, choose "Core Admin" and choose "Add Core" with an instance dir of /data/solr/data/biocache and a data dir of wherever the new index is located. Then swap the biocache and new cores.
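The "Add Core" step also has a CoreAdmin API equivalent. In this sketch the new core name biocache-new and the data dir placeholder are assumptions; substitute your real index location before running:

```shell
# Build CoreAdmin CREATE and SWAP requests for the biocache cores;
# replace /path/to/new/index with the real location of the new index.
SOLR="${SOLR:-http://localhost:8983/solr}"
CREATE_URL="$SOLR/admin/cores?action=CREATE&name=biocache-new&instanceDir=/data/solr/data/biocache&dataDir=/path/to/new/index"
SWAP_URL="$SOLR/admin/cores?action=SWAP&core=biocache&other=biocache-new"
printf '%s\n%s\n' "$CREATE_URL" "$SWAP_URL"
# curl "$CREATE_URL" && curl "$SWAP_URL"
```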
More details and screenshots are in the SOLR Admin tasks page.
- This bie-index wiki page about the full reindex tasks is outdated but quite informative about the whole reindex process.