Data Quality Development Discussion - AtlasOfLivingAustralia/data-management GitHub Wiki

Data Quality issue summary

Documentation:

QA block at the top of a record list with facets on

Collections: Expert distribution maps

Colour of expert distribution maps (default dark blue is hard to see)
Zoom into expert distribution maps when distribution is small
Consistent naming – “compiled distribution map” on species pages vs “expert distribution map” in spatial portal

Assertions/annotations management

Support for collection management software

Ensure that the Assertions API is capable of identifying an official ALA partner in some way different from other users.
Enable the Assertions API to be searched via catalogNumber (or other similar fields), so that the API can be asked to retrieve a specimen by a unique identifier known to the collection. Otherwise, the collection has to find the UUID of a specimen in the ALA infrastructure before the assertions for that specimen can be retrieved. This is an extra burden on the collection or the CMT that should be relatively easy to remove.
Write a document that explains this feature to the CMT makers in a way that gets them writing code as quickly as possible.
Provide an email address, or a developer who can be available to answer questions (from CMT developers) about this feature in a timely manner.
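
As a rough illustration of the catalogNumber lookup requested above, the following sketch resolves a collection's own identifier to record UUIDs before returning assertions. It is not the Assertions API: the record layout, the UUIDs, the assertion strings and the function names `build_catalog_index` and `assertions_for` are all hypothetical, and pairing catalogNumber with institutionCode is an assumption made so that the same number in two collections does not collide.

```python
from collections import defaultdict

# Hypothetical records: field names follow Darwin Core terms, but the
# UUIDs, catalogue numbers and assertion strings are invented.
RECORDS = [
    {"uuid": "uuid-1", "institutionCode": "INST-A",
     "catalogNumber": "A 0123", "assertions": ["geospatial issue"]},
    {"uuid": "uuid-2", "institutionCode": "INST-B",
     "catalogNumber": "K.1234", "assertions": []},
]


def build_catalog_index(records):
    """Map (institutionCode, normalised catalogNumber) to record UUIDs."""
    index = defaultdict(list)
    for rec in records:
        key = (rec["institutionCode"], rec["catalogNumber"].strip().lower())
        index[key].append(rec["uuid"])
    return index


def assertions_for(records, index, institution_code, catalog_number):
    """Return {uuid: assertions} for records matching the collection's identifier."""
    key = (institution_code, catalog_number.strip().lower())
    uuids = set(index.get(key, []))
    return {r["uuid"]: r["assertions"] for r in records if r["uuid"] in uuids}


index = build_catalog_index(RECORDS)
found = assertions_for(RECORDS, index, "INST-A", "a 0123")
```

With a lookup like this on the server side, the collection never needs to know the UUID in advance; it is returned alongside the assertions.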

Mobilising annotations

One note raised during the symposium was that there is currently no way to transmit assertions to GBIF, because Darwin Core has no standard for assertions. Work with GBIF and the Darwin Core maintainers to get assertions added.

Authoritative annotations

Rank an annotation as coming from an authority
Manual “misidentified” annotation from an authority (records are excluded from being representative and highlighted as misidentified (e.g. red border on image))

Duplicates

Batch annotations

DigiVol

change all BVP, Biodiversity Volunteer Portal, Digi Vol etc to DigiVol
duplicates – draft DigiVol records that come back through the collection loads once validated
turn DigiVol record into a stub and link to collection record?

Data Authority

colour code points – Green = top quality data, specimen identification and associated data either created or confirmed by a specialist; Yellow = probably OK but, for example, data obtained from grey literature, taxonomy and data not recently confirmed, data based on identification of specimen in a collection, but accuracy unknown; Red = very questionable, data point not supported by a specimen, contributed by a non-specialist, field observation only, etc.
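
The colour scheme above could be reduced to a small decision rule. The sketch below is a deliberate simplification, assuming two hypothetical boolean fields on a record (`has_specimen` and `specialist_confirmed`) that do not exist in any ALA schema as far as this page states. Real criteria such as grey literature or currency of taxonomy would need more inputs.

```python
def colour_for(record):
    """Return 'green', 'yellow' or 'red' for a record (simplified rule).

    Assumed, hypothetical fields:
      has_specimen         - the data point is backed by a specimen
      specialist_confirmed - identification and data created or confirmed
                             by a specialist
    """
    if not record.get("has_specimen", False):
        # Field observation only, or no supporting specimen: very questionable.
        return "red"
    if record.get("specialist_confirmed", False):
        # Specimen-backed and confirmed by a specialist: top quality.
        return "green"
    # Specimen exists but accuracy is unknown: probably OK.
    return "yellow"


points = [
    {"has_specimen": True, "specialist_confirmed": True},
    {"has_specimen": True},
    {"has_specimen": False, "specialist_confirmed": True},
]
colours = [colour_for(p) for p in points]
```

Checking the specimen first means an unsupported data point stays red even when a specialist contributed it, which is one possible reading of the scheme rather than an agreed rule.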

Full resolution downloads

Literature References

The section “references” is just an unorganised accumulation – on Thysanoptera there are over 2000 references of worldwide literature with no way of knowing what, if anything, is relevant without opening each one. http://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:c11433d8-6064-42ba-9279-8aab58bee450#tab_literature

Sampling effort

some way of looking at sampling effort at various taxonomic levels by frequency/abundance vs environmental factors, e.g. plot the frequency of a family by depth and show the class frequency by depth on the same graph, to see whether the occurrences for a family are similar to those of the class.
show the known occurrences of a species in one colour against a grey background of the class, giving a visual indication of whether the taxon you’re looking at is from an area where sampling suggests it would have been found or from an area that is poorly sampled.
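
The family-versus-class comparison by depth could be computed along these lines before anything is plotted. This is only a sketch: the taxon names and depths are invented, the 100 m bin width is arbitrary, depths are assumed to be non-negative metres, and `depth_profile` and `profile_for` are hypothetical helper names.

```python
from collections import Counter

# Invented occurrence records, for illustration only.
OCCURRENCES = [
    {"class": "ClassX", "family": "FamilyA", "depth": 20},
    {"class": "ClassX", "family": "FamilyA", "depth": 80},
    {"class": "ClassX", "family": "FamilyA", "depth": 150},
    {"class": "ClassX", "family": "FamilyB", "depth": 30},
    {"class": "ClassX", "family": "FamilyB", "depth": 450},
    {"class": "ClassX", "family": "FamilyB", "depth": 520},
]


def depth_profile(depths, bin_width=100):
    """Share of occurrences per depth bin, keyed by the bin's lower bound."""
    counts = Counter(int(d // bin_width) * bin_width for d in depths)
    total = sum(counts.values())
    return {start: n / total for start, n in sorted(counts.items())}


def profile_for(records, rank, name, bin_width=100):
    """Depth profile of all records whose `rank` field equals `name`."""
    depths = [r["depth"] for r in records if r[rank] == name]
    return depth_profile(depths, bin_width)


family_profile = profile_for(OCCURRENCES, "family", "FamilyA")
class_profile = profile_for(OCCURRENCES, "class", "ClassX")
```

Plotting both profiles on the same axes would show whether the family is absent from depth bins where the class as a whole has been sampled.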

Name matching

matching to higher taxonomy can be very coarse if a major revision has been done (currency of names problem)

Original vs Processed

visibility of original vs processed – move from a pop-up to toggle on record details?

Images

image voting - ability to flag as incorrect and flag as good quality/representative
General image tagging?

Data load

remove values when a null is provided in the data feed
parsing of multiple collectors – check pipe and semicolon delimiters are handled correctly
identification qualifier processing – map to vocab or provide a summary assertion “uncertain identification” if any uncertainty value is provided
type status processing – herbaria are not providing complete type status data. PERTH delivers the ABCD typeStatus, which is just the type of type ('holotype', 'isotype', etc.), but not the typified name.
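
Three of the data load items above (null handling, collector parsing and the identification qualifier assertion) are sketched below. None of this is taken from the actual load code: the function names are hypothetical, a null in the feed is assumed to arrive as Python `None`, and any non-blank qualifier is assumed to count as an uncertainty value.

```python
import re


def apply_feed(existing, incoming):
    """Merge feed values into a stored record; an explicit None removes the value."""
    merged = dict(existing)
    for field, value in incoming.items():
        if value is None:
            merged.pop(field, None)  # null in the feed clears the stored value
        else:
            merged[field] = value
    return merged


def split_collectors(raw):
    """Split a collector string on pipe or semicolon, dropping empty parts."""
    if raw is None:
        return []
    return [name.strip() for name in re.split(r"[|;]", raw) if name.strip()]


def identification_assertions(qualifier):
    """Return a summary assertion when any non-blank qualifier is supplied."""
    if qualifier and qualifier.strip():
        return ["uncertain identification"]
    return []


collectors = split_collectors("Smith, J. | Jones, A.;Brown, B.")
record = apply_feed({"locality": "old value", "recordedBy": "Smith, J."},
                    {"locality": None})
```

Commas are left alone in `split_collectors` because they commonly appear inside a single collector's name.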

Check biocache-store for DQ issues

Images: there is an image-service issue on duplicate images, which could be part of the data quality work: https://github.com/AtlasOfLivingAustralia/image-service/issues/26. I also noticed we have lots of images with missing image files, which need addressing too: http://images.ala.org.au/image/details/66140333 http://images.ala.org.au/store/e/9/1/0/2d944025-4521-421b-92a5-ead7fb04019e/original (strange Flickr message)