Data Quality Development Discussion - AtlasOfLivingAustralia/data-management GitHub Wiki

Data Quality issue summary

Documentation:

QA block at the top of a record list with facets on

  • Name matching and identification certainty
  • Spatial quality – coordinate precision and uncertainty
  • Sensitive
  • Record type – basis of record
  • Data currency – event date range and update date range
  • Outlier layers
  • User assertions
  • Associated records

Collections: Expert distribution maps

  • Colour of expert distribution maps (default dark blue is hard to see)
  • Zoom into expert distribution maps when distribution is small
  • Consistent naming – “compiled distribution map” on species pages vs “expert distribution map” in spatial portal

Assertions/annotations management

Support for collection management software

  • Ensure that the Assertions API is capable of identifying an official ALA partner in some way different from other users.
  • Enable the Assertions API to be searched via catalogNumber (or other similar fields), so that the API can retrieve a specimen by a unique identifier known to the collection. Otherwise, the collection has to find the UUID of a specimen in the ALA infrastructure before the assertions for that specimen can be retrieved. This is an extra burden on the collection or the CMT that should be relatively easy to remove.
  • Write a document that explains this feature to the CMT makers in a way that gets them writing code as quickly as possible.
  • Provide an email address, or a developer who can be available to answer questions (from CMT developers) about this feature in a timely manner.
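
From a CMT's point of view, the catalogNumber lookup and the partner flag described above might behave roughly as follows. This is a minimal sketch over in-memory data: the `AssertionIndex` class, its method names and the `from_partner` field are hypothetical and do not reflect the actual Assertions API.

```python
from collections import defaultdict


class AssertionIndex:
    """Hypothetical in-memory stand-in for an assertions service."""

    def __init__(self):
        # (collection code, catalogNumber) -> record UUID
        self._uuid_by_catalog = {}
        # record UUID -> list of assertion dicts
        self._assertions = defaultdict(list)

    def register_record(self, uuid, collection_code, catalog_number):
        self._uuid_by_catalog[(collection_code, catalog_number)] = uuid

    def add_assertion(self, uuid, text, from_partner=False):
        # from_partner marks assertions made by an official partner (assumed flag)
        self._assertions[uuid].append({"text": text, "from_partner": from_partner})

    def find_by_catalog_number(self, collection_code, catalog_number):
        """Return assertions without the caller ever needing the UUID."""
        uuid = self._uuid_by_catalog.get((collection_code, catalog_number))
        if uuid is None:
            return []
        return list(self._assertions[uuid])
```

The point of the design is that the UUID stays internal: the collection only ever supplies identifiers it already holds.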

Mobilising annotations

  • A point raised during the symposium: there is currently no way to transmit assertions to GBIF because Darwin Core has no standard for assertions. Work with GBIF and the Darwin Core maintainers to get assertions added.

Authoritative annotations

  • Rank an annotation as coming from an authority
  • Manual “misidentified” annotation from an authority: records are excluded from being representative and highlighted as misidentified (e.g. red border on image)

Duplicates

  • Manual “duplicate” annotation from an authority
  • Add collecting/field number to duplicate detection
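
One way to bring the collecting/field number into duplicate detection is to include it in the comparison key. The sketch below is illustrative only: it assumes records carry the Darwin Core terms `scientificName`, `recordedBy`, `eventDate` and `recordNumber`, and the function names are hypothetical rather than taken from the existing duplicate detection code.

```python
from collections import defaultdict


def duplicate_key(record):
    """Build a normalised comparison key (field choice is an assumption)."""
    def norm(value):
        return (value or "").strip().lower()

    return (
        norm(record.get("scientificName")),
        norm(record.get("recordedBy")),
        norm(record.get("eventDate")),
        norm(record.get("recordNumber")),  # collecting/field number
    )


def find_duplicates(records):
    """Group record ids that share the same key, field number included."""
    groups = defaultdict(list)
    for record in records:
        key = duplicate_key(record)
        if key[-1]:  # skip records that have no field number at all
            groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```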

Batch annotations

  • Select a set of records and give them the same annotation

DigiVol

  • change all BVP, Biodiversity Volunteer Portal, Digi Vol etc to DigiVol
  • duplicates – draft DigiVol records that come back through the collection loads once validated
  • turn DigiVol record into a stub and link to collection record?

Data Authority

  • colour code points – Green = top quality data, specimen identification and associated data either created or confirmed by a specialist; Yellow = probably OK but, for example, data obtained from grey literature, taxonomy and data not recently confirmed, data based on identification of specimen in a collection, but accuracy unknown; Red = very questionable, data point not supported by a specimen, contributed by a non-specialist, field observation only, etc.
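
The colour scheme above could be reduced to a simple rule. The sketch below is a deliberate simplification under assumed inputs: the three boolean flags and the function name are hypothetical, and real records would need these flags derived from other fields.

```python
def quality_colour(has_specimen, specialist_confirmed, recently_confirmed):
    """Map three assumed quality flags to a point colour."""
    if not has_specimen:
        # Not supported by a specimen, e.g. field observation only
        return "red"
    if specialist_confirmed and recently_confirmed:
        # Identification and data created or confirmed by a specialist
        return "green"
    # Specimen exists, but accuracy or currency is unknown
    return "yellow"
```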

Full resolution downloads

  • Delivery of data to government departments without locational smearing?

Literature References

Sampling effort

  • some way of looking at sampling effort at various taxonomic levels by frequency/abundance vs environmental factors, e.g. plot the frequency of a family by depth and show the class frequency by depth on the same graph, to see whether the occurrences for the family are similar to those of the class.
  • show the known occurrences of a species in one colour against a grey background of the class, giving a visual indication of whether the taxon you’re looking at is from an area where sampling suggests it would have been found or from an area that is poorly sampled.
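
The family-versus-class comparison by depth could start from binned relative frequencies like these. This is a minimal sketch: the record layout (`family`, `class` and `depth` keys), the function name and the 100 m default bin size are assumptions made for illustration.

```python
from collections import Counter


def relative_frequency_by_depth(records, rank, name, bin_size=100):
    """Share of a taxon's records falling in each depth bin.

    Bins are labelled by their lower bound, e.g. 100 covers 100 to 199.
    """
    counts = Counter()
    for rec in records:
        if rec.get(rank) == name and rec.get("depth") is not None:
            counts[int(rec["depth"] // bin_size) * bin_size] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {depth_bin: n / total for depth_bin, n in sorted(counts.items())}
```

Calling it once with the family and once with the class gives two series that can be drawn on the same graph; a family whose profile departs strongly from its class is a candidate for closer inspection.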

Name matching

  • matching to higher taxonomy can be very coarse if a major revision has been done (currency of names problem)

Original vs Processed

  • visibility of original vs processed – move from a pop-up to toggle on record details?

Images

  • image voting - ability to flag as incorrect and flag as good quality/representative
  • General image tagging?

Data load

  • remove values when a null is provided in the data feed
  • parsing of multiple collectors – check pipe and semicolon delimiters are handled correctly
  • identification qualifier processing – map to vocab or provide a summary assertion “uncertain identification” if any uncertainty value is provided
  • type status processing – herbaria not providing complete type status data. PERTH delivers the ABCD typeStatus, which is just the type of type ('holotype', 'isotype', etc.), but not the typified name.
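
The first three data load items could be sketched as below. These are illustrative helpers only: the function names are hypothetical, the record layout is assumed, and the code is not taken from biocache-store.

```python
import re


def split_collectors(raw):
    """Split a collector string on pipe or semicolon delimiters."""
    if not raw:
        return []
    return [name.strip() for name in re.split(r"[|;]", raw) if name.strip()]


def apply_feed_update(stored, update):
    """Merge a feed update; an explicit null (None) removes the stored value."""
    merged = dict(stored)
    for field, value in update.items():
        if value is None:
            merged.pop(field, None)
        else:
            merged[field] = value
    return merged


def identification_assertions(record):
    """Raise a summary assertion when any qualifier value is provided."""
    qualifier = (record.get("identificationQualifier") or "").strip()
    return ["uncertain identification"] if qualifier else []
```

Commas are deliberately not treated as delimiters here, since they commonly appear inside names written as "Surname, Initial".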

Check biocache-store for DQ issues

Images: there is an image-service issue on duplicate images, which could be part of the data quality work: https://github.com/AtlasOfLivingAustralia/image-service/issues/26

I also noticed we have lots of images with missing image files, which needs addressing too, e.g.:

  • http://images.ala.org.au/image/details/66140333
  • http://images.ala.org.au/store/e/9/1/0/2d944025-4521-421b-92a5-ead7fb04019e/original (strange Flickr message)