AtlasSubsysQuality

Introduction

This is not an attempt to document the quality sub-system in full; it is a pointer to the volume of information on this many-faceted beast, based on the understanding of someone who has developed some of these tools and actively uses most of them for quality content analysis.

Quality completeness (or just completeness)…

is the 'plenitude' of data available to the consumer. A sparse record may have limited fitness for use, whereas more complete data offer a firmer foundation from which to draw conclusions.
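As a rough illustration (not part of the Atlas tooling), completeness can be approximated by counting how many of a chosen set of simple-dwc terms hold a value in each record; the term list and file name below are assumptions made for this sketch.

```python
# A rough completeness measure: the proportion of a chosen set of
# Darwin Core terms that are populated in each record.
# The term list and input file name are illustrative assumptions only.
import csv

DWC_TERMS = [
    "scientificName", "eventDate", "decimalLatitude", "decimalLongitude",
    "basisOfRecord", "recordedBy", "locality", "coordinateUncertaintyInMeters",
]

def record_completeness(record):
    """Fraction of the chosen terms that hold a non-empty value."""
    populated = sum(1 for term in DWC_TERMS if (record.get(term) or "").strip())
    return populated / len(DWC_TERMS)

with open("occurrences.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row.get("occurrenceID", "?"), f"{record_completeness(row):.2f}")
```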

Quality or content analysis

is distinct from completeness analysis: it is the result of an effort to correct, enhance, or otherwise cross-reference individual properties within the source data, in the context of a wider dataset. It involves structural or statistical machine algorithms and/or human analysis; the more complete the data being analysed, the more algorithms can be applied.

Quality analysis performed by this sub-system:

  • is domain-specific; a lot of the tests performed make sense in the broader domains of taxonomy, biodiversity and geospatial analysis
  • is statistical; many tests are founded on logic and maths, e.g. XXX (a simple illustrative check is sketched after this list)
  • can help users improve the 'fitness-for-use', or completeness, of their data as it relates to Darwin Core
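A minimal sketch of the kind of logic-and-maths test referred to above, assuming simple-dwc field names; the checks and assertion labels are illustrative only and are not taken from the Data Quality Checks Spreadsheet.

```python
# Illustrative structural checks in the spirit of the quality tests:
# coordinate range validation and a simple date sanity test.
# Field names are simple-dwc terms; assertion labels are made up for the sketch.
from datetime import date

def check_record(record):
    """Return a list of assertion labels that fail for this record."""
    failures = []
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
        if not (-90.0 <= lat <= 90.0) or not (-180.0 <= lon <= 180.0):
            failures.append("COORDINATES_OUT_OF_RANGE")
        if lat == 0.0 and lon == 0.0:
            failures.append("ZERO_COORDINATES")
    except (KeyError, ValueError, TypeError):
        failures.append("COORDINATES_UNPARSABLE")
    year = (record.get("year") or "").strip()
    if year.isdigit() and int(year) > date.today().year:
        failures.append("YEAR_IN_FUTURE")
    return failures

print(check_record({"decimalLatitude": "0", "decimalLongitude": "0", "year": "2099"}))
```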

ALA data quality wiki

Firstly, there is an Atlas quality wiki: http://code.google.com/p/ala-dataquality/. Most of the information here points the reader to that site, directly to source code, or to logic related to data quality.

Quality checks spreadsheet

A spreadsheet of the tests performed by the quality analysis sub-system is available at http://goo.gl/0ICnO, or via http://code.google.com/p/ala-dataquality/ -> Data Quality Checks Spreadsheet

When does 'quality happen' in the Atlas?

The goal of this section of the wiki is to help the user understand how quality fits into the overall data mobilisation process. It would be easy to say 'everywhere', but for new users that answer is like a trendy tasting menu that does nothing for their hunger... so let's tuck in!

A logical data mobilisation (DM) process

There is a series of considerations available to the reader as they decide how to tackle their implementation, but for the purposes of 'when' and 'quality', a definition of the states of data may be helpful:

  1. before standardisation - (in the source-system) data are in a schema specific to the institution (or process, or collection management software, ...)
  2. after standardisation - (at time of export, preparation for export, ...) data are mapped to simple-dwc, possibly with non-standard extensions (a mapping sketch follows below)
  3. after ingest - once data have been through the ingest process, they are available in the biocache
  4. after biocache export - data can be exported from the biocache

From a systems perspective, data can exist in these states in one or many systems (i.e. data or code bases), but we are concerned with their transition from one state to the next, and with what quality tools are applicable or available.
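As flagged in step 2 above, here is a minimal sketch of what standardisation can look like at export time: renaming institution-specific columns onto simple-dwc terms and carrying anything unmapped through as a non-standard extension. The source column and file names are hypothetical.

```python
# A minimal sketch of step 2 (standardisation): mapping institution-specific
# column names onto simple-dwc terms at export time.
# Source column names and file names are hypothetical.
import csv

FIELD_MAP = {
    "taxon_name":   "scientificName",
    "collected_on": "eventDate",
    "lat":          "decimalLatitude",
    "lon":          "decimalLongitude",
    "collector":    "recordedBy",
}

def to_simple_dwc(source_row):
    """Rename mapped columns; keep anything unmapped as a non-standard extension."""
    return {FIELD_MAP.get(k, k): v for k, v in source_row.items()}

with open("source_export.csv", newline="", encoding="utf-8") as src, \
     open("simple_dwc.csv", "w", newline="", encoding="utf-8") as dst:
    rows = [to_simple_dwc(r) for r in csv.DictReader(src)]
    if rows:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```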

TODO - flesh this out a little more

Quality: before standardisation

  1. bulk export from data store
  2. columnar transformation for distribution analysis (a minimal sketch follows this list)
  3. completeness analysis spreadsheet
  4. darwincore specification and the darwincore wiki
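As noted in item 2, a minimal sketch of a columnar transformation for distribution analysis: pivoting the bulk export into per-column value counts so that gaps and odd values stand out before standardisation. The file name and "(empty)" placeholder are assumptions for the sketch.

```python
# Pivot a bulk export into per-column value distributions so that
# gaps, misspellings and odd values are easy to spot before standardisation.
# The input file name is an assumption.
import csv
from collections import Counter, defaultdict

distributions = defaultdict(Counter)

with open("bulk_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for column, value in row.items():
            distributions[column][(value or "").strip() or "(empty)"] += 1

for column, counts in distributions.items():
    print(column)
    for value, n in counts.most_common(5):
        print(f"  {n:6d}  {value}")
```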

Quality: after standardisation

  1. atlas biocache sandbox

Quality: after ingest

  1. biocache facets (a facet query sketch follows this list)
  2. spatial portal
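As flagged in item 1, a minimal sketch of querying biocache facets over the web service to summarise quality assertions after ingest. The host, path, query parameters and response fields shown are assumptions based on the public biocache occurrence search service; check the current API documentation before relying on them.

```python
# Query the biocache occurrence search service for facet counts on the
# 'assertions' field, giving a quick picture of which quality tests are
# being tripped. Host, parameters and data resource id are assumptions.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "data_resource_uid:dr360",   # hypothetical data resource id
    "facets": "assertions",           # facet on quality assertion flags
    "pageSize": 0,                    # facet counts only, no records
})
url = "https://biocache.ala.org.au/ws/occurrences/search?" + params

with urllib.request.urlopen(url) as response:
    result = json.load(response)

for facet in result.get("facetResults", []):
    print(facet.get("fieldName"))
    for field in facet.get("fieldResult", []):
        print(f"  {field.get('count'):8d}  {field.get('label')}")
```

Facet counts on an assertions-style field are a quick way to see, per data resource, which quality tests are firing most often before digging into individual records.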

Quality: after biocache export