CompletenessModel - AtlasOfLivingAustralia/ala-dataquality GitHub Wiki

Completeness models

Completeness analysis puts the mapping and design process on rails.
A. Tindall

The quality completeness model (completeness model, or CM) is a first step toward analysing the fitness for use, or improving the quality at source, of a data export. Completeness analysis allows a user to look at the availability of supporting information, without specifically delving into content analysis (i.e. looking at the values in the data).

Each model is tied to a specific biodiversity informatics data standard; there are two in the works:

  • DwC (Darwin Core), see CompletenessModelDwC
  • HISPID, see CompletenessModelHISPID

The models generally don't cover their respective specifications entirely (i.e. don't reference all fields in the data standard). They can be considered a tool to aid in a deeper understanding of the source data, as well as the desired transformation standard.


Quick start summary for a new completeness model

New DwC completeness model

(Note: visit CompletenessModelDwC for details...)

  1. open the master spreadsheet:
    • https://docs.google.com/spreadsheet/ccc?key=0AiNWJFdh4pHZdHJnQU01bXlDeUdaMkJ5Z1EwN01HaGc&hl=en_US#gid=0 -or- http://goo.gl/XNehZ
  2. create a copy: File menu -> Make a copy...
  3. store your existing (or planned) mapping in a new file somewhere:
    • create a spreadsheet using Excel, Google Docs, OpenOffice Calc, etc.
    • fill column A with the target DwC term and, if possible, column B with the source field (optional)
    • tip: if you have an existing export that you want to analyse, copy the header row and 'paste special -> transpose' (or equivalent) into a new spreadsheet; this should transform the fields from 'one field name per column' to 'one field name per row'
    • export this list of fields to a new text file, tab-delimited
  4. go back to your new Google Docs spreadsheet and import the column names into the Mapping sheet:
    • File menu -> Import...
    • select the file you created earlier, and choose Append rows to current sheet
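
For exports that are too wide to transpose comfortably by hand, the 'one field name per row' step can also be scripted. The following is a minimal sketch only, assuming the export is a comma-delimited text file with a header row; the function names and the sample data are illustrative and not part of the completeness model itself.

```python
import csv
import io


def header_fields(export_text, delimiter=","):
    """Return the field names found in the first row of a delimited export."""
    reader = csv.reader(io.StringIO(export_text), delimiter=delimiter)
    header = next(reader, [])
    # Drop empty header cells and surrounding whitespace.
    return [name.strip() for name in header if name.strip()]


def fields_to_tsv(fields):
    """Render one field name per row as tab-delimited text, ready for import."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for name in fields:
        writer.writerow([name])
    return out.getvalue()


# Hypothetical sample export: a header row followed by one data row.
sample_export = (
    "catalogNumber,scientificName,decimalLatitude,decimalLongitude\n"
    "A1,Acacia dealbata,-35.3,149.1\n"
)
fields = header_fields(sample_export)
print(fields_to_tsv(fields), end="")
```

The resulting text can be saved to a file and imported in step 4 in the same way as a hand-made list.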

New HISPID completeness model

2011/12/21 - TBA.

Physical implementation

The models manifest themselves as a Google Docs spreadsheet with the following parts:

  • schedule of source to target fields (Mapping, first sheet),
  • a table of these terms with additional information (QualityRelationships, second sheet), and
  • a set of rules that highlight potentially absent data (completeness tests).

The QualityRelationships sheet cross-references data-source information in the Mapping sheet with two distinct, logical, groupings of terms from the data standard. It is designed to aid the user in better understanding their source-data and its mapping to the standard. It also includes value metadata and pointers to other related terms.
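
The cross-reference can be pictured as a join between the mapping and a table of terms. The sketch below is an illustration under stated assumptions, not the spreadsheet's actual formulas: the group labels ("spatial", "temporal"), the related-term pointers, and the source field names are hypothetical, while the term names are taken from Darwin Core.

```python
from dataclasses import dataclass, field


@dataclass
class TermInfo:
    group: str                                    # hypothetical grouping label
    related: list = field(default_factory=list)   # pointers to related terms


# Hypothetical excerpt of a relationships table, keyed by Darwin Core term.
RELATIONSHIPS = {
    "decimalLatitude": TermInfo("spatial", ["decimalLongitude", "geodeticDatum"]),
    "decimalLongitude": TermInfo("spatial", ["decimalLatitude", "geodeticDatum"]),
    "geodeticDatum": TermInfo("spatial", ["decimalLatitude", "decimalLongitude"]),
    "eventDate": TermInfo("temporal"),
}


def cross_reference(mapping, relationships):
    """Join each term in the relationships table to its source field, if mapped."""
    rows = []
    for term, info in relationships.items():
        rows.append({
            "term": term,
            "group": info.group,
            "source": mapping.get(term),  # None when the term is unmapped
            "related": list(info.related),
        })
    return rows


# Hypothetical mapping: target term -> source field.
mapping = {"decimalLatitude": "lat", "decimalLongitude": "lon"}
rows = cross_reference(mapping, RELATIONSHIPS)
for row in rows:
    print(row["term"], row["group"], row["source"])
```

Rows whose source is empty are the ones a reader would inspect first when looking for gaps in the mapping.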

The completeness tests are a series of logic rules that highlight potentially absent data by working through the Mapping and QualityRelationships sheets. See the Implementation details section in the details document for DwC (CompletenessModelDwC) or HISPID (CompletenessModelHISPID) for more information.

Completeness test example

A notable example of a completeness test is the datum associated with geographical coordinates, or the projection associated with a grid reference. See Handling spatial location data (http://goo.gl/7Dwrl) for more details on this topic.

Without the datum or projection, there is an additional level of uncertainty inherent in the coordinates when a data consumer begins an analysis. Including this additional information (i.e. metadata) improves the overall fitness for use.
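
A test of this kind can be expressed as a rule of the form "if any of these terms are mapped, this companion term should be mapped too". The sketch below is one possible way to write such a rule outside the spreadsheet, not the model's own implementation. The function name and rule layout are hypothetical; the term names come from Darwin Core, and pairing verbatimCoordinates with verbatimSRS for the grid-reference case is an assumption.

```python
def missing_companions(mapped_terms, rules):
    """Return (trigger_terms, companion) pairs where the companion term is unmapped."""
    mapped = set(mapped_terms)
    findings = []
    for triggers, companion in rules:
        # A rule applies when at least one trigger term is present in the mapping.
        if mapped & set(triggers) and companion not in mapped:
            findings.append((triggers, companion))
    return findings


# Hypothetical rule set: (terms that trigger the rule, companion term expected).
RULES = [
    (("decimalLatitude", "decimalLongitude"), "geodeticDatum"),
    (("verbatimCoordinates",), "verbatimSRS"),  # assumed pairing for grid references
]

# A mapping with coordinates but no datum is flagged.
for triggers, companion in missing_companions(
    ["catalogNumber", "decimalLatitude", "decimalLongitude"], RULES
):
    print("mapped:", ", ".join(triggers), "| potentially absent:", companion)
```

The rule only reports that supporting information may be absent from the mapping; it does not inspect the coordinate values themselves, which keeps it within completeness analysis and out of content analysis.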

Tests also exist for:

  • ensuring the logical components of a record can be uniquely identified,
  • institutional (custodian?) and researcher (owner?) provenance,
  • data usage rights and attribution,
  • spatial and temporal data,
  • research methodology and publication,
  • taxonomy,
  • ...

Potential uses

When planning a data export, overall completeness could be considered best practice; however, the source data might not allow for this.

This highlights a number of other potential uses for a completeness analysis:

  1. new data store design,
  2. existing data store and data-capture process enhancement,
  3. historical gap analysis & automated quality-assessment,
  4. supplementary data export planning,
  5. ...

Existing completeness models

DwC files

google docs -> CompletenessDwC http://goo.gl/WZj7Z -or- https://docs.google.com/open?id=0ByNWJFdh4pHZZTExOTU5ZTMtNjE4Yy00ODk0LWIxNzktOGRhZTczZDgxMmUz

HISPID files

google docs -> CompletenessHISPID http://goo.gl/TPsTb -or- https://docs.google.com/open?id=0ByNWJFdh4pHZZDMzNDA2ZWEtOTQ5OS00MDZjLTg2MDItMGZlYmE0NjU4MzYw


Future plans for completeness models
