CompletenessModelDwC - AtlasOfLivingAustralia/ala-dataquality Wiki

Introduction

This model is a first step toward analysing the completeness of a DwC (Darwin Core) export. It also serves as a tool for designing new data stores and for documenting the decisions and compromises made as exports evolve.

The analysis tool lives in the spreadsheet linked below. It is used to see what other data might be added to an export to enhance its usage potential; it does not do any content analysis.

The executive summary of the spreadsheet is this:

Please note: a general introduction to the concepts referred to on this page lives at CompletenessModel. This page is also part of a series of pages relating to data mobilisation.

Create a new DwC completeness model

  1. open the master spreadsheet:
    • https://docs.google.com/spreadsheet/ccc?key=0AiNWJFdh4pHZdHJnQU01bXlDeUdaMkJ5Z1EwN01HaGc&hl=en_US#gid=0 -or- http://goo.gl/XNehZ
  2. create a copy: File menu -> Make a copy...
  3. store your existing (or planned) mapping in a new file somewhere:
    • create a spreadsheet using Excel, Google Docs, OpenOffice Calc, etc...
    • fill column A with the target DwC term and, if possible, column B with the source field (optional)
    • tip: if you have an existing export that you want to analyse, copy the header row and 'paste special->transpose' (or equivalent) into a new spreadsheet; this should transform the fields from 'one field name per column' to 'one field name per row'
    • export this list of fields to a new text file, tab-delimited
  4. go back to your new Google Docs spreadsheet and import the column names into the Mapping sheet:
    • File menu -> Import...
    • select the file you created earlier, and choose Append rows to current sheet
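
If a spreadsheet application is not handy, the transpose-and-export part of step 3 can also be scripted. The following is a minimal sketch, not part of the spreadsheet tool itself: the function names and the sample export are made up for illustration, and it assumes the export is a delimited text file whose first row holds the field names.

```python
import csv
import io


def header_to_rows(export_text, delimiter=","):
    """Return the field names found in the first row of a delimited export."""
    reader = csv.reader(io.StringIO(export_text), delimiter=delimiter)
    header = next(reader, [])
    # Strip stray whitespace so that later exact-match lookups are not defeated.
    return [name.strip() for name in header if name.strip()]


def rows_to_tsv(field_names):
    """Write one field name per row: column A = target term, column B left empty."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for name in field_names:
        writer.writerow([name, ""])
    return out.getvalue()


# Hypothetical sample export; only the header row matters here.
sample = (
    "occurrenceID,scientificName,decimalLatitude,decimalLongitude\n"
    "abc-1,Example species,-25.3,131.0\n"
)
fields = header_to_rows(sample)
tsv = rows_to_tsv(fields)
print(tsv)
```

The resulting text can be saved as a tab-delimited file and imported in step 4. Column B is left empty here because, for an existing export, the source field names are not known from the header alone.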

Implementation details: Mapping and QualityRelationships sheets

Mapping sheet

description of the mapping sheet...

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/ccc?key=0AiNWJFdh4pHZdHJnQU01bXlDeUdaMkJ5Z1EwN01HaGc&hl=en_US#gid=0" up_Title="DwC quality completeness v2.6 - master copy.Mapping" up_height=460 width="840" up_refresh=6000 />

DwC quality completeness v2.6 - master copy.Mapping sheet

TargetField

This should be the (short) name of a data-standard term; it must match exactly for the lookup algorithm in QualityRelationships to find this row.
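
Because the match has to be exact, it can help to check the TargetField values before importing them. The sketch below is an illustration only: `check_target_fields` and the short `KNOWN_TERMS` list are hypothetical (a real check would use the full list of term names from the standard), and it takes the strict reading that case and surrounding whitespace both matter. Whether the spreadsheet's own lookup tolerates such differences is not stated here.

```python
# A few Darwin Core term names, used only as sample data for this sketch.
KNOWN_TERMS = ["occurrenceID", "scientificName", "eventDate", "decimalLatitude"]


def check_target_fields(target_fields, known_terms=KNOWN_TERMS):
    """Sort target fields into exact matches, near misses and unknown names.

    A near miss differs from a known term only by case or by leading or
    trailing whitespace; it is reported with the term it probably meant.
    """
    known = set(known_terms)
    by_normalised = {term.strip().lower(): term for term in known}
    exact, near_miss, unknown = [], {}, []
    for field in target_fields:
        key = field.strip().lower()
        if field in known:
            exact.append(field)
        elif key in by_normalised:
            near_miss[field] = by_normalised[key]
        else:
            unknown.append(field)
    return exact, near_miss, unknown


exact, near_miss, unknown = check_target_fields(
    ["occurrenceID", "ScientificName", " eventDate", "collector_name"]
)
print("exact:", exact)
print("near misses:", near_miss)
print("unknown:", unknown)
```

Near misses are the rows most likely to go silently unmatched, so they are worth correcting in column A before the import.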

SourceField

If this column has no value but the matching column A (TargetField) has a value, the following behaviour occurs in QualityRelationships:

Otherwise, values entered here will appear in the QualityRelationships 'Mapped column' field. Some special strings in this cell affect the conditional formatting in QualityRelationships:

Status, last update

Other

QualityRelationships sheet

description of the qr sheet...

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/ccc?key=0AiNWJFdh4pHZdHJnQU01bXlDeUdaMkJ5Z1EwN01HaGc&hl=en_US#gid=1" up_Title="DwC quality completeness v2.6 - master copy.!QualityRelationships" up_height=460 width="840" up_refresh=6000 />

DwC quality completeness v2.6 - master copy.QualityRelationships sheet

ALA quality facet

The completeness model introduces a complementary method for grouping the DwC terms – quality facets. These facets prove common to most of the existing logical groups already defined (see DwC group link).

As a result of this commonality, the quality facets can also be seen as a series of steps for mapping to, or assessing completeness of, each DwC group.

This completeness assessment encourages 'quality at source', and is a precursor to more comprehensive analysis of the content.

Both the completeness assessment and the content analysis contribute to an overall assessment of the data quality, and help the end user determine fitness for use.

Current values are:

  1. basis - founds the digital record: 1. as unique within a system, 2. as related to other versions of the same
  2. precision - term is used as an indicator of the precision in the DwC group; and uncertainty - a statement about the uncertainty of data in the DwC group
  3. spatial - place/space pointer; and temporal - time pointer
  4. verbatim - storing the raw data if any parsing/processing has occurred (allowing for reproduction/reinterpretation)
  5. notes - free(er) text, related to some concept, in a particular DwC group; and verification - pertaining to the veracity of the DwC group
  6. ID - system-unique identifier of a concept related to the DwC group
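
To illustrate how the facets can act as steps in a completeness assessment, the sketch below counts, per facet, how many terms an export has mapped. It is a rough illustration and not the spreadsheet's own logic: `facet_completeness` is a hypothetical helper, and the term-to-facet assignments in `TERM_FACETS` are assumptions made for the example rather than values copied from the QualityRelationships sheet.

```python
from collections import defaultdict

# Hypothetical term-to-facet assignments, for illustration only.
TERM_FACETS = {
    "occurrenceID": "basis",
    "decimalLatitude": "spatial",
    "decimalLongitude": "spatial",
    "eventDate": "temporal",
    "coordinateUncertaintyInMeters": "uncertainty",
    "verbatimEventDate": "verbatim",
}


def facet_completeness(mapped_terms, term_facets=TERM_FACETS):
    """Return {facet: (mapped_count, total_count)} for the given mapped terms."""
    mapped = set(mapped_terms)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for term, facet in term_facets.items():
        totals[facet] += 1
        if term in mapped:
            hits[facet] += 1
    return {facet: (hits[facet], totals[facet]) for facet in totals}


summary = facet_completeness(["occurrenceID", "decimalLatitude", "eventDate"])
for facet, (done, total) in summary.items():
    print(f"{facet}: {done}/{total} terms mapped")
```

Facets with a low count point to the kind of data that might be added to the export, which is the question the model is meant to answer.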

Type

  1. area - value identifies an area in some spatial standard
  2. date - value is a date in some temporal standard
  3. elevation - value identifies an elevation (+/-) in some spatial standard
  4. multiple - values, separated according to the DwC spec.
  5. point - value defines a single point in space using some standard
  6. range - value is a range defined according to the DwC spec.
  7. referent - an entity external to these system(s), eg: a person, publication, organisation
  8. unique - value must be unique within this basis (see quality facet: basis)
  9. uri - a digital reference to a concept arbitrated in/by another system (implies a constrained set of valid values)
  10. value - no specific restrictions, but the spec. may define 'best practice'
  11. vocab - recommended controlled vocab

DwC group link

The Darwin Core grouping of terms, linked to the wiki discussion on the TDWG website.

DwC term link

Mapped column

The field in the Mapping sheet associated with this DwC term.

Verbatim term

Quality rel terms

Other columns

Mapped-not-null


Implementation details: Completeness tests