RelatedRecords - AtlasOfLivingAustralia/ala-datamob Wiki

This document (http://code.google.com/p/ala-datamob/RelatedRecords) is a draft proposal for handling relationships between data records. The intent of this proposal is to ensure that the data standards and best practice support communicating these relationships where the source data capture them. As at 07/12/2011 this document is a first-release draft for peer review; it supersedes (and subsumes) the earlier IdentifyingRelatedRecords document, also on this wiki.

This document applies to some of the entities represented in biological collection data (i.e. collection items), with a focus on associations amongst specimen (or sighting) records, along with duplicates, other evidential data, or rich content (i.e. multimedia). It is a retrospective analysis of two commonly-used specifications for biodiversity informatics data transfer: DarwinCore 2.x and HISPID 5. ABCD 2, of which HISPID 5 is implemented as an extension, is referenced indirectly. The HISPID-specific use-cases may conflict with ABCD usage; an attempt may be made in the future to highlight these conflicts, but for now the reader should be aware of this possibility. See About the data and Appendix: Relevance to different biodiversity domains for more information.


Requirements

prioritised using MoSCoW - http://en.wikipedia.org/wiki/MoSCoW_Method

General (functional) behaviour required of the collection management system

  1. Each record that is to be shared must have a consistent and predictable unique identifier. The record will most likely represent a collection item, but may also represent other related entities, such as the collector, a location or a site visit;
  2. This identifier should be set by the data custodian, in accordance with a pattern agreed to by the relevant institutional group (i.e. informatics peak body);
  3. If the record represents data that are derived, or duplicated, from other data (see About the data : Scenarios giving rise to relationships) then this relationship should be stored in the collection management system, and communicated within the record using an appropriate method for the data-standard being used to share information (see About the data : Common patterns in the scenarios);
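As a minimal sketch of requirements 1 and 2, assuming a hypothetical agreed pattern of institutionCode:collectionCode:catalogNumber (the real pattern would be whatever the institutional group settles on, and the function names are invented for illustration), a custodian could build and check identifiers like this:

```python
import re

# Hypothetical pattern: institutionCode:collectionCode:catalogNumber.
# This is an assumption for illustration, not a pattern defined by this proposal.
ID_PATTERN = re.compile(r"^[A-Z]+:[A-Za-z]+:[A-Za-z0-9.\-]+$")


def make_record_id(institution_code, collection_code, catalog_number):
    """Build the identifier from stable parts, so a record always gets the same value."""
    return "{}:{}:{}".format(
        institution_code.strip().upper(),
        collection_code.strip(),
        catalog_number.strip(),
    )


def is_valid_record_id(record_id):
    """Check an identifier against the (assumed) agreed pattern."""
    return ID_PATTERN.match(record_id) is not None


record_id = make_record_id(" nsw ", "Herbarium", "123456")
print(record_id)
print(is_valid_record_id(record_id))
```

Because the identifier is derived only from values the custodian already controls, re-exporting the same record yields the same identifier, which is what makes it predictable for consumers.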

Aggregator-specific behaviour

  1. Due to the high likelihood of inconsistency, the data aggregator must not build a single composite record from related records that are determined to be duplicates or constituent parts, without:
    • highlighting the inconsistencies;
    • clearly identifying the provenance of such a record as being the aggregator;
    • either leaving out provenance-specific information (e.g. DwC.institutionCode, HISPID.SourceInstitutionID), or replacing these data with aggregator-specific information;
  2. The aggregator should still ingest a partially complete data set or record when it encounters unrecognised (non-standard) terms (fields, concepts) in the source data;
  3. The aggregator should use methods (see About the data : Common patterns in the scenarios) to discover the existence of related records and clearly offer links to, or notification of, these records;
  4. The aggregator could provide a (web?) service that:
    • responds to resource identifiers with a method for accessing the resource itself (freely or securely);
    • responds to the sub-parts of an identifier with best-guess identifier(s) based on current (and historical) patterns used by the relevant institutional group;
    • potentially offers the identifiers of related records as part of the response.
  5. The aggregator could remember relationships between records, and allow users to navigate this graph;
  6. The aggregator could infer relationships between records, and allow users to confirm these, which it would also remember;
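Requirement 5 (remembering relationships and letting users navigate the graph) can be sketched roughly as follows. This is an illustration only: the class name, method names and record identifiers are all hypothetical, and a real aggregator would persist the links rather than hold them in memory.

```python
from collections import defaultdict, deque


class RelationshipStore:
    """In-memory store of links between record identifiers (illustrative only)."""

    def __init__(self):
        self._links = defaultdict(set)

    def remember(self, record_id, related_id, rel_type):
        # Store the link in both directions so the graph can be walked from either end.
        self._links[record_id].add((related_id, rel_type))
        self._links[related_id].add((record_id, rel_type))

    def related(self, record_id):
        """Records directly linked to record_id, with the relationship type."""
        return sorted(self._links.get(record_id, set()))

    def reachable(self, start_id):
        """Every record that can be reached from start_id by following links."""
        seen = {start_id}
        queue = deque([start_id])
        while queue:
            current = queue.popleft()
            for neighbour, _rel_type in self._links.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(start_id)
        return sorted(seen)


store = RelationshipStore()
store.remember("specimen-1", "image-1", "image")
store.remember("image-1", "sighting-1", "sighting")
print(store.related("specimen-1"))
print(store.reachable("specimen-1"))
```

The same store could hold inferred relationships (requirement 6) once a user has confirmed them; nothing in the sketch distinguishes confirmed links from supplied ones, which a real implementation would need to do.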

Changes to the domain-specific extensions

07/12/2011 NOTE: These recommendations are likely to evolve further...

This document refers to simple-dwc, dwc-archive, dwc-xml, hispid-light, hispid-xml - refer to Appendix: Transport (file) formats in the specifications for more details on these concepts.

(for more details on these requirements, see Proposed specification changes)

... to FCIG-DwC

  1. a mechanism to relate a General record to its principal as well as its parent records, one-to-one for a maximum of two relationships, in all forms of dwc (simple, archives and xml);
  2. dwc-archive and dwc-xml only, ensuring DwC.ResourceRelationship can relate a record to other record(s), one-to-many (many-to-many?), n relationships;

... to HISPID-ABCD

  1. a mechanism to relate a Unit record to its principal as well as its parent records, one-to-one for a maximum of two relationships, in all forms of hispid;
  2. a mechanism to ensure Unit-nested records point to the parent Unit explicitly in the data (allowing for these records to be communicated without nesting):
    • a UnitID/UnitGUID pointer in sub-classes;
    • DateLastEdited information for each of the sub-classes;
    • nested sub-classes include: Unit.Acquisition, Unit.Assemblages, Unit.Associations, Unit.Gathering.SiteImages, Unit.MultiMediaObjects, Unit.Sequences, SpecimenUnit.Preparations, SpecimenUnit.History.PreviousUnits, ...
  3. hispid-xml only, ensuring HISPID.Unit.Associations can relate a record to other record(s), one-to-many, n relationships;
    • consideration must also be given to relating Unit-nested records to other Unit-nested records;

About the data

Core (primary, parent) and supplementary (duplicated, derived, secondary, child) data

In the scope of this logic, core collection data are generally the data that sit (causally) at the source of other data. They are distinguishable from supplementary collection data, which are usually derived in some way; e.g.

The organisational boundary most likely plays a role in the distinction as well; e.g.

Suffice it to say, types of data are likely handled in separate (sub-)systems with separate processes for managing them - this point forces us to reconsider some cases; e.g.

Consideration must also be given to the completeness (https://github.com/AtlasOfLivingAustralia/ala-dataquality/wiki/CompletenessModel) of the data stored in sub-system(s) as an aggregate whole, or the ability to draw these data into a whole at the time of export; e.g.

Any attempt to apply labels to organisations' activities will almost certainly bring to light cases that contradict the reasoning used to assign those labels.

As such, core or supplementary can be considered arbitrary labels used to distinguish the roles data play in relationships, analogous to parent and child respectively, with the latter in both cases implying a subordinate relationship to the former.

All labels are used interchangeably in this document.

Scenarios giving rise to relationships

A scenario can be thought of as 'a circumstance of the data / data management', i.e. it's a way someone is gathering and managing their collection data which results in a relationship being formed, or captured, between two (or more) records.

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=6" up_Title="matrix" up_height=640 width="768" up_refresh=6000 />

From http://goo.gl/TdSLD

Common entities in the scenarios

These are logical (generic, abstract) 'things' that participate in the scenarios, i.e. the things between which relationships exist; in the scope of the document, all scenarios can be broken down to interactions between:

Entities are limited to being either the 'object' or 'subject' of a relationship.

There are obviously other metadata associated with each entity, such as the collector, or the location, or the date, or the url, etc. - these are beyond the scope of the document and therefore remain unaddressed; e.g. the relationship between a specimen and its collector is clearly defined by the group of collector-related metadata in whichever spec the reader is concerned with...

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=8" up_Title="matrix" up_height=590 width="768" up_refresh=6000 />

From http://goo.gl/TdSLD

Common concepts in the scenarios

Concepts could be considered more completely as 'concepts pertinent to record relationships' ... the concepts are also a reduction of the scenarios to a more common base - but a step in a slightly different direction than that of 'scenario and entity'.

These show that key concepts are already addressed in the standards, and that terms already exist to communicate these facts to data consumers.

The concepts in the scenarios are not limited to being either the 'object' or 'subject' of a relationship, whereas the entities are.

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=13" up_Title="matrix" up_height=650 width="768" up_refresh=6000 />

From http://goo.gl/TdSLD

Common patterns in the scenarios

This section attempts to identify instances of common ground shared amongst the scenarios, i.e. a 'lowest common denominator', with the goal of relating each pattern to a prescribed usage within a particular standard.

For each discrete combination of entity-pairs, there is a pattern - a matrix; another way of putting it would be 'entity combinations'. Once these patterns are identified, we can use them to give rigour to the standards analyses in later sections (synopses); i.e. for each pattern, determine if it is currently supported by the standards, or if we need to address it with a non-standard extension.

(For some thoughts on how these patterns are identified, see Appendix: Searching for common patterns in the scenarios)
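A minimal sketch of enumerating the entity combinations, assuming an illustrative entity list (the real entities are those listed under Common entities in the scenarios, not these) and treating pairs as unordered:

```python
from itertools import combinations_with_replacement

# Illustrative only: stand-ins for the entities that participate in the scenarios.
ENTITIES = ["specimen", "sighting", "image", "sound", "sequence"]


def entity_pairs(entities):
    """Every unordered pair of entities, including an entity paired with itself."""
    return list(combinations_with_replacement(entities, 2))


def empty_pattern_matrix(entities):
    """One slot per pair; None means 'not yet checked against the standards'."""
    return {pair: None for pair in entity_pairs(entities)}


matrix = empty_pattern_matrix(ENTITIES)
print(len(matrix))
for pair in list(matrix)[:3]:
    print(pair)
```

Each slot would then be filled in during the standards analyses: either the pattern is supported by the standard as it stands, or it needs a non-standard extension.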

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=1" up_Title="matrix" up_height=415 width="768" up_refresh=6000 />

From http://goo.gl/TdSLD


Proposed specification changes

Introduction

This section offers general information on changes, relevant to both standards. For the standard-specific elaboration on the Requirements section at the front, see Changes to the ... sections that follow.

It is a goal of this proposal to define consistent methods for communicating relationships between records; in all cases, the following design-constraints are considered:

  1. be flexible but solve the problem, and accept that there are multiple solutions - let the data provider choose the best fit;
  2. use the existing standard terms wherever possible, unless the use of these terms in this manner would cause a conflict with the general understanding and/or current practice;
  3. add terms only when the following conditions are met:
    1. the term name at least hints at the purpose (yes, this is purely subjective - but if a number of people agree that the naming makes sense, then that's the best that can be hoped for at this stage);
    2. the term's usage is unambiguously defined in a short description.

In the case of both standards' recommendations, the ramifications are that new fields will be added if the proposal is accepted; if it is not accepted, any case in the Standards-specific usage patterns section that has no standard solution will remain ambiguous and/or unresolvable in the data.

Backwards compatibility

For true 'backwards compatibility', the newer (extended) version of a record must produce the same behaviour in consuming applications as the earlier version would have.

One assumption must be made in this regard: that the data consumer does not halt the load process when it meets an unrecognised field. Ideally it ignores the field; at worst, it drops that particular record as 'an error'.
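The ideal behaviour can be sketched as below. This is an assumption about how a tolerant consumer might be written, not a description of any existing aggregator; the set of known terms and the sample row are made up for the example, with principalRelResourceID standing in for a term the consumer has not yet been taught.

```python
import csv
import io

# Terms this (hypothetical) consumer understands; anything else is skipped.
KNOWN_TERMS = {"occurrenceID", "catalogNumber", "institutionCode"}


def ingest(csv_text):
    """Load simple-dwc style rows, ignoring unrecognised columns rather than failing."""
    records = []
    ignored = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Note the unrecognised column names so they can be reported, not rejected.
        ignored.update(name for name in row if name not in KNOWN_TERMS)
        records.append(
            {name: value for name, value in row.items() if name in KNOWN_TERMS}
        )
    return records, ignored


sample = (
    "occurrenceID,catalogNumber,principalRelResourceID\n"
    "occ-2,IMG-0002,occ-1\n"
)
records, ignored = ingest(sample)
print(records)
print(ignored)
```

An older consumer written this way loads the extended record exactly as it would have loaded the earlier version, which is the backwards-compatible behaviour described above.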

In most cases this is possible; however, where relationships between duplicate records are not clearly defined, multiple 'dots on the map' will appear. The hope is that some future mechanism will mine for these relationships, seek validation of them and then remember them - see Appendix: Relationship mining sub-system (The Relator) for more information.

Principal and parent

These are both pointers to core records (see About the data). They could be the DwC.occurrenceId, HISPID.Unit.UnitGUID or potentially a DwC.catalogNumber, HISPID.Accessions.AccessionNumber, or more!...

20111215 - TODO: add the likely standard term to each usage pattern for parent and principal...

The principal record

The principal is the record that is at the root of this and all other related records; the physical item that was "accessioned into a collection from the field" (eg. single vouchered specimen, still image, movie or sound recording, field notes, core sample, phial of multiple organisms, etc.).

The principal record may be associated with many specimen/sighting records, each with a unique identifier (eg. an image or sound recording containing determinations for many species) as well as derivative records such as gene sequences representing a part of the physical item (eg. leg of spider), or unique records representing different layers of a core sample, etc.

The value in this field will be the unique identifier of the principal, i.e. HISPID.Unit.UnitID, HISPID.Unit.UnitGUID, HISPID.Unit.Accessions.AccessionNumber, DwC.Occurrence.occurrenceId, DwC.Occurrence.catalogNumber, ...

The parent record

The parent is also potentially core data, but is included only if the supplementary data are more than one step removed, eg: a sequence of a dna sample of a specimen, or, a sighting of a digitised sound-track (a collection event), which came from an analogue sound reel (the collection item).

Relationship identifiers' source terms: principal and parent

Because the principal and parent can each be one of a number of different identifiers, some additional terms are needed to communicate where the source data for the principal and parent identifiers exist in those record(s); i.e. under what term should the data consumer search if they wish to find the principal or parent record?

This is important because it allows for relationships between different types of records (i.e. different bases; image:specimen, duplicate image:image...)

Recommended best practice is to use one of the standard dwc/hispid term names used to identify a record, e.g. HISPID.Unit.UnitID, HISPID.Unit.UnitGUID, HISPID.Unit.Accessions.AccessionNumber, DwC.Occurrence.occurrenceId, DwC.Occurrence.catalogNumber, ...
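To illustrate how the identifier and its source term work together, here is a rough sketch using the 'sequence of a dna sample of a specimen' example. It assumes the term names proposed later for FCIG-DwC (principalRelResourceID, principalRelResourceTerm, parentRelResourceID, parentRelResourceTerm), uses bare term names such as catalogNumber rather than fully qualified ones, and all record values are invented.

```python
records = [
    # The principal: the vouchered specimen.
    {"occurrenceID": "occ-1", "catalogNumber": "SPEC-0001"},
    # One step removed: a dna sample, pointing at its principal only.
    {"occurrenceID": "occ-2", "catalogNumber": "DNA-0007",
     "principalRelResourceID": "SPEC-0001",
     "principalRelResourceTerm": "catalogNumber"},
    # Two steps removed: a sequence, pointing at both principal and parent.
    {"occurrenceID": "occ-3", "catalogNumber": "SEQ-0042",
     "principalRelResourceID": "SPEC-0001",
     "principalRelResourceTerm": "catalogNumber",
     "parentRelResourceID": "occ-2",
     "parentRelResourceTerm": "occurrenceID"},
]


def resolve(all_records, record, role):
    """Find the principal or parent of a record; role is 'principal' or 'parent'."""
    target_id = record.get(role + "RelResourceID")
    term = record.get(role + "RelResourceTerm")
    if not target_id or not term:
        return None
    # The source term says which field of the target record holds the identifier.
    for candidate in all_records:
        if candidate.get(term) == target_id:
            return candidate
    return None


sequence = records[2]
print(resolve(records, sequence, "principal")["occurrenceID"])
print(resolve(records, sequence, "parent")["catalogNumber"])
```

Without the source term, the consumer would have to guess whether "SPEC-0001" is an occurrenceID, a catalogNumber or something else; with it, the lookup is a single unambiguous comparison.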

Types of relationships

This term communicates each type of relationship. Recommended best practice is to use the controlled vocab, e.g. image, sound, sighting, duplicate, donation, slide, dna, sequence, blah...

Topics for discussion are the controlled vocab, and whether it highlights 'copies of copies'.

Generic n-relationships logical/sub class: DwC.ResourceRelationship and HISPID.Unit.Associations

DwC: object-relationship-subject model; HISPID: parent-subordinate model;

desired attributes for each relationship :

Other attributes or entities relevant to collections

Collectors, sites, localities, collection events, field collections, surveys; in cases where this information is not an attribute of the record, these data are generally related to the others through one of the following methods:

Additional thoughts...

why have a principal and a parent

complex (deep) hierarchical relationships can be communicated and navigated quite effectively (reliably) with this two-level (root, parent) model, whereas there is a significant level of uncertainty associated with a 'one level up' hierarchy, especially across organisational boundaries; here's some bedtime reading on the topic...
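The difference between the two models can be sketched as follows, assuming simplified 'parent' and 'principal' keys and invented identifiers. The intermediate dna sample is deliberately missing from the data, as might happen when it is held by another institution.

```python
def root_by_walking(records_by_id, record_id):
    """'One level up' model: follow parent pointers until a record has none."""
    seen = set()
    current = record_id
    while current not in seen:
        seen.add(current)
        record = records_by_id.get(current)
        if record is None:
            return None  # a link in the chain is missing, so the root is unknown
        parent = record.get("parent")
        if parent is None:
            return current
        current = parent
    return None  # the pointers form a cycle


def root_by_principal(records_by_id, record_id):
    """Two-level model: the record names its root directly."""
    record = records_by_id.get(record_id)
    if record is None:
        return None
    return record.get("principal", record_id)


# 'dna-1' is not available to this consumer.
records_by_id = {
    "specimen-1": {},
    "seq-1": {"parent": "dna-1", "principal": "specimen-1"},
}
print(root_by_walking(records_by_id, "seq-1"))
print(root_by_principal(records_by_id, "seq-1"))
```

Walking upward fails as soon as one intermediate record is unavailable, whereas the principal pointer still identifies the root.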

principal, parent: why not imply an object-relationship-subject pattern (ie, bi-directional); instead, a subordinate relationship is implied

gut feeling is this becomes too complex, whereas it is the simple forms that are likely to use principal and parent... where desire exists to point in both directions within the one, it might be time to use the appropriate n-relationships class

Preparations versus basis of record


Changes to the FCIG-DwC extension

07/12/2011 NOTE: these are likely to evolve further

To FCIG-DwC; and after community consultation, ALA to propose to TDWG for ratification+inclusion in standard DwC :

  1. a mechanism to relate a General record to its principal as well as its parent records, one-to-one for a maximum of two relationships, in all forms of dwc (simple, archives and xml):
    1. add principalRelResourceID to Record;
    2. add principalRelResourceTerm to Record;
    3. add principalRelType to Record;
    4. add parentRelResourceID to Record;
    5. add parentRelResourceTerm to Record;
    6. add parentRelType to Record;
    7. with the following still being agonised over (simple-dwc remember!)...
      • add principalRelID to Record;
      • add principalRelDate to Record;
      • add parentRelID to Record;
      • add parentRelDate to Record;
  2. dwc-archives and dwc-xml only, using DwC.ResourceRelationship to relate a record to other record(s), one-to-many (many-to-many?), n relationships:
    1. add resourceIDTerm to ResourceRelationship;
    2. add relatedResourceIDTerm to ResourceRelationship;
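As a sketch of item 2, the following models a ResourceRelationship row with the existing DwC terms resourceID, relationshipOfResource and relatedResourceID, plus the two proposed terms. The default of "occurrenceID" for the proposed terms, the relationship labels and the identifiers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRelationship:
    resourceID: str
    relationshipOfResource: str
    relatedResourceID: str
    # Proposed terms: which field each identifier was drawn from.
    resourceIDTerm: str = "occurrenceID"
    relatedResourceIDTerm: str = "occurrenceID"


def related_to(relationships, resource_id):
    """All relationships declared by one record (one-to-many)."""
    return [
        (r.relationshipOfResource, r.relatedResourceIDTerm, r.relatedResourceID)
        for r in relationships
        if r.resourceID == resource_id
    ]


relationships = [
    ResourceRelationship("occ-2", "image of", "occ-1"),
    ResourceRelationship("occ-2", "duplicate of", "IMG-0009",
                         relatedResourceIDTerm="catalogNumber"),
]
for entry in related_to(relationships, "occ-2"):
    print(entry)
```

The second row shows why the proposed terms matter: the related record is identified by its catalogNumber, not its occurrenceID, and the row says so explicitly.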

Consideration to the existing DwC standard

Some of these are viable mechanisms for relationship mining - see Appendix: Relationship mining sub-system (The Relator) for more information.

how should i uniquely identify my multimedia collection item then?

at first glance, a generalRecordID appears to be missing from dwc - but i reckon catalogNumber was meant to fulfil this role

the description 'an identifier (preferably unique) for the record within the data set or collection' seems to make far more sense if it lived under General, along with institutionCode, collectionCode, and the like.

in the case of specimen/sighting records, it seems that this term has become synonymous with an accession number - a multimedia record may well have an accession number and its own identifier... but it also has an occurrenceID

therefore, the advice i'm leaning towards at this stage is to :

same occurrenceID to relate records

same catalogNumber to relate records

associatedOccurrences to relate records

associatedTaxa to relate records

associatedMedia to relate records

associatedSequences to relate records

why add terms to Record?

so people who use simple-dwc can communicate record relationships

DwC.ResourceRelationship

looks like it covers most scenarios:


Changes to the HISPID specification

07/12/2011 NOTE: these are likely to evolve further

To HISPID, and after community consultation / trial of changes, proposed to TDWG for inclusion in ABCD.

  1. a mechanism to relate a Unit record to its principal as well as its parent records, one-to-one for a maximum of two relationships, in all forms of hispid:
    1. add principalUnitID to Unit;
    2. add principalUnitTerm to Unit;
    3. add principalRelType to Unit;
    4. add parentUnitID to Unit;
    5. add parentUnitTerm to Unit;
    6. add parentRelType to Unit;
    7. with the following still being agonised over (hispid-light remember!)...
      • add principalRelID to Unit;
      • add principalRelDate to Unit;
      • add parentRelID to Unit;
      • add parentRelDate to Unit;
  2. a mechanism to relate Unit-nested related records to the parent Unit explicitly in the data (allowing for these records to be communicated without nesting):
    • sub-classes (Unit.Acquisition, Unit.Assemblages, Unit.Associations, Unit.Gathering.SiteImages, Unit.MultiMediaObjects, Unit.Sequences, SpecimenUnit.Preparations, SpecimenUnit.History.PreviousUnits, ...) must display the following attributes:
      • a UnitID/UnitGUID pointer;
      • DateLastEdited information;
    • consideration must also be given to relating these Unit-nested records to other Unit-nested records;
  3. hispid-xml only, ensuring HISPID.Unit.Associations can relate a record to other record(s), one-to-many, n relationships:
    1. add AssociationDate;
    2. add AssociationID;
    3. confirm UnitGUID points to this unit (not that unit);
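Item 2 can be pictured with a small sketch that pulls nested records out of a Unit and gives each one an explicit pointer back, then rebuilds the nesting from the pointers alone. Plain dictionaries stand in for HISPID XML here, and the field names other than UnitID and DateLastEdited, along with all values, are illustrative assumptions rather than the schema itself.

```python
def unnest(unit, nested_key):
    """Split nested records out of a unit, giving each a UnitID pointer to its parent."""
    flat_unit = {k: v for k, v in unit.items() if k != nested_key}
    children = []
    for child in unit.get(nested_key, []):
        child_copy = dict(child)
        child_copy["UnitID"] = unit["UnitID"]  # explicit pointer to the parent Unit
        children.append(child_copy)
    return flat_unit, children


def renest(flat_unit, children, nested_key):
    """Rebuild the nested form using only the UnitID pointers."""
    unit = dict(flat_unit)
    unit[nested_key] = [
        {k: v for k, v in child.items() if k != "UnitID"}
        for child in children
        if child.get("UnitID") == flat_unit["UnitID"]
    ]
    return unit


unit = {
    "UnitID": "UNIT-1",
    "DateLastEdited": "2011-12-07",
    "MultiMediaObjects": [
        {"FileURI": "http://example.org/img1.jpg", "DateLastEdited": "2011-12-01"},
    ],
}
flat_unit, children = unnest(unit, "MultiMediaObjects")
print(children)
print(renest(flat_unit, children, "MultiMediaObjects") == unit)
```

Because each child carries its own pointer and its own DateLastEdited, the unit and its multimedia records could travel in separate files and still be reassembled by the consumer.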

Standards-specific usage patterns

07/12/2011 NOTE: this section is still being pulled together

This section will provide use-cases and samples...

DarwinCore usage patterns

DwC synopsis for communicating relationships

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=10" up_Title="matrix" up_height=1000 width="840" up_refresh=6000 />

From http://goo.gl/TdSLD

HISPID5 usage patterns

Specimen / Image Relationships with HISPID 5 :

HISPID synopsis for communicating relationships

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&hl=en_US&gid=12" up_Title="matrix" up_height=1000 width="840" up_refresh=6000 />

From http://goo.gl/TdSLD

Some early HISPID analysis ...

This section looks at the problem from the hispid perspective; this is what i want to do in hispid - what cases does it apply to? Here is a crude glossary:

SpecimenUnit record with nested SiteImage record(s)

SpecimenUnit record with nested MultiMediaObject record(s)

MultiMediaObject/SiteImage record stands alone, from which sighting (specimen?) records are interpolated

HISPID would allow for this but each inferred SpecimenUnit record must have a nested MMO/SI record - in turn, each MMO/SI record would be a redundant copy of the other(s), with all the MMO/SI records sharing the same identifier.

An image collection item is treated like more traditional specimen(s); this image may be related to other more traditional specimen(s)

In this case, the image is the SpecimenUnit; it may be a duplicate, or derivation, of other SpecimenUnit(s); it has nested MultiMediaObject(s) to point to the image where it resides (usage pattern 1).

The image may be managed in a digital asset management system (DAMS) or it might be simply a file sent on the email from another institution. It may be desirable as the aggregator to keep both source records (the whole may be equal to or greater than the sum of the parts); however, in this case, the relationship between both records must be maintained.

SpecimenUnit record and MultiMediaObject / SiteImage record stand alone but share a 1:1 relationship

'separate storage mechanisms' imply some challenge in integrating the data; up until this point, it is assumed that the data provider will be able to build an aggregate SpecimenUnit record at the time of export, when they are communicating imagery associated with the specimen;

some of the impediments to this 'just-in-time aggregation' that have been encountered so far are a lack of technical capability, a physical separation & a temporal separation;

MultiMediaObject or SiteImage records are nested under Unit (spec suggests that MMO/SI is part of unit); however, MMO/SI has no way of communicating explicitly that it belongs to Unit - without this mechanism, each of these types of records could not exist in separate files, yet still maintain their relationship;

this usage pattern doesn’t appear to be supported by hispid as it stands; we would need to add a term to MMO/SI to point to its related Unit - so it is anticipated that the main users of such a mechanism would have a need to communicate SpecimenUnit records completely separately to their image records, yet have a desire that these be related.


Appendices

Appendix: Relevance to different biodiversity domains

This document is relevant to a number of different groups sharing biodiversity information. The boundaries may be :

... see BioDomains on this wiki for more detailed information

Appendix: Transport (file) formats in the specifications

Simple dwc

csv darwincore

DwC archives

rough analogy to a star-schema:

DwC xml

full darwincore compliant xml file

HISPID light

standing proposal with hiscom to support a simpler file-format which is still hispid5-compliant (note: this is not hispid 3/4)

HerbariaDM

HISPID xml

valid hispid5 xml

Appendix: URNs as record identifiers

Uniform resource names (URNs) are textual data intended for use as identifiers of other data or services (resources), and display the following attributes:

Appendix: LSIDs

GJR's section

For the purposes of this document, LSIDs are synonymous with URNs??

Structure of, pattern rules

Basis of record contained within LSID

Historical mappings to earlier patterns

LSID responder (URN resolution)

For the purposes of this document, LSIDs are synonymous with URNs??

Intended behaviour

Output

Multiple matches to a LSID

Appendix: Searching for common patterns in the scenarios

The section About the data : Common patterns in the scenarios attempts to identify instances of common ground shared amongst the scenarios, i.e. a 'lowest common denominator', with the goal of relating each pattern to a prescribed usage within a particular standard.

A standing assumption: relationships between records are captured in the data management processes, and are evident in the source data.

A little about patterns

A pattern can be thought of as a generic method for solving a particular problem. It doesn't delve into specifics per se, rather, defines the requirements for a solution.

It is then up to the implementation to ensure the requirements are met within its own context (more info: http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29).

It could be said that the scenarios themselves are actually instances of a particular solution to a problem, each within a given context and/or domain. A solution exists within its source-system's boundaries, generating and/or capturing data relevant to its activities - this forms part of your operational data store.

Why bother identifying these patterns?

When it comes time to map these data to a standard form, we must ensure the relationship information continues to be discoverable in an explicit manner: at the business end, we need a way to handle each of the 'common denominators' in each of the data standards we are referring to in this document.

In addition, scenarios will almost certainly arise that are not in the list of scenarios in About the data; however, a method for solving each such problem will already have been defined.

Some of these methods will be applicable to the specifications as they stand, and others will require a currently non-standard solution. This is clearly defined in Standards-specific usage patterns.

More on relationships

In database parlance, the cardinality of a particular relationship is defined in a handful of ways: one-to-one, one-to-many (/many-to-one) and many-to-many (see http://en.wikipedia.org/wiki/Cardinality_%28data_modeling%29).

The right choice depends on the process and data being captured; but the method employed to communicate these will likely differ even within the same circumstance.
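As a small illustration of these cardinalities, the sketch below classifies a set of record-to-record links. The function name and the identifiers are invented for the example.

```python
from collections import Counter


def cardinality(pairs):
    """Classify a collection of (left, right) links by their cardinality."""
    unique_pairs = set(pairs)
    per_left = Counter(left for left, _right in unique_pairs)
    per_right = Counter(right for _left, right in unique_pairs)
    left_has_many = any(count > 1 for count in per_left.values())
    right_has_many = any(count > 1 for count in per_right.values())
    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"


print(cardinality([("specimen-1", "image-1")]))
print(cardinality([("specimen-1", "image-1"), ("specimen-1", "image-2")]))
print(cardinality([("sighting-1", "image-1"), ("sighting-2", "image-1")]))
print(cardinality([("s-1", "i-1"), ("s-1", "i-2"), ("s-2", "i-1")]))
```

A one-to-one or many-to-one link can be carried by a single pointer field on the record (the principal and parent terms), whereas one-to-many and many-to-many links need a repeating structure such as DwC.ResourceRelationship or HISPID.Unit.Associations.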

Appendix: Embedding a google docs spreadsheet in a wiki page

In your google docs spreadsheet: File menu -> Publish to the web; then click Start publishing; then scroll down to Get a link... and choose Web page from the drop-down menu; finally, copy the url from the text box - you will need this url in the wiki markup that follows...

In the wiki markup (customising the following code to suit your specific case)...

<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/117808631063490062819/url.xml" up_Url="https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AiNWJFdh4pHZdDY2cEgxdHZ5YlBoTUx4MGVfVTgyTGc&single=true&gid=0&output=html" up_Title="matrix" up_height=360 width="97%" up_refresh=6000 />

I noticed some funny business in firefox 7 with no-script under windows 7, which was resolved by 'allowing' the gmodules.com domain