Multiple Source Data Resolution - ge-high-assurance/RACK GitHub Wiki

Problem Statement

Ingesting RACK data (THINGs) from multiple sources requires an automated method for resolving that data into a single coherent model.

NOTE: The term THING is used generically here because this document discusses RACK data from multiple sources, and THING is the base class of the ontology. While most conversations on this topic have centered on ENTITYs, the same issues could apply to an AGENT or ACTIVITY, so the common class THING is used throughout this document.

Assumptions

  1. Source materials are typical “Engineering”-type documents.

    • THINGs derived from a document will have a locally unique identifier.
    • When the same THING is referenced multiple times in the same source material, the same identifier is used each time.
  2. Source Materials with references to THINGs in different source materials will have clear associations between shared THINGs. Unclear associations are extremely rare and can be handled with special inputs/user actions.

    • A clear association is where a shared, unique, systematically defined association between THINGs can be identified

      • [SYS.REQ-123]\D
      • SYS.REQ-123
      • REQ-123
    • An unclear association is one where no shared, unique, systematically defined association can be identified between THINGs that should be related

      • Nav System IO
      • Input / Output
  3. The resolution association between source materials can be provided by the user at a class level.

    • The fact that HLR Doc has references to SLRs that are defined in the SLR Doc can be provided by the user. A unique relationship for each SLR in the HLR Doc to the associated SLR in the SLR Doc will not be provided.
  4. Source Materials must be allowed to be ingested one at a time.

  5. Order of Ingestion of source materials must be independent.

  6. Each THING will have a single prime source that will be identifiable.
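
As a rough illustration of assumption 2, the sketch below reduces the three clear forms listed above to one shared key and rejects the unclear ones. The regular expression, the function name `association_key`, and the choice to ignore the `SYS.` prefix are assumptions made for this example, not part of RACK.

```python
import re

# Assumed identifier shape: optional "SCOPE." prefix, a kind such as REQ, a dash, digits.
# Both the pattern and the (kind, number) key are illustrative choices only.
_ID_PATTERN = re.compile(r"(?:(?P<scope>[A-Z]+)\.)?(?P<kind>[A-Z]+)-(?P<number>\d+)")


def association_key(raw):
    """Return a (kind, number) key for a clear reference, or None if it is unclear."""
    match = _ID_PATTERN.search(raw)
    if match is None:
        return None
    return (match.group("kind"), int(match.group("number")))


clear = ["[SYS.REQ-123]", "SYS.REQ-123", "REQ-123"]
unclear = ["Nav System IO", "Input / Output"]

# All clear forms collapse to the same key; unclear forms yield None.
clear_keys = {association_key(text) for text in clear}
unclear_keys = [association_key(text) for text in unclear]
```

Dropping the scope prefix is only safe if the kind and number are unique across source materials; where they are not, the scope would need to be part of the key.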

CTQs

  • Fidelity - Ingested data needs to maintain all the information provided in the original source data.
  • Speed - Ingestion should be responsive (seconds to minutes, not hours).
  • Scalability - Ingestion should work on both large and small data sets.
  • Transparency - Users should be able to tell when ingestion was successful, and any errors should be made apparent with actionable error messages that help the user correct them.

Use Cases

User ingests THINGs identified from a document (Doc1); this document references THINGs that are sourced from a different document (Doc2). Doc2 has been previously ingested, and all references in Doc1 can be matched to the THINGs in Doc2.

User ingests THINGs identified from a document (Doc1); this document references THINGs that are sourced from a different document (Doc2). Doc2 has not been previously ingested, so the references in Doc1 are to not-yet-defined THINGs.

User ingests THINGs identified from a document (Doc1); this document references THINGs that are sourced from a different document (Doc2). Doc2 has been previously ingested and some references in Doc1 can be matched to THINGs in Doc2; other references are not found, so these references in Doc1 are to not-yet-defined THINGs.

User re-ingests THINGs identified from a document (Doc1); THINGs in this document are referred to by a different document (Doc2). Some THINGs that are no longer in the new Doc1 are still referenced by Doc2.
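
The four use cases can be walked through with a small in-memory model. This is only a sketch of the bookkeeping involved, assuming references are matched by identifier; the `Store` class, its methods, and the HLR/SLR identifiers are hypothetical and do not reflect how RACK stores data.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the ingested model."""
    defined: dict = field(default_factory=dict)    # identifier -> prime source document
    references: set = field(default_factory=set)   # (referencing document, identifier)

    def ingest(self, doc, defines, refers_to):
        # Re-ingestion: drop what this document contributed before, then add the new content.
        self.defined = {i: d for i, d in self.defined.items() if d != doc}
        self.references = {(d, i) for d, i in self.references if d != doc}
        for identifier in defines:
            self.defined[identifier] = doc
        for identifier in refers_to:
            self.references.add((doc, identifier))

    def unresolved(self):
        """References whose target is not (or is no longer) defined by any document."""
        return sorted((d, i) for d, i in self.references if i not in self.defined)


store = Store()

# Doc1 arrives first: its references to Doc2 content are all pending.
store.ingest("Doc1", defines=["HLR-1", "HLR-2"], refers_to=["SLR-1", "SLR-2"])
before_doc2 = store.unresolved()

# Doc2 arrives but only defines some of the referenced THINGs.
store.ingest("Doc2", defines=["SLR-1"], refers_to=["HLR-2"])
partial = store.unresolved()

# A complete Doc2 resolves everything, regardless of ingestion order.
store.ingest("Doc2", defines=["SLR-1", "SLR-2"], refers_to=["HLR-2"])
complete = store.unresolved()

# Doc1 is re-ingested without HLR-2, which Doc2 still references.
store.ingest("Doc1", defines=["HLR-1"], refers_to=["SLR-1", "SLR-2"])
dangling = store.unresolved()
```

Because resolution is recomputed from the stored references rather than fixed at ingestion time, the result does not depend on which document was ingested first.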

Approach

Requirements

A Candidate Approach: explicitly identifying external references

The approach shown below adds an extra class classifying entities which are known to refer outside of the thing being ingested at the moment. This class would provide a property to allow this external reference to be later resolved.

Notes:

  • It ensures that the full power of the ontology is available to external references. All the properties that would normally be available on a definition are available to the external references. These will be useful for resolving the match later.
  • It ensures that we can identify which things in the data are references that still need to be resolved.
  • It preserves information about references generated by previous ingestion processes which could make it easier to resolve references if the ingestion process is re-run.
  • It preserves information used to resolve references in the data which could increase confidence in the resolution.
  • References are responsible for pointing to their definitions. Having a reference doesn't need to be a property of a definition. Details about entity resolution are isolated to a separate (e.g. EXTERN) class.
  • If a document is re-ingested all EXTERN references from other data could either replace or add additional definedBy edges to the new document being ingested.
  • definedBy is not transitive nor symmetric. It doesn't form chains. It doesn't declare a strong identity assertion like OWL:sameAs which implies all properties apply to all linked instances.
uri "http://glguy.net/extern-references" alias REFS version "1".

// These get defined first

THING is a class.

TREE is a type of THING
  described by parent with values of type TREE
  described by mentions with values of type THING.

EXTERN is a class
  described by definedBy with values of type THING.

// Thing A gets ingested on its own

A_1 is a TREE.
A_1.1 is a TREE with parent A_1 with mentions A_2. // internal mentions can be direct
A_1.2 is a TREE with parent A_1 with mentions A_REF_B_1.2. // external mentions use an EXTERN
A_2 is a TREE.
A_REF_B_1.2 is a TREE.
A_REF_B_1.2 is an EXTERN.

// Thing B gets ingested on its own

B_1 is a TREE.
B_1.1 is a TREE with parent B_1 with mentions B_REF_A_1.1. // cyclic references between A and B
B_1.2 is a TREE with parent B_1.
B_REF_A_1.1 is a TREE.
B_REF_A_1.1 is an EXTERN.

// Entity resolution links A and B together

A_REF_B_1.2 definedBy B_1.2.
B_REF_A_1.1 definedBy A_1.1.

This approach has the additional benefit that external definitions can themselves form more complex graphs. This extra information can better enable us to resolve external entities, as we can capture the expected relationships of the external things.

In the example below we take advantage of a pattern seen in the Boeing data where external document references are declared at the top of the document. It would be easy to establish that all of a particular kind of item came from a particular externally defined but unresolved document.

uri "http://glguy.net/extern-references" alias REFS version "1".

// Minimal ontology

THING is a class.

DOCUMENT is a type of THING
  described by hadContent with values of type THING.

NODE is a type of THING
  described by mentions with values of type THING.

EXTERN is a class
  described by definedBy with values of type THING.

// Thing A gets ingested on its own

TEST_DOCUMENT_A is a DOCUMENT hadContent TEST_A.
TEST_A is a NODE mentions SOME_REQUIREMENT_1.
TEST_B is a NODE mentions SOME_REQUIREMENT_2.

// Document declares that it refers to a particular requirements document
// We can declare a subgraph of everything we know about the external fragments
// which could help us to resolve them later.
SOME_REQ_DOCUMENT_B is a DOCUMENT.
SOME_REQ_DOCUMENT_B is an EXTERN.
SOME_REQ_DOCUMENT_B hadContent SOME_REQUIREMENT_1 hadContent SOME_REQUIREMENT_2.
SOME_REQUIREMENT_1 is a NODE.
SOME_REQUIREMENT_1 is an EXTERN.
SOME_REQUIREMENT_2 is a NODE.
SOME_REQUIREMENT_2 is an EXTERN.

// Later document B gets ingested on its own

REQ_DOCUMENT_B is a DOCUMENT
  hadContent REQUIREMENT_1
  hadContent REQUIREMENT_2.
REQUIREMENT_1 is a NODE.
REQUIREMENT_2 is a NODE.

// Entity resolution links A and B together, aided by the
// fact that both should come from the same document
SOME_REQUIREMENT_1 definedBy REQUIREMENT_1.
SOME_REQUIREMENT_2 definedBy REQUIREMENT_2.
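
One way to picture how the document-level placeholder helps: resolve the EXTERN document first, then look for each EXTERN requirement only inside the document it resolved to. The sketch below assumes every item carries some identifier to match on (the `REQ-DOC-B` and `REQ-1` style values are invented), and `OTHER_DOCUMENT` is a hypothetical distractor added to show the scoping.

```python
# EXTERN placeholders from document A (identifier values are assumed for illustration).
extern_docs = {"SOME_REQ_DOCUMENT_B": "REQ-DOC-B"}
extern_nodes = {
    "SOME_REQUIREMENT_1": ("SOME_REQ_DOCUMENT_B", "REQ-1"),
    "SOME_REQUIREMENT_2": ("SOME_REQ_DOCUMENT_B", "REQ-2"),
}

# Ingested documents: name -> (document identifier, {content name: content identifier}).
ingested_docs = {
    "REQ_DOCUMENT_B": ("REQ-DOC-B", {"REQUIREMENT_1": "REQ-1", "REQUIREMENT_2": "REQ-2"}),
    "OTHER_DOCUMENT": ("OTHER-DOC", {"OTHER_REQUIREMENT_1": "REQ-1"}),
}


def resolve(extern_docs, extern_nodes, ingested_docs):
    """Return definedBy links, resolving documents first and their contents second."""
    defined_by = {}
    doc_by_identifier = {ident: name for name, (ident, _) in ingested_docs.items()}

    # Step 1: link each EXTERN document to the ingested document with the same identifier.
    for extern_doc, ident in extern_docs.items():
        if ident in doc_by_identifier:
            defined_by[extern_doc] = doc_by_identifier[ident]

    # Step 2: link each EXTERN node, searching only the contents of its resolved document.
    for node, (extern_doc, ident) in extern_nodes.items():
        real_doc = defined_by.get(extern_doc)
        if real_doc is None:
            continue  # the owning document is still unresolved, so leave the node pending
        _, contents = ingested_docs[real_doc]
        for candidate, candidate_ident in contents.items():
            if candidate_ident == ident:
                defined_by[node] = candidate
    return defined_by


links = resolve(extern_docs, extern_nodes, ingested_docs)
```

Here `OTHER_REQUIREMENT_1` shares the identifier `REQ-1` but is never considered, because the search is limited to the document the placeholder was declared under.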

A Candidate Approach: identifying external references via rule definition

This approach differs in that it does not require any additional information to be included for each THING at ingestion (although that could be done if desired). Instead, the user defines rules about certain aspects of the ontology, and semantic reasoning based on those rules adds information to the semantic model at ingestion. This does not preclude the user from explicitly defining the referencing properties if desired.

The same general approach could be used to infer additional information, allowing for simpler queries. An example relates to DO-178C objectives: rule-based inferences could identify whether an objective has been satisfied (i.e. all requirements have verification associated with them), and another rule could determine whether the verification was satisfied with independence. The benefit of using inference is that multiple discrete rules (e.g. one for testing, one for inspection, one for analysis) could be "merged" into a single property indicating that the verification objective has been met. I believe that SADL even has provisions for explaining how a rule was applied, which may be of use when explaining to a reviewer or auditor how the objective was satisfied, or when indicating to the development team why they have not yet met it.
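
The idea of merging several discrete rules into one property can be pictured in a few lines of Python. This is not SADL and not a statement of what DO-178C requires; the evidence data, the rule functions, and `objective_status` are all hypothetical names for the illustration.

```python
# Hypothetical evidence: requirement -> verification methods recorded for it.
evidence = {
    "REQ-1": {"test"},
    "REQ-2": {"inspection", "analysis"},
    "REQ-3": set(),
}


def verified_by_test(methods):
    return "test" in methods


def verified_by_inspection(methods):
    return "inspection" in methods


def verified_by_analysis(methods):
    return "analysis" in methods


# Each discrete rule stays small; the merged property is true when any of them fires.
RULES = (verified_by_test, verified_by_inspection, verified_by_analysis)


def is_verified(methods):
    return any(rule(methods) for rule in RULES)


def objective_status(evidence):
    """Return (satisfied, requirements that still lack verification)."""
    missing = sorted(req for req, methods in evidence.items() if not is_verified(methods))
    return (not missing, missing)


status = objective_status(evidence)
```

Returning the list of requirements that lack verification alongside the boolean is what would let a team see why the objective has not yet been met.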

I have been testing this out using SADL, but other means of rule definition may be possible. The rule definition is the tricky part, but it would get considerably easier with sub-classing (e.g. HLR is a type of REQUIREMENT). I created separate files just to make it easier to see how the different parts would be broken up. First, the base ontology:

uri "http://MultipleSourceDataResolution/BaseOnt".
// These get defined first

THING is a class
    described by identifier with values of type string
    described by isPrimal with values of type boolean
    described by definedBy with values of type THING.


TREE is a type of THING
  described by parent with values of type TREE
  described by mentions with values of type THING.

Next the two source files:

uri "http://MultipleSourceDataResolution/File1".
import "http://MultipleSourceDataResolution/BaseOnt".

A_1 is a TREE.
A_1.1 is a TREE with parent A_1 with mentions A_2. // internal mentions can be direct
A_1.2 is a TREE with parent A_1 with mentions A_REF_B_1.2.
/* external mentions use an item that is ultimately coming from a different
file but at this point it is just another TREE as far as the ingestion is concerned.*/
A_2 is a TREE.
A_REF_B_1.2 is a TREE.

uri "http://MultipleSourceDataResolution/File2".
import "http://MultipleSourceDataResolution/BaseOnt".

B_1 is a TREE.
B_1.1 is a TREE with parent B_1 with mentions B_REF_A_1.1. // cyclic references between A and B
B_1.2 is a TREE with parent B_1.
B_REF_A_1.1 is a TREE. // another reference back to an entity in the first file.

Finally the rule definitions:

uri "http://MultipleSourceDataResolution/IngestionRules".
import "http://MultipleSourceDataResolution/File1".
import "http://MultipleSourceDataResolution/File2".

/* The rule for identifying which TREE should be the Prime.
   In this case, the fact that a TREE has a parent TREE is used
   as the indication that it is the "Prime" tree. While not necessarily
   the best indicator, this works for this demonstration. */

Rule Primacy:
    given
        t1 is a TREE and
        t2 is a TREE
    if
        parent of t1 is t2
    then
        isPrimal of t1 is true.
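
Read procedurally, the Primacy rule marks any TREE whose parent is also a TREE. The following Python sketch (not SADL, and not part of the ingestion itself) applies that reading to the parent relationships from File1 and File2; the names `trees`, `parent` and `primal_trees` are made up for the illustration.

```python
# Instances and parent relationships copied from File1 and File2 (child -> parent).
trees = [
    "A_1", "A_1.1", "A_1.2", "A_2", "A_REF_B_1.2",
    "B_1", "B_1.1", "B_1.2", "B_REF_A_1.1",
]
parent = {"A_1.1": "A_1", "A_1.2": "A_1", "B_1.1": "B_1", "B_1.2": "B_1"}


def primal_trees(trees, parent):
    """A TREE is treated as primal when it has a parent that is itself a TREE."""
    known = set(trees)
    return {tree for tree in trees if parent.get(tree) in known}


primal = primal_trees(trees, parent)
```

Note that the root trees and the reference placeholders have no parent, so this heuristic leaves them unmarked.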

This inference can be shown with the following queries (run by selecting SADL->Test Model):

Print:"======== Identify All Trees ========".
Ask:
    select t
    where
        t is a TREE .

Print:"======== Identify Prime Trees ========".
Ask:
    select t
    where
        t is a TREE and
        isPrimal of t is true.

The resulting output is:

======== Identify All Trees ========
Query: select ?t where {?t <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE>}
     "t"
     "B_REF_A_1.1"
     "B_1.2"
     "B_1.1"
     "B_1"
     "A_REF_B_1.2"
     "A_2"
     "A_1.2"
     "A_1.1"
     "A_1"
======== Identify Prime Trees ========
Query: select ?t where {?t <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE> . ?t <http://MultipleSourceDataResolution/BaseOnt#isPrimal> 'true'^^<http://www.w3.org/2001/XMLSchema#boolean>}
     "t"
     "B_1.2"
     "B_1.1"
     "A_1.2"
     "A_1.1"

You can include "definedBy" as you did, by simply adding the relations as new explicit relationships (I had to use a new file so that I could import from both File1 and File2 while avoiding a circular reference):

uri "http://MultipleSourceDataResolution/ExpilictDefinitions".
import "http://MultipleSourceDataResolution/File1".
import "http://MultipleSourceDataResolution/File2".

A_REF_B_1.2 has definedBy B_1.2.
B_REF_A_1.1 has definedBy A_1.1.

I would rather use the identifier property to define a new rule. First, identifiers need to be created by appending identifier data to File1 and File2.

A_1 has identifier "A_1".
A_1.1 has identifier "A_1.1".
A_1.2 has identifier "A_1.2".
A_2 has identifier "A_2".
A_REF_B_1.2 has identifier "B_1.2".
B_1 has identifier "B_1".
B_1.1 has identifier "B_1.1".
B_1.2 has identifier "B_1.2".
B_REF_A_1.1 has identifier "A_1.2".

Now this identifier data can be used to create an inference rule for the "definedBy" property in the ingestion rules file.

Rule DefinedBy:
    given
        t1 is a TREE and
        t2 is a TREE
    if
        identifier of t1 is identifier of t2 and
        isPrimal of t1 is true
    then
        definedBy of t2 is t1.

Print:"======== Identify Related Trees ========".
Ask:
    select Reference, Original
    where
        Reference is a TREE and
        Original is a TREE and
        Reference != Original and
        definedBy of Reference is Original.

This query now finds the expected relationships:

======== Identify Related Trees ========
Query: select ?Reference ?Original where {?Reference <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE> . ?Original <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE> . ?Reference <http://MultipleSourceDataResolution/BaseOnt#definedBy> ?Original . FILTER (?Reference != ?Original)}
     "Reference","Original"
     "B_REF_A_1.1","A_1.2"
     "A_REF_B_1.2","B_1.2"

My thought here is that you could also add ingestion reports to find things like unresolved references. For example, suppose there is a slight change in the identifier of one of the reference TREEs:

B_REF_A_1.1 has identifier "A1.2".

This reference is now missing from the results of the previous query:

======== Identify Related Trees ========
Query: select ?Reference ?Original where {?Reference <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE> . ?Original <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE> . ?Reference <http://MultipleSourceDataResolution/BaseOnt#definedBy> ?Original . FILTER (?Reference != ?Original)}
     "Reference","Original"
     "A_REF_B_1.2","B_1.2"

You should also be able to write a query for finding unmatched references:

Print:"======== Identify Unresolved Trees ========".
Ask:
    select Reference
    where
        Reference is a TREE and
        isPrimal of Reference is false and
        definedBy of Reference is not known.

This should produce the following results:

======== Identify Unresolved Trees ========
Query: select ?Reference where {?Reference <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE>
. ?Reference <http://MultipleSourceDataResolution/BaseOnt#isPrimal> 'false'^^<http://www.w3.org/2001/XMLSchema#boolean>
. OPTIONAL {?Reference <http://MultipleSourceDataResolution/BaseOnt#definedBy> ?v1} . FILTER (!bound(?v1))}
     "Reference"
     "B_REF_A_1.1"
     "B_1"
     "A_2"
     "A_1"

Note: there is a SADL bug in translating this query, but it can be worked around by supplying corrected SPARQL directly:

Ask:"select ?Reference where {?Reference <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://MultipleSourceDataResolution/BaseOnt#TREE>
. ?Reference <http://MultipleSourceDataResolution/BaseOnt#isPrimal> 'false'^^<http://www.w3.org/2001/XMLSchema#boolean>
. OPTIONAL {?Reference <http://MultipleSourceDataResolution/BaseOnt#definedBy> ?v1} . FILTER (!bound(?v1))}".
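
For readers who want to trace the DefinedBy rule and the unresolved-reference report by hand, the logic can be mirrored in plain Python. This is a sketch outside SADL: the `resolve` function is hypothetical, the identifier values are chosen to reproduce the printed results, and primal trees are skipped rather than linked to themselves and filtered out afterwards.

```python
# Identifier data for the nine trees; the primal set is what the Primacy rule infers.
IDENTIFIER = {
    "A_1": "A_1", "A_1.1": "A_1.1", "A_1.2": "A_1.2", "A_2": "A_2",
    "A_REF_B_1.2": "B_1.2",
    "B_1": "B_1", "B_1.1": "B_1.1", "B_1.2": "B_1.2",
    "B_REF_A_1.1": "A_1.2",
}
PRIMAL = {"A_1.1", "A_1.2", "B_1.1", "B_1.2"}


def resolve(identifier, primal):
    """Return (definedBy links, unresolved trees) for the non-primal trees."""
    primal_by_identifier = {identifier[tree]: tree for tree in primal}
    defined_by, unresolved = {}, []
    for tree, ident in identifier.items():
        if tree in primal:
            continue  # primal trees are definitions, not references
        if ident in primal_by_identifier:
            defined_by[tree] = primal_by_identifier[ident]
        else:
            unresolved.append(tree)
    return defined_by, sorted(unresolved)


links, pending = resolve(IDENTIFIER, PRIMAL)

# The same data after the slight identifier change to B_REF_A_1.1.
changed = {**IDENTIFIER, "B_REF_A_1.1": "A1.2"}
links_changed, pending_changed = resolve(changed, PRIMAL)
```

With the changed identifier, `B_REF_A_1.1` moves from the links to the pending list, which is the kind of finding an ingestion report could surface to the user.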