Combine Entities - ge-semtk/semtk GitHub Wiki

SemTK provides support for combining entities as part of an overall entity-resolution pipeline. The logic for deciding which entities are equal is left to domain-specific tools. SemTK provides Combine Entities functionality to:

  1. store relationships showing identical entities
  2. combine the entities

Storing SameAs relationships

SemTK provides a simple model in EntityResolution.sadl and EntityResolution.owl.

An instance of the SameAs class can be ingested to declare two other instances the same. It contains the properties:

  • target which indicates the "main" instance
  • duplicate which indicates a "duplicate" instance

An application that performs entity resolution may ingest instances of SameAs using the normal SemTK ingestion tools. The class may be extended with a subclass that carries additional information, but be aware that the SameAs instance, including any such additional information, is deleted during the combining process. The target and duplicate properties may also be extended with sub-properties for clarity.

SameAs rules and chaining

An error occurs if the target's type is not the same as, or a subclass of, the duplicate instance's type (that is, subclass* in the SPARQL property-path sense: zero or more subclass steps).

SameAs relationships may be chained, but these are considered errors:

  • An instance is the duplicate of two different targets
  • A chain of SameAs relationships is circular
  • A SameAs instance does not have exactly 1 target and 1 duplicate (a cardinality violation)
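The three error conditions above can be checked before combining. The following is only an illustrative sketch, not SemTK's implementation: the function name find_sameas_errors and the plain-dictionary input format are assumptions made for this example, and the type-compatibility rule is not checked here.

```python
from collections import defaultdict


def find_sameas_errors(same_as):
    """Return a list of error strings for a set of SameAs instances.

    same_as is a hypothetical in-memory form: a dict mapping each SameAs id
    to a (targets, duplicates) pair, where each element is a list of ids.
    """
    errors = []
    target_of = defaultdict(set)  # duplicate id -> set of target ids

    # Cardinality: each SameAs needs exactly 1 target and 1 duplicate.
    for sid, (targets, duplicates) in same_as.items():
        if len(targets) != 1 or len(duplicates) != 1:
            errors.append(f"{sid}: expected exactly 1 target and 1 duplicate")
            continue
        target_of[duplicates[0]].add(targets[0])

    # An instance may not be the duplicate of two different targets.
    for dup, targets in target_of.items():
        if len(targets) > 1:
            errors.append(f"{dup}: duplicate of multiple targets {sorted(targets)}")

    # Circular chains: follow duplicate -> target links until they end or repeat.
    for start in target_of:
        seen = set()
        node = start
        while node in target_of and len(target_of[node]) == 1:
            if node in seen:
                errors.append(f"{start}: circular chain")
                break
            seen.add(node)
            node = next(iter(target_of[node]))

    return errors
```

A plain chain such as C being a duplicate of B, and B a duplicate of A, produces no errors, while a pair of SameAs instances that point at each other is reported as circular.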

Combine Entities

Combining entities is currently accessed through semtk-python3.combine_entities_in_conn() or the REST API endpoint /nodeGroupExecution/dispatchCombineEntitiesInConn.

Combining entities occurs in passes, where each pass consists of all SameAs instances whose duplicate is not also a target of another SameAs. The passes continue until no more SameAs instances meeting this criterion are found. If any SameAs instances still exist at that point, they are reported as an error, as they must violate the cardinality or chaining rules.
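The pass ordering described above can be sketched on plain (target, duplicate) pairs. This is a simplified illustration under stated assumptions, not the SemTK code: plan_passes is a hypothetical name, and each SameAs is reduced to a pair of ids that is assumed to already satisfy the 1-target, 1-duplicate cardinality.

```python
def plan_passes(pairs):
    """Group (target, duplicate) pairs into passes.

    Returns (passes, leftover). Each pass holds every pair whose duplicate is
    not a target of any remaining pair. Leftover pairs could never be
    scheduled, for example because they form a circular chain.
    """
    remaining = list(pairs)
    passes = []
    while remaining:
        # Simplification: a pair's own target is included here, so a SameAs
        # whose target and duplicate are the same id also ends up as leftover.
        targets = {t for t, _ in remaining}
        ready = [(t, d) for t, d in remaining if d not in targets]
        if not ready:
            break  # nothing can be combined; the rest is an error
        passes.append(ready)
        remaining = [p for p in remaining if p not in ready]
    return passes, remaining


# Chain: C is a duplicate of B, and B is a duplicate of A.
passes, leftover = plan_passes([("A", "B"), ("B", "C")])
print(passes)    # the (B, C) pair is combined first, then (A, B)
print(leftover)  # empty for a valid chain
```

Working from the end of a chain inward means every duplicate has already absorbed its own duplicates before it is merged into its target.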

For each SameAs the combination process is:

  1. Delete the duplicate instance's type relationships
  2. Delete any triple from duplicate where adding it to target would violate a cardinality constraint. For example, if the class has a property "name" with cardinality 1 and both duplicate and target have names, the duplicate's name is deleted and the target's name is retained
  3. Copy all remaining triples whose subject or object is duplicate, replacing duplicate with target. These are the triples that occur for only one of target and duplicate, or whose predicate allows multiple values
  4. Delete those triples where the subject or object is duplicate
  5. Delete any triples containing the SameAs instance as the subject
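The five steps above can be illustrated on an in-memory set of (subject, predicate, object) tuples. This is a minimal sketch rather than the actual SemTK implementation, which operates on a triple store: the combine function, the "rdf:type" string, and the max_card dictionary (a stand-in for the model's cardinality constraints) are all assumptions made for this example. When only some of the duplicate's values fit under a limit, the choice of which to keep is arbitrary here.

```python
RDF_TYPE = "rdf:type"  # assumed label for the type predicate


def combine(triples, same_as, target, duplicate, max_card):
    """Merge duplicate into target and return the new set of triples.

    max_card maps a predicate to its maximum cardinality (hypothetical).
    """
    triples = set(triples)

    # 1. Delete the duplicate instance's type relationships.
    triples -= {t for t in triples if t[0] == duplicate and t[1] == RDF_TYPE}

    # 2. Delete duplicate triples that would push target over a cardinality limit.
    for pred, limit in max_card.items():
        dup_vals = sorted(t for t in triples if t[0] == duplicate and t[1] == pred)
        room = limit - sum(1 for t in triples if t[0] == target and t[1] == pred)
        triples -= set(dup_vals[max(room, 0):])

    # 3. Copy remaining triples that mention duplicate so they mention target.
    moved = {
        (target if s == duplicate else s, p, target if o == duplicate else o)
        for s, p, o in triples
        if duplicate in (s, o)
    }

    # 4. Delete the triples whose subject or object is duplicate.
    triples = {t for t in triples if duplicate not in (t[0], t[2])} | moved

    # 5. Delete triples that have the SameAs instance as the subject.
    return {t for t in triples if t[0] != same_as}


before = {
    ("req1", RDF_TYPE, "SubDD_Req"),
    ("req1", "identifier", "REQ-1"),
    ("req2", RDF_TYPE, "REQUIREMENT"),
    ("req2", "identifier", "Req 1"),
    ("req2", "wasImpactedBy", "change1"),
    ("sa1", "target", "req1"),
    ("sa1", "duplicate", "req2"),
}
after = combine(before, "sa1", "req1", "req2", {"identifier": 1})
```

In this made-up data, after contains only req1 triples: its original type and identifier plus the wasImpactedBy triple moved over from req2.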

Example

Consider this example of two entities connected by a SameAs, and consider that:

  • cardinality of identifier is one
  • SubDD_Req is a subclass of REQUIREMENT

Before

The duplicate version of the instance was created with the base class REQUIREMENT and an identifier string that uses different punctuation, abbreviations, etc.

After

After combining entities, there is a single instance of the SubDD_Req subclass:

  • all three dataInsertedBy relationships are retained
  • the wasImpactedBy relationship is moved to the target instance
  • the duplicate identifier is removed, as combining it would violate the cardinality of 1
  • the duplicate (super)type is removed
  • the SameAs instance is removed