Ingest by Class Template - ge-semtk/semtk GitHub Wiki

SemTK can generate an ingestion template for each class, which supports a simplified and standardized way of ingesting data one class at a time. The advantage of this strategy is that no ingestion templates need to be created manually. The trade-off is that it is far less flexible.

In general, ingestion by class template requires that:

  1. each class has a data property that serves as a unique identifier (by default, a data property matching the regex "identifier")
  2. the ingestion process is ordered so that instances are ingested before another class's ingestion references them as the object of a triple

In SPARQLgraph, ingestion by class template is accessed through the Nodegroup->get class template menu.

Parameters

idRegex - a regular expression that should match the data property that uniquely identifies an instance of the class. The default is "identifier".

dataClassRegex - a regular expression that should match classes which are treated as data.
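
As a rough illustration of how a parameter like idRegex could pick out an identifier property, the sketch below applies a regular expression to a list of property names. The function name, the property names, and the use of Python's `re.search` (a match anywhere in the name) are assumptions made for this example; SemTK's own matching rules may differ.

```python
import re

def find_id_property(property_names, id_regex="identifier"):
    """Return the first property name matched by id_regex, or None.

    Hypothetical helper: it only illustrates the idea of selecting
    an identifier property by regular expression.
    """
    pattern = re.compile(id_regex)
    for name in property_names:
        if pattern.search(name):
            return name
    return None

# Hypothetical property names
print(find_id_property(["birthdate", "identifier", "father"]))
print(find_id_property(["assemblyDate", "batteryId"], id_regex="batteryId"))
```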

These parameters make more sense once the template-building algorithm is explained.

Template building algorithm

The general idea is to automatically generate a nodegroup to ingest every property of a class.

  • the main class is looked up by a property matching idRegex with the mode CREATE_IF_MISSING
  • data properties will have a column with their keyname, e.g. "birthdate"
  • object properties will have a column named with the keyname + "_" + the property used to identify the instance to be connected, e.g. "father_name" where the class has an object property "father" and idRegex matches "name", which uniquely identifies instances of Person
  • all CSV fields ignore spaces and the word "null"
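
The column-naming rules above can be sketched as follows. The `MODEL` dictionary and both function names are hypothetical and exist only for this example; only the naming convention (data property keyname, and object property keyname + "_" + identifier property) comes from the description above.

```python
import re

# Hypothetical model description:
# class name -> data property names, and object property name -> range class
MODEL = {
    "Person": {
        "data": ["name", "birthdate"],
        "object": {"father": "Person"},
    },
}

def id_property(model, class_name, id_regex):
    """Find the data property of class_name that matches id_regex."""
    for prop in model[class_name]["data"]:
        if re.search(id_regex, prop):
            return prop
    raise ValueError(f"{class_name} has no property matching {id_regex!r}")

def template_columns(model, class_name, id_regex):
    """List the CSV column names for a class template."""
    # Data properties get a column with their own keyname.
    columns = list(model[class_name]["data"])
    # Object properties get keyname + "_" + the identifier property
    # of the class being connected.
    for prop, range_class in model[class_name]["object"].items():
        columns.append(f"{prop}_{id_property(model, range_class, id_regex)}")
    return columns

print(template_columns(MODEL, "Person", "name"))
```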

When ingesting object properties:

  • the instance is looked up with the mode ERROR_IF_MISSING
  • the class of the instance being connected must contain a property that matches idRegex
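
The difference between the two lookup modes can be illustrated with a small in-memory simulation. This is not SemTK code: the `lookup` function and the dictionary-based store are assumptions made for the example, and only the mode names come from the text.

```python
def lookup(store, class_name, identifier, mode):
    """Look up an instance by identifier in a toy in-memory store."""
    key = (class_name, identifier)
    if key in store:
        return store[key]
    if mode == "CREATE_IF_MISSING":
        # The main class of a template: create the instance when absent.
        store[key] = {"class": class_name, "id": identifier}
        return store[key]
    if mode == "ERROR_IF_MISSING":
        # An instance referenced through an object property must already exist.
        raise LookupError(f"{class_name} '{identifier}' not found")
    raise ValueError(f"unknown mode: {mode}")

store = {}
lookup(store, "Person", "Alice", "CREATE_IF_MISSING")      # creates Alice
try:
    lookup(store, "Person", "Bob", "ERROR_IF_MISSING")     # Bob was never ingested
except LookupError as err:
    print("ingestion would fail:", err)
```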

To avoid problems with objects that do not yet exist, it is common practice to ingest all objects in a first pass, using CSVs that contain only the identifier properties of each instance. In a second pass, the full CSVs are ingested.
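
One way to prepare the first pass is to derive an identifier-only CSV from each full CSV. The sketch below uses only Python's standard `csv` module; the function name, column names, and sample rows are made up for illustration, and the actual ingestion calls are left out.

```python
import csv
import io

def identifier_only_csv(full_csv_text, id_column):
    """Return CSV text containing only the identifier column."""
    reader = csv.DictReader(io.StringIO(full_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column])
    for row in reader:
        writer.writerow([row[id_column]])
    return out.getvalue()

# Hypothetical full CSV for a Person class template
full_csv = (
    "name,birthdate,father_name\n"
    "Alice,1990-01-01,Bob\n"
    "Bob,1960-05-05,\n"
)

first_pass = identifier_only_csv(full_csv, "name")   # ingest this first
second_pass = full_csv                               # then ingest the full CSV
print(first_pass)
```

After the first pass every Person exists, so the "father_name" lookups in the second pass can find their targets regardless of row order.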

Enumerated classes

Enumerated classes are treated separately:

  • they are always treated as data. e.g. if Person has an object property "favoriteColor" with a range of the enumerated class "Color", the ingestion template would have the column "favoriteColor_Color"
  • if they also have an identifier property matching idRegex, enumerated classes will also have an input column to look up by identifier.

If an enumerated class also has an identifier for lookup, the CSV template will contain columns for both the enumerated class URI and the id lookup. Either or both may be used.

For related details, consult the Ingesting CSV Data wiki page.

dataClassRegex parameter details

Classes that match dataClassRegex are treated as data:

  • they do not need to contain a property matching idRegex
  • they are never looked up, always created
  • each data property has its own column in the ingestion CSV. e.g. if Measurement has data properties val and unit, Measurement matches dataClassRegex, and Part has a property thickness with values of type Measurement, then the CSV template will contain the columns "thickness_val" and "thickness_unit"
  • object properties of classes matching dataClassRegex are ignored unless they are enumerated

In the complex case where unit is an object property whose range is the enumerated class Unit, the column name would be "thickness_unit_UNIT".
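
The simple (non-enumerated) case of the Measurement example can be sketched like this. The `DATA_PROPS` table and the function name are hypothetical, and `re.search` is an assumption about how the regex is applied; the sketch deliberately leaves out enumerated object properties.

```python
import re

# Hypothetical model: class name -> its data property names
DATA_PROPS = {"Measurement": ["val", "unit"]}

def flattened_columns(prop_name, range_class, data_class_regex):
    """Columns contributed by a property whose range is a data class."""
    if not re.search(data_class_regex, range_class):
        return []  # not a data class; other rules apply
    # Each data property of the data class gets its own column,
    # prefixed by the property that points to the data class.
    return [f"{prop_name}_{p}" for p in DATA_PROPS[range_class]]

print(flattened_columns("thickness", "Measurement", "Measurement"))
```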

Example

In this simple model, Color is an enumerated type: image

Open the Nodegroup->get class template menu and request the class template for DuraBattery, setting "batteryId" as the identifier and, for the sake of example, making Cell a data class.

image

The resulting nodegroup has Battery with all four Cell branches, two of which are shown here for clarity: image

And the ingestion template looks like this: image

Summary description:

  • Battery is looked up by "batteryId" with the mode CREATE_IF_MISSING
  • Battery's data properties "assemblyDate" and "batteryDesc" get CSV columns that match their names
  • Cell is treated as data in this example, so we get the columns "cell1_cellId", "cell2_cellId"... which create cells, link them to Battery via cell1 through cell4, and populate their "cellId" fields.
  • since Cell's object property Color points to an enumerated type, Colors are connected using the columns "cell1_Color", "cell2_Color"...
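
To make the summary concrete, the sketch below writes a one-row CSV using the column names listed above. The column order, the extension of the pattern to cell3 and cell4, and every value in the row are assumptions made up for illustration; in particular, the Color values are placeholders rather than real enumerated class URIs.

```python
import csv
import io

# Columns named in the summary; the order here is an assumption.
columns = ["batteryId", "assemblyDate", "batteryDesc"]
for n in range(1, 5):
    columns += [f"cell{n}_cellId", f"cell{n}_Color"]

# Made-up sample values
row = {
    "batteryId": "battA",
    "assemblyDate": "2017-03-23",
    "batteryDesc": "example battery",
}
for n in range(1, 5):
    row[f"cell{n}_cellId"] = f"cell{n}00"
    row[f"cell{n}_Color"] = "red"  # placeholder for an enumerated Color value

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
writer.writeheader()
writer.writerow(row)
csv_text = out.getvalue()
print(csv_text)
```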