Ingesting CSV Data - ge-semtk/semtk GitHub Wiki

The Import Process

Importing starts with building a nodegroup on the SPARQLgraph Query tab to represent the graph pattern to be imported. Then a sample CSV file can be dragged onto the Map Input tab, and the CSV columns can be mapped to elements in the nodegroup. Data validation and transformations, as well as URI lookup logic can be specified. Once complete, this mapping becomes part of the nodegroup.

An import starts with a pre-check pass through the data:

The process proceeds with the ingestion pass:

  • Pruning of the nodegroup: Before ingesting a record, the nodegroup is pruned such that nodes are removed iff:

    • the node has no URI generated by this record
    • none of the node's properties have values generated by this record
    • the node is not connecting two un-prunable nodes

    Remember, that after pruning, if a node survives with an empty URI, a GUID will be generated.

  • rows of data are processed independently, each row of data will generate the triples required to form a copy of the nodegroup. For ingestion, the nodegroup has all optionals and constraints stripped out.

    • data properties and URIs are filled with mapped values

    • object property connections are built

    • pruning removes any sub-graphs in the nodegroup that have no data in them

  • the triples are ingested in parallelized batches.

The Basics of Mapping

The SPARQLgraph Import Map tab provides tools to build a mapping between a tablular dataset like .CSV and the nodegroup. This mapping can be used to ingest data. It is saved with the nodegroup.

The left side of the screen shows each item in the nodegroup, along with a box which can contain a list of mapping items.

  • node names - bolded rows represent the URI that will be assigned to a node's name.

  • properties - non-bolded rows are the node's properties. Normally, these are integers, strings, dates, etc. whose types fall outside the 'domain' set up for your SPARQL connection, and are treated as DataProperties.

  • node connections - are not shown. They will be created automatically, and then subjected to the pruning process described below.

The right side of the screen lists mapping items which can be dragged over to the left:

  • column names - the list is generated by dragging a .csv to the target area at the top. Column mapping items will resolve to the value of this column in the input data

  • transforms - can be dropped on a column name to perform a transformation

  • text items - plain text

To build an import mapping, drag columns into one or more of the mapping rows on the left. Columns can be interspersed with text items. Columns can also have transforms applied to them to clean up incoming data.

Data Validation

Data validation is performed in its own first pass over the input data. This occurs before any ingestion begins. If any errors are found, the entire ingestion process is halted and a table of errors is provided.

Data validation is configured using the check box buttons to the left of the column names.

For all validations, column names are type insensitive. Further, leading and trailing whitespace is stripped from all column names and cell values before validation.

For regular expressions note that they apply to the entire cell. They are applied with the Java match() function, and not the find() function. Try this Java regex tester and documentation page.

The following data validations are available:

  • mustExist - column must be present.
  • notEmpty - the column must exist AND always contain a value.
  • regexMatches - if non-empty, the entire cell must match this regular expression after whitespace trimming.
  • regexNoMatch - if non-empty, the entire cell must not match this regular expression after whitespace trimming.
  • type - if non-empty, the cell must be one of these types:
    • int
    • float
    • date
    • datetime
    • time
  • lt, gt, lte, gte, ne - if non-empty, the cell must be < > <= >= or != the given value. These operations are only available after a type has been specified.
Note that an empty/missing cell will PASS any validation except **mustExist** or **nonEmpty**.

Parsing parameters of arithmetic validations

The parameter values provided for arithmetic operations must match the given type. For date, time, and datetime formats these must be ISO form but without timezones. Examples include:

  • 2020-05-20T16:31:19
  • 2020-01-20
  • 16:31:19

In contrast, the format of the types that must appear in the input data (and those which are checked during ingestion) both use the much broader type checking described in ingestion type handling.

Mapping URI's

URI's are constructed using additional rules

For non-enumerated ("normal case") URI's:

  • if the import map blank, a UUID will be generated AFTER the node survives pruning. This is the preferred method for generating URI's.
  • if not blank, the import specification is processed as usual except for empty inputs.
  1. If all input columns are empty, the import map for this URI is considered empty regardless of whether text items occur. The URI will be processed just like a blank mapping, and assigned a GUID if it survives pruning.

  2. If some, but not all, columns are empty then an error is generated to catch this potentially dangerous URI collision scenario

enumerations

If the node's class is subject to owl:oneOf, the import specification must resolve to a valid value, either:

  • the full URI of a valid value of the enumerated type
  • an un-prefixed fragment that matches a valid value. In this case the ingestion engine will use the full prefixed URI.

Empty columns do not cause special error handling. They simple resolve to no text before the above rules are applied.

Looking up URI's

The default ingestion behavior is to create a new instance of each object being ingested. If the object's URI line in the ingestion template is empty, it defaults to --Generate UUID--

Alternatively URI's can be looked up using one or more keys in the input data.

To look up a URI, start by finding one or more data property items that you want to use to lookup up the object, and click the URI lookup button:

The resulting dialog will allow you to choose the item to be looked up. In this example ?battery_id is used to lookup ?DuraBattery

Now the URI lookup buttons for both ?battery_id and ?DuraBattery will be highlighted. Next click ?DuraBattery's URI lookup button. The resulting dialog shows:

  • ?DuraBattery is not being used to lookup any other URI's
  • ?DuraBattery is being looked up by ?battery_id
  • a lookup mode can be chosen

Lookup modes:

  • Error if Missing - if lookup succeeds, use the resulting object. Otherwise the entire ingestion fails.
  • Create if Missing - if lookup succeeds, use the resulting object. Otherwise create a new object.
  • Error if Exists - if the lookup succeeds, the entire ingestion fails. Otherwise create a new object.

The following apply:

  • multiple items from anywhere in the nodegroup may be used to look up a URI
  • a URI will be found if all the links to all of the lookup items match the values mapped from the input data
  • a lookup will fail if any of the lookup items evaluate to empty for any row of ingestion data, unless the URI being looked up is subsequently removed during "Pruning of the nodegroup" before the row is ingested

Under the hood, a lookup is performed with the following process:

  • create a copy of the nodegroup
  • set the target URI as the return value
  • map the csv data to a value for each lookup value, and constrain that item in the nodegroup to match
  • prune the nodegroup
  • generate and execute the select query
  • fail if multiple values are found. For zero or one, see the "Lookup mode" to determine the behavior

After a successful lookup, the URI which was found is used when the row of data is ingested.

Enumerated classes

If a class is enumerated, the csv input data may be only the final fragment of the URI or the entire URI.

Lookup and node URI used on same node

If a node is looked up in CREATE_IF_MISSING mode, and it also contains a mapping in the node URI row, the lookup is attempted first. If the lookup succeeds, the node URI mapping is ignored. This could mean the node has a different URI than expected in the ingestion template.

Important notes on URI lookup

All URI lookup is performed during the first pre-ingestion validation pass through the data. This pass handles:

  • checking the existence of URIs
  • throwing error-if-missing, or error-if-exists
  • creating new URIs on create-if-missing

If a URI is looked up from the same mapping in multiple rows of data that have the same values in all of the lookup fields, they will all end up referring to a single new URI.

Likewise, if the ingestion template contains two nodes with the same class, and each looks up the URI using the same logic, then lookups from the same mapping in different columns the URI will still be matched or created as a single URI.

Warning: 'create if missing' in multiple columns with URI lookup using different criteria

If the same class is looked up in mode 'create if missing' from different columns using different criteria, an ingestion warning is generated. In this situation, SemTK can not guarantee that it will not incorrectly create a duplicate (which should have matched both lookup criteria). You may choose to carefully review the ingestion template and continue. When in doubt, it is safer to split your ingestion process into two steps such that this condition does not occur.

Mapping Sub-types

If a class has sub-classes, the ingestion template will show a _type row, as seen for Battery here:

This row may be mapped such that it's value is a subType* of the class (in this case, of Battery). The value may be short-hand (e.g. "DuraBattery") if that resolves to a unique class. It may also be the full value (e.g. "http://kdl.ge.com/durabattery#DuraBattery"). The instance created by the given row of ingestion data will have the specified subtype instead of the superclass in the nodegroup.

If used for a URI Lookup, the default behavior is to lookup and match an item of this subclass (including subclasses of this subclass!). If the value of the mapping ends in an exclamation point, the lookup will only match the exact subtype specified.

Example csv might look like this:

sub-type, id
Battery!,MyBatteryId

The exclamation point has no effect on an instance that is created; it is only used during lookup to specify an exact match.

Other behaviors:

  • an empty subtype mapping, or mapping data will have no effect
  • otherwise, an error will be generated if a subtype mapping does not map to a valid subType* of the given nodegroup node

Mapping Properties

Non-bolded rows are properties.

  • if the mapping line is blank, the no value will be ingested for the property for any record
  • if the mapping line resolves to blank, no value will be ingested for the property for this record

Text and Transforms

Text Items

Text may be created and dragged into a row in any position.

Note the special case where the first text field in a URI row contains "#".  

This will become the URI prefix.

Transforms

Transforms may be dropped onto individual columns in order to clean up input data.

Hash

Applies SHA-1 hash.

This is meant to allow long text fields to be converted into a legal URI in a repeatable way such that different loads might access the same URI.

A 'z' is prepended to the hash as a convenience such that the resulting string will always begin with a letter and can be used as a URI.

Replace All

Applies Java String.replaceAll(arg1, arg2)

To Uppercase / To Lowercase

Applies the Java String.toUpperCase() or String.toLowerCase() function, respectively

Final type validation

The values assembled for ingestion are type-checked against the model. See ingestion type handling.

Ingestion warnings

An ingestion process can produce warnings if any of the conditions below are met. Depending upon how the ingestion process is launched, the warnings may be suppressed. Warning criteria are:

  • missing column - the CSV is missing a column in the ingestion template. As long as the column is not required for a URI lookup, this condition is not an error. It sometimes indicates a misspelling or other mistake in the CSV. It is often a normal and useful condition.
  • extra column - the CSV has an extra column that is not ingested. This may indicate a misspelling or other mistake in the CSV. It is often a normal condition. It is never an error.
  • create if missing in multiple columns... - the same class is looked up 'create if missing' in different places with different lookup criteria. This could result in unexpected duplicates being created if the lookup criteria could point to the same instance. It is recommended that the ingestion process be split into additional steps to avoid this condition, but it may be ignored if desired.
⚠️ **GitHub.com Fallback** ⚠️