Ingestion Type Handling - ge-semtk/semtk GitHub Wiki

The ingestion process reads strings from a data source such as CSV or ODBC, then maps, combines and transforms them into values for a SPARQL INSERT statement. Two types of checks are done at this stage:

type checking
values needed for URILookup must not be empty

This page explains how the final values are converted to the proper type as specified by the model, and how type-checking is performed.

Data conversion is focused on URIs and W3C xsd primitive data types. The following sections describe how each is parsed and ingested.

Strings

Strings are read in conformance with the CSV spec, with an attempt to handle embedded newlines in a way that is compatible with Excel. (Excel has been observed to handle embedded newlines in a hard-to-specify way when CRLF characters are not where it expects them, and likewise for "" sequences that are not inside a double-quoted cell.)

Below is a sample CSV which demonstrates many basic principles of escaping strings in CSV:

line 1 is the column headers
line 2 column 1: shows quoted string with escaped quotes "" and comma inside the quotes
line 2 column 2: shows one embedded line return, plus the escape sequences \n and \\n
line 3: shows embedded tab and escape sequence \t, and the use of unquoted strings which contain no commas nor quotes

str1,str2
"notepad what is ""this,"" here","notepad line one
backslash-n-follows\nrest2 double-backslash-n-follows\\nrest3"
tab	tab,back-t\tback-t

After using the Excel-CSV rules to read a string, SemTK applies these additional escaping rules in order to satisfy the W3C String Literal Spec:

single and double-quotes, line feeds and carriage returns are escaped
any backslash is escaped UNLESS it is a \u0000 or \U00000000 style unicode escape sequence
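
The two escaping rules above can be sketched roughly as follows. This is an illustrative approximation, not SemTK's actual code: the class name `LiteralEscaper` and the method `escape` are hypothetical, and the sketch assumes a unicode escape is a backslash followed by `u` plus exactly 4 hex digits, or `U` plus exactly 8 hex digits.

```java
final class LiteralEscaper {

    // True if the backslash at backslashPos starts a unicode escape:
    // 'u' + 4 hex digits or 'U' + 8 hex digits (assumed definition).
    private static boolean isUnicodeEscape(String s, int backslashPos) {
        int next = backslashPos + 1;
        if (next >= s.length()) {
            return false;
        }
        int digits;
        char marker = s.charAt(next);
        if (marker == 'u') {
            digits = 4;
        } else if (marker == 'U') {
            digits = 8;
        } else {
            return false;
        }
        int end = next + 1 + digits;
        if (end > s.length()) {
            return false;
        }
        for (int i = next + 1; i < end; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Escapes quotes, line feeds, carriage returns and lone backslashes.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\'': out.append("\\'");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\\':
                    // Leave the backslash alone only when it begins a unicode escape.
                    out.append(isUnicodeEscape(raw, i) ? "\\" : "\\\\");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a cell containing a literal backslash followed by `n` comes out with the backslash doubled, while a well-formed unicode escape passes through unchanged.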

Querying strings back through SemTK will then produce a CSV string that is equivalent to the one ingested.

Unicode sequences, however, are typically converted by the triplestore into the actual character. So ingesting "\u0048\u0065\u006c\u006c\u006f World" and querying it back out would produce "Hello World".

Note that non-ASCII characters and invalid unicode escape sequences in your input files may cause ingestion failures and/or strings that appear to print identically even though they are not equal.

Ingestion transforms or custom tools may be needed to clean up input files.

Example strings used for JUnit testing can be found in testStrings.csv and loadTestDuraBatteryExcelDescData.csv

URIs

Valid URIs

To be ingested as a URI, a string must pass these validations:

the Java URL(uri) constructor must succeed without throwing an exception
the first character of the local fragment must match [a-zA-Z0-9]
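
A minimal sketch of the two validations, for illustration only. The class `UriCheck` and method `isValidUri` are hypothetical names. The sketch assumes the local fragment is the text after the last "#", and that a URI with no "#" skips the second check; the real service may treat that case differently.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UriCheck {

    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated in recent JDKs
    static boolean isValidUri(String uri) {
        // Validation 1: the URL constructor must not throw.
        try {
            new URL(uri);
        } catch (MalformedURLException e) {
            return false;
        }

        // Validation 2: first character of the local fragment must match [a-zA-Z0-9].
        int hash = uri.lastIndexOf('#');
        if (hash < 0) {
            return true; // assumption: nothing to check when there is no fragment
        }
        if (hash == uri.length() - 1) {
            return false; // empty fragment has no first character
        }
        char first = uri.charAt(hash + 1);
        return (first >= 'a' && first <= 'z')
            || (first >= 'A' && first <= 'Z')
            || (first >= '0' && first <= '9');
    }
}
```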

Prefixing

First note that an ingestion template may use a text field ending in "#" to add a prefix to ingested data.

After the entire URI value is built, if it contains no prefix (i.e. the URI contains neither "#" nor "://"), a prefix is added according to the first of these rules that applies:

Enumerated - If a URI is being assigned to an instance of a Class that is specified by "owl:oneOf" (SADL "must be one of"), then input strings may be either the entire URI or the case-sensitive local fragment. Local fragments will be changed to the matching full URI by the ingestion service.
BaseURI - otherwise if the "Base URI" field is specified, it is prepended along with a "#"
Default - otherwise "http://semtk.research.ge.com/generated#" will be prepended.
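
The prefixing order above could be sketched as below. The names `UriPrefixer`, `applyPrefix` and the `enumMembers` map (short local form to full URI, or null when the class is not enumerated) are hypothetical. What the ingestion service does with an enumerated value that matches no member is not described here; this sketch simply throws.

```java
import java.util.Map;

final class UriPrefixer {

    static final String DEFAULT_PREFIX = "http://semtk.research.ge.com/generated#";

    // enumMembers: local form -> full URI for an owl:oneOf class, or null if not enumerated.
    // baseUri: the "Base URI" field, or null/empty if not specified.
    static String applyPrefix(String value, Map<String, String> enumMembers, String baseUri) {
        // A value that already carries a prefix is left untouched.
        if (value.contains("#") || value.contains("://")) {
            return value;
        }
        // Enumerated: case-sensitive lookup of the local form.
        if (enumMembers != null) {
            String full = enumMembers.get(value);
            if (full == null) {
                // Assumption: unmatched enumerated values are rejected.
                throw new IllegalArgumentException("Not a member of the enumeration: " + value);
            }
            return full;
        }
        // BaseURI: prepend the base URI and a "#".
        if (baseUri != null && !baseUri.isEmpty()) {
            return baseUri + "#" + value;
        }
        // Default prefix.
        return DEFAULT_PREFIX + value;
    }
}
```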

xsd:date

During ingestion, all xsd:date values are translated into a SPARQL INSERT query using the [ISO_LOCAL_DATE](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE) format: "2011-12-03".

The following formatters are tried in order, using Java LocalDate.parse(date, formatter):

"MM/dd/yyyy"
"MM-dd-yyyy"
"yyyy-MM-dd"
"dd-MMM-yyyy", case insensitive (e.g. 12-Jun-2008 or 12-JUN-2008)

xsd:dateTime

During ingestion, all xsd:dateTime values are translated into a SPARQL INSERT query using the [ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME) format: "2011-12-03T10:15:30+01:00".

To ingest dateTimes with timezone

"yyyy-MM-ddTHH:mm:ss+01:00" - ISO_OFFSET_DATE_TIME, e.g. "2020-03-23T23:59:59-4:00"(https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME)
"EEE MMM dd HH:mm:ss zzz yyyy" - produced by SADL, e.g. "Wed Mar 22 20:00:00 EST 2017"

To ingest dateTimes without timezone

The following formatters are then tried in order, using Java LocalDateTime.parse(date, formatter):

"MM/dd/yyyy HH:mm:ss"
"MM-dd-yyyy HH:mm:ss"
"yyyy/MM/dd HH:mm:ss"
"yyyy-MM-dd HH:mm:ss"
"dd-MMM-yyyy HH:mm:ss", case insensitive (e.g. 12-Jun-2008 05:00:00 or 12-JUN-2008 05:00:00)

To ingest plain date as a dateTime

All xsd:date formats are tried next. If parsed successfully, the date is inserted as a dateTime with no timezone and hours, minutes, seconds set to 0.
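
The fallback chain for xsd:dateTime might look something like the sketch below: offset form first, then the timezone-free patterns, then plain dates at midnight. `DateTimeParser` and `toXsdDateTime` are hypothetical names, `Locale.ENGLISH` is an assumption, and the SADL-style "EEE MMM dd HH:mm:ss zzz yyyy" pattern is left out for brevity.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;

final class DateTimeParser {

    private static DateTimeFormatter fmt(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);
    }

    private static final List<DateTimeFormatter> DATE_TIME = List.of(
            fmt("MM/dd/yyyy HH:mm:ss"), fmt("MM-dd-yyyy HH:mm:ss"),
            fmt("yyyy/MM/dd HH:mm:ss"), fmt("yyyy-MM-dd HH:mm:ss"),
            fmt("dd-MMM-yyyy HH:mm:ss"));

    private static final List<DateTimeFormatter> DATE_ONLY = List.of(
            fmt("MM/dd/yyyy"), fmt("MM-dd-yyyy"), fmt("yyyy-MM-dd"), fmt("dd-MMM-yyyy"));

    static String toXsdDateTime(String input) {
        // 1. Value with an explicit offset.
        try {
            return OffsetDateTime.parse(input, DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                    .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // not an offset dateTime; keep trying
        }
        // 2. Value without timezone.
        for (DateTimeFormatter f : DATE_TIME) {
            try {
                return LocalDateTime.parse(input, f).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next formatter
            }
        }
        // 3. Plain date: hours, minutes and seconds are set to 0.
        for (DateTimeFormatter f : DATE_ONLY) {
            try {
                return LocalDate.parse(input, f).atStartOfDay()
                        .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next formatter
            }
        }
        throw new IllegalArgumentException("Unrecognized dateTime: " + input);
    }
}
```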

Non-date primitive types

boolean:

Java Boolean.parseBoolean

decimal, double:

Java Double.parseDouble

duration:

Java Duration.parse

float:

Java Float.parseFloat

int, integer, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger:

Java Integer.parseInt

long:

Java Long.parseLong

unsignedByte:

Java Byte.parseByte

unsignedInt:

Java Integer.parseUnsignedInt
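
For a feel of how a parse call doubles as a type check, here is a small sketch covering a subset of the types above. `PrimitiveCheck` and `parses` are hypothetical names, and treating "the parser threw" as "type check failed" is an assumption about how the service uses these calls. One property of the JDK worth knowing: `Boolean.parseBoolean` never throws and returns false for anything other than "true" (ignoring case), so it cannot reject a value by itself.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

final class PrimitiveCheck {

    // Returns true if the Java parser for the given xsd type accepts the value.
    static boolean parses(String xsdType, String value) {
        try {
            switch (xsdType) {
                case "boolean":
                    Boolean.parseBoolean(value); // never throws
                    break;
                case "decimal":
                case "double":
                    Double.parseDouble(value);
                    break;
                case "float":
                    Float.parseFloat(value);
                    break;
                case "duration":
                    Duration.parse(value); // ISO-8601 form such as PT15M
                    break;
                case "int":
                case "integer":
                    Integer.parseInt(value);
                    break;
                case "unsignedInt":
                    Integer.parseUnsignedInt(value);
                    break;
                default:
                    throw new IllegalArgumentException("Type not covered by this sketch: " + xsdType);
            }
            return true;
        } catch (NumberFormatException | DateTimeParseException e) {
            return false;
        }
    }
}
```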

Simpler date and time related types

time:

Java Time.parse

gYearMonth:

"YYYY-MM"

gYear:

"YYYY"

gMonthDay:

"MM-dd"

Creating Literals

Guided by: W3C Matching Literals

During ingestion, SemTK creates literals following RDF 1.1.

strings are quoted and untyped: "example"
numeric values are untyped and unquoted: 42
boolean values are unquoted: true or false
dates and times are quoted and typed: "2012-02-02T02:00:00"^^XMLSchema:dateTime

The forms above should also be used in FILTER statements, but fully qualified strings will also match:

strings, e.g. "example"^^http://www.w3.org/2001/XMLSchema#string
numbers, e.g. "42"^^http://www.w3.org/2001/XMLSchema#integer
booleans, e.g. "false"^^http://www.w3.org/2001/XMLSchema#boolean
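
To make the short literal forms concrete, the sketch below renders a value for a handful of types. `LiteralBuilder` and `literal` are hypothetical names; the sketch assumes the value has already been escaped and normalized by the earlier steps, and it reuses the "XMLSchema:" prefix exactly as it appears in the example on this page.

```java
final class LiteralBuilder {

    // Renders an already-validated value in the short literal form for its xsd type.
    static String literal(String xsdType, String value) {
        switch (xsdType) {
            case "string":
                // quoted and untyped
                return "\"" + value + "\"";
            case "int":
            case "integer":
            case "long":
            case "float":
            case "double":
            case "decimal":
                // untyped and unquoted
                return value;
            case "boolean":
                // unquoted
                return value;
            case "date":
            case "dateTime":
                // quoted and typed
                return "\"" + value + "\"^^XMLSchema:" + xsdType;
            default:
                throw new IllegalArgumentException("Type not covered by this sketch: " + xsdType);
        }
    }
}
```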