Processing model - epimorphics/dclib GitHub Wiki

Basic processing model

context x template x CSV stream -> StreamRDF
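This pipeline can be pictured as a single function over the three inputs. A minimal Python sketch (the names and the triple-pattern representation are illustrative, not dclib's actual API; the real output target is Jena's StreamRDF):

```python
import csv
import io

def convert(context, template, csv_stream):
    """Apply a template to each CSV row, yielding triples (a stand-in for
    streaming to StreamRDF). The context supplies default variable bindings,
    which each row's columns extend or override."""
    reader = csv.DictReader(csv_stream)
    for row in reader:
        bindings = {**context.get("bindings", {}), **row}
        # A real template compiles patterns; here each template entry is a
        # (subject, predicate, object) pattern with {var} placeholders.
        for s, p, o in template:
            yield (s.format(**bindings),
                   p.format(**bindings),
                   o.format(**bindings))

# Illustrative use: one triple pattern applied to a two-row CSV
template = [("ex:{id}", "rdfs:label", "{name}")]
data = io.StringIO("id,name\n1,first\n2,second\n")
triples = list(convert({"bindings": {}}, template, data))
```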

Context

This provides:

  • templates: set of global templates that can be called by name
  • bindings: set of default global variable bindings
  • prefixes: default RDF prefix mapping
  • sources: set of reconciliation sources that can be referred to by templates
  • parent: the parent context, if any

In the command line tool there is no global context.

In a web service version we would expect a nest of contexts, with the default global context extended by a user- or project-specific context.
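Such a nest of contexts can be sketched as a chain of lookups that fall back to the parent, so the nearest binding wins. This is a hypothetical illustration, not dclib's actual Context class:

```python
class Context:
    """Nested contexts: a lookup checks local bindings first, then
    delegates to the parent context, so a project context can
    override the global defaults."""
    def __init__(self, bindings=None, parent=None):
        self.bindings = bindings or {}
        self.parent = parent

    def lookup(self, name):
        if name in self.bindings:
            return self.bindings[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise KeyError(name)

# Global defaults, extended by a project, extended by a user
global_ctx = Context({"baseURI": "http://example.com/"})
project_ctx = Context({"baseURI": "http://example.com/project/"}, parent=global_ctx)
user_ctx = Context({"lang": "en"}, parent=project_ctx)
```

A lookup of `baseURI` on `user_ctx` resolves to the project value, while `lang` comes from the user level and anything else falls through to the global context.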

Template

The template can be a composite template which allows for variable bindings and in-line definition of other sources and templates.

See Template language

Processing

If there are any one-off templates these are run first to generate global data/metadata.

Then the input is processed one row at a time to generate a set of output triples. Hierarchical templates may retain the state of the most recent parent at each level of the hierarchy.
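The per-level parent state kept by hierarchical templates can be illustrated as follows (a hypothetical sketch, not dclib's implementation; it assumes each row carries an explicit level number):

```python
def assign_parents(rows):
    """For each row, remember the last row seen at every shallower level,
    so a child row can be linked to its most recent parent."""
    last_at_level = {}
    for row in rows:
        level = row["level"]
        parent = last_at_level.get(level - 1)
        last_at_level[level] = row["id"]
        # Forget state from deeper levels once we move back up
        for deeper in [l for l in last_at_level if l > level]:
            del last_at_level[deeper]
        yield (row["id"], parent)

rows = [
    {"id": "a", "level": 0},
    {"id": "b", "level": 1},
    {"id": "c", "level": 1},
    {"id": "d", "level": 2},
]
links = list(assign_parents(rows))
```

Here `d` is linked to `c` (the most recent level-1 row), not `b`, which is the "last prior parent" behaviour described above.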

Only triples are supported at this point; there is no support for emitting multiple named graphs from a single process.

The outputs can be streamed or aggregated into an RDF model.

Web service packaging

Not implemented

Web invocation:

  • supply converter names and key-value pairs using query parameters
  • POST single csv file
  • POST single zip file containing source csv plus zero or more templates and auxiliary files
  • POST multi-part form data containing source csv plus zero or more auxiliary files

Return options:

  • directly return the converted RDF as Turtle (200), or a human-readable HTML page of conversion errors (400)
  • return the location of an asynchronous status monitor through which processing progress can be tracked and which eventually provides a URL for downloading the result

Command line packaging

java -jar dclib.jar [options] template.json  [aux-template.json ...] data.csv

Takes a template (which may be composite) and zero or more auxiliary templates which can be referred to by name from the main template, and transforms the given csv file, writing a ttl result to stdout (or a set of error messages).

java -jar dclib.jar [options] [--nThreads 4] [--compress] --batch batchFile

Takes a batchFile which takes the form:

template1.yaml data1.csv
template2.yaml data2.csv
...

Each line specifies one conversion job and generates an output file data1.ttl (or data1.ttl.gz if --compress is given; see also --ntriples below). The jobs are started in separate parallel threads (up to a default of 4, settable with --nThreads), which can improve processing time depending on the nature of the storage IO.
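The parallel batch behaviour can be sketched with a fixed-size thread pool (a hypothetical illustration in Python; `fake_convert` stands in for the real template conversion):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(jobs, run_job, n_threads=4):
    """Run each (template, csv) job on a pool of n_threads worker
    threads; results come back in job order. dclib's default pool
    size is 4, overridable with --nThreads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(run_job, jobs))

def fake_convert(job):
    # Stand-in for the real conversion: derive the output file name
    template, data = job
    return data.replace(".csv", ".ttl")

jobs = [("template1.yaml", "data1.csv"), ("template2.yaml", "data2.csv")]
outputs = run_batch(jobs, fake_convert, n_threads=4)
```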

In both cases [options] can include:

  • --debug: generate (verbose) debug information at each step
  • --streaming: stream the output; without this the data is generated in memory first and then written out
  • --ntriples: output N-Triples rather than Turtle; in batch mode output files have the suffix .nt or .nt.gz
  • --abortIfRowFails: abort the conversion if no template matches some row