Processing model - epimorphics/dclib GitHub Wiki
## Basic processing model
```
context x template x CSV stream -> StreamRDF
```
Context
This provides:
Element | Notes
---|---
templates | Set of global templates that can be called by name
bindings | Set of default global variable bindings
prefixes | Default RDF prefix mapping
sources | Set of reconciliation sources that can be referred to by templates
parent | Parent context
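As a concrete illustration of the elements above (the JSON shape and all names here are assumptions for illustration only, not dclib's actual context syntax), a context might look like:

```json
{
  "prefixes": {
    "skos": "http://www.w3.org/2004/02/skos/core#"
  },
  "bindings": {
    "baseURI": "http://example.com/id/"
  },
  "templates": ["shared-templates.json"],
  "sources": ["regions-authority-list"],
  "parent": "globalDefaultContext"
}
```

The parent link is what allows a deployment to layer a more specific context over global defaults.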
In the command line tool there is no global context.
In a web service version we would expect a nest of contexts, with a default global context extended by user- or project-specific contexts.
## Template

The template can be a composite template, which allows for variable bindings and in-line definition of other sources and templates.
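Sketching this (the field names below are illustrative guesses at a composite template structure and are not guaranteed to match dclib's actual template specification):

```json
{
  "name": "exampleConverter",
  "type": "Composite",
  "bind": { "base": "http://example.com/id/" },
  "sources": [
    { "name": "regions", "source": "regions.csv" }
  ],
  "templates": [
    "row-template.json"
  ]
}
```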
## Processing
If there are any one-off templates, these are run first to generate global data/metadata.
The input is then processed one row at a time to generate a set of output triples. Hierarchical templates may retain the state of the most recent parent at each level of the hierarchy.
Only triples are generated at this point; there is no support for producing multiple named graphs from a single conversion.
The outputs can be streamed or aggregated into an RDF model.
## Web service packaging
Not implemented
Web invocation:
- supply converter names and key value pairs using query parameters
- POST single csv file
- POST single zip file containing source csv plus zero or more templates and auxiliary files
- POST multi-part form data containing source csv plus zero or more auxiliary files
Return options:
- return the converted RDF directly as Turtle (200), or a human-readable HTML page of conversion errors (400)
- return the location of an asynchronous status monitor through which processing progress can be tracked and which eventually provides a URL for downloading the result
## Command line packaging
```
java -jar dclib.jar [options] template.json [aux-template.json ...] data.csv
```
Takes a template (which may be composite) and zero or more auxiliary templates which can be referred to by name from the main template, and transforms the given CSV file to produce a Turtle result on stdout (or a set of error messages).
```
java -jar dclib.jar [options] [--nThreads 4] [--compress] --batch batchFile
```

Takes a batch file of the form:
```
template1.yaml data1.csv
template2.yaml data2.csv
...
```
Each line specifies one conversion job and generates an output `data1.ttl`, or `data1.ttl.gz` if compressed (see below). The jobs are started in separate parallel threads (up to a default of 4, configurable with `--nThreads`), which can improve processing time depending on the nature of the storage IO.
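For example, a batch run might be set up like this (file names are placeholders, and the dclib command itself is shown commented out since it needs the jar and real input files):

```shell
# Create a two-job batch file in the form described above (names are placeholders).
cat > batchFile <<'EOF'
template1.yaml data1.csv
template2.yaml data2.csv
EOF

# Run both jobs in parallel, gzip-compressing the outputs to
# data1.ttl.gz and data2.ttl.gz (commented out: requires dclib.jar):
# java -jar dclib.jar --nThreads 4 --compress --batch batchFile

cat batchFile
```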
In both cases `[options]` can include:
Option | Meaning
---|---
`--debug` | Generate (verbose) debug information at each step
`--streaming` | Stream the output; without this the data is generated in memory first and then written out
`--ntriples` | Output as N-Triples rather than Turtle; in batch mode output files will have the suffix .nt or .nt.gz
`--abortIfRowFails` | Force the conversion to fail if no matching template is found for some row
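Putting the pieces together, a single-file conversion streaming N-Triples might be invoked as below (file names are illustrative; the command is assembled as a string rather than executed here, since it needs dclib.jar and real inputs):

```shell
# Assemble an illustrative dclib invocation (not executed here).
JAR="dclib.jar"                # placeholder path to the jar
OPTS="--streaming --ntriples" # stream output as N-Triples
CMD="java -jar $JAR $OPTS template.json aux-template.json data.csv"

# Redirecting stdout captures the converted data:
echo "$CMD > data.nt"
```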