RDFizer - ALIADA/aliada-tool GitHub Wiki
As part of the main goal of the ALIADA tool, the RDFizer is, as the name suggests, a conversion / translation engine. It supports several input formats and translates data in those formats into RDF statements. The main component of the RDFizer is an asynchronous pipeline, built by means of Apache Camel [5], that divides the translation process into several steps (e.g. input validation, format detection, stream splitting, conversion, indexing).
At the moment the asynchronous channels are implemented completely in memory (using seda or direct endpoints). However, thanks to Apache Camel and its high degree of modularity, moving towards a more scalable and reliable implementation based on a message-oriented middleware (e.g. ActiveMQ) is just a matter of configuration.
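To give a concrete idea of how such a pipeline looks, here is a minimal sketch of a Camel route wiring the steps together over in-memory seda channels. The processor and endpoint names (validator, formatDetector, recordSplitter, rdfConverter, tripleStoreIndexer) are hypothetical, not the actual ALIADA ones:

```java
import org.apache.camel.builder.RouteBuilder;

// Minimal sketch of an asynchronous conversion pipeline in Apache Camel.
// All bean names below are illustrative placeholders.
public class RdfizerRouteSketch extends RouteBuilder {
    @Override
    public void configure() {
        from("seda:incoming")                 // in-memory asynchronous channel
            .to("bean:validator")             // input validation
            .to("bean:formatDetector")        // format detection
            .split().method("recordSplitter") // one chunk per record
                .parallelProcessing()
                .to("bean:rdfConverter")      // record -> RDF triples
                .to("seda:indexing");         // hand over to the indexing stage

        from("seda:indexing")
            .to("bean:tripleStoreIndexer");   // send triples to the RDF store
    }
}
```

Switching the seda endpoints to broker-backed ones (e.g. ActiveMQ queues) would move the pipeline onto a MOM without touching the processing logic.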
The module artifact itself is a web application, which manages the pipeline and offers a REST interface (see below) for creating, monitoring and managing jobs. For technical information (e.g. how it is implemented, how to extend the RDFizer with additional formats) please see the Developer Guide.
The overall project started with three kinds of target data: LIDO, MARCXML and DC. For information about their support and the current status of the project, see the Roadmap page. The RDFizer has been built with extensibility in mind, meaning that, for a developer, it is relatively easy to add a new format and to define a new kind of translation based on a different ontology. For this kind of information, see the Developer pages.
The DC format has been used in the ALIADA tool as a reference for converting data coming from one of the project partners, a museum. The DC data has been exported from Drupal, which that institution uses as its Content Management System. In general the DC mapping is easier than that of the other formats, as DC has a simple (more or less flat) structure.
The LIDO format has been used in the ALIADA tool as a reference for converting data coming from museums. Within the project source code we included a sample LIDO record [1] coming from MFAB. The current translation engine makes use of templates that translate LIDO data into RDF using the CIDOC-CRM [6] part of the EFRBROO [3] ontology. Beside that sample LIDO record you can find the corresponding RDF output [2].
MARCXML is the XML version of the standard bibliographic format and is used, as you can imagine, to translate bibliographic records. The chosen ontology, EFRBROO [3], implies a more complex conversion process because of the underlying FRBR specification [4]: before the conversion job starts, a preliminary FRBR entity detection is needed. Specifically, starting from a given MARCXML record, the system tries to detect the following FRBR entities:
- Works, Expressions, Manifestations;
- Persons, Organizations, Groups;
- Themes.
Item entities, which are part of the first group, haven't been considered at the moment, since they represent local data and are therefore of little relevance for a linked data catalog.
At the end of the entity detection process the system assigns a URI to each detected entity. With that set of URIs the conversion process can take place, applying the EFRBROO ontology.
The conversion pipeline is composed of several steps, each of which executes a small piece of the overall translation job.
The input data must be available in a file, and that file is checked in order to make sure a set of formal preconditions is met. Specifically, the file must exist and must have read and write permissions (the system will need to move the file during the conversion process). In addition, once invoked (with an identifier), the newJob service will look for a row in the rdfizer_job_instances table. That row is supposed to contain all the information needed for a proper translation.
The incoming stream needs to be "sized". This is not strictly required for the conversion itself but for monitoring, in order to report the progress percentage of the overall job, so it is the first phase of the conversion process. Without this information the RDFizer cannot track the progress of the overall job and therefore cannot tell when a conversion has completed.
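The wiki doesn't show how the sizing is implemented, but a straightforward way to size an XML stream is a single streaming pass that counts record elements before the conversion starts. A minimal sketch using StAX (the class name and the record element parameter are ours, not ALIADA's):

```java
import java.io.FileInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Counts the records in an XML datafile up front so that progress can later
// be reported as processed / total. For MARCXML the record element is "record".
public final class RecordCounter {
    public static int count(final String datafile, final String recordElement) throws Exception {
        int total = 0;
        final XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(datafile));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && recordElement.equals(reader.getLocalName())) {
                total++;
            }
        }
        reader.close();
        return total;
    }
}
```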
Although the Semantic Web follows the AAA principle (Anyone can say Anything on Any topic), the module offers a complementary validation service that checks the formal adherence of the produced triples to the ALIADA ontology. In case one or more violations are raised, the job is stopped and the validation messages are stored within the job definition. In this way, a requestor can see and analyze those messages using the getJob REST service (described below).
Assuming that an input file, regardless of the format, will include more than one record, this step splits the incoming stream into several chunks; each chunk corresponds to a record and will be processed in parallel.
As briefly explained, this step, valid only for MARCXML records, detects the FRBR entities and assigns each of them an identity (URI). The entity detection rules can be found within the ALIADA configuration files. Each entity has been associated with one or more "extractors" that are in charge of selecting values from the bibliographic record and using them to identify that entity. Keep in mind that if, at the end of this process, no entity of the first FRBR group has been detected for a given record, that record is discarded. This event is reported in the RDFizer log file.
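The wiki doesn't show the extractor contract, but the idea can be sketched as a small interface; the names below are illustrative, not the actual ALIADA types:

```java
import java.util.List;

import org.w3c.dom.Document;

// Hypothetical sketch of the "extractor" concept: each FRBR entity type is
// associated with one or more extractors that pull the identifying values
// (e.g. title and author heading for a Work) out of a bibliographic record.
public interface Extractor {

    // Returns the values that identify one entity within the given MARCXML
    // record; an empty list means the entity could not be detected.
    List<String> extract(Document marcxmlRecord);
}
```

The detection step can then derive a URI for each entity from those values (e.g. by combining the configured namespace, the entity type and the extracted values); if no extractor of the first FRBR group yields values, the record is discarded as described above.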
The conversion itself happens by means of a conversion template that the engine associates with each record. The output of this step is a set of RDF triples in N3 format. ALIADA uses Apache Velocity as the templating language for creating conversion templates; it is easy, fast and it allowed us to create a kind of internal DSL, which makes an extension of the overall templating engine easy (if, for example, you want to manage different MARC dialects).
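A minimal sketch of such template-driven conversion with the Velocity API; the template path and context keys are illustrative, not the real ALIADA ones:

```java
import java.io.StringWriter;

import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

// Renders a (hypothetical) N3 conversion template against the values
// extracted from one record.
public class TemplateConversionSketch {
    public static void main(final String[] args) {
        final VelocityEngine engine = new VelocityEngine();
        engine.setProperty("resource.loader", "class");
        engine.setProperty("class.resource.loader.class",
                "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
        engine.init();

        // Values that an upstream step would have extracted from the record.
        final VelocityContext context = new VelocityContext();
        context.put("workUri", "http://yourorganization.org/rdf/id/resource/Work/1234");
        context.put("title", "On the Origin of Species");

        final StringWriter n3 = new StringWriter();
        engine.getTemplate("templates/work.n3.vm").merge(context, n3);
        System.out.println(n3); // the resulting triples in N3 format
    }
}
```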
As part of the conversion process, we integrated (in release 2.0) a Named Entity Extractor, which detects a set of entities within the unstructured fields of the input records (e.g. a note field of a MARC record). The extracted entities are then converted into RDF following the ALIADA ontology. This is just an example:
```
Darwin rdf:type skos:Concept .
PersonsCollection skos:member Darwin .
E19_Physical_Object1 ecrm:P137_exemplifies Darwin .
Darwin skos:prefLabel "Darwin" .
```
The integrated NER engine comes from the Stanford NLP Project [8]. Depending on the configured classifiers, it can detect several entity types such as Person, Organization, Place, Time, Date, Money and Misc.
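A hedged sketch of how a classifier like the one configured below (see the ner.classifier property) can be used through the Stanford NER API; the input string is made up for illustration:

```java
import java.util.List;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

// Detects named entities within an unstructured field (e.g. a MARC note).
public class NerSketch {
    public static void main(final String[] args) throws Exception {
        final CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "/work/data/aliada/classifiers/english.all.7class.distsim.crf.ser.gz");

        final String note = "Introduction by Charles Darwin, London, 1859.";

        // Each triple is (entity type, begin offset, end offset).
        final List<Triple<String, Integer, Integer>> entities =
                classifier.classifyToCharacterOffsets(note);
        for (final Triple<String, Integer, Integer> entity : entities) {
            System.out.println(entity.first() + ": "
                    + note.substring(entity.second(), entity.third()));
        }
    }
}
```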
The RDF data produced in the previous steps is sent to the configured RDF store.
The RDFizer module offers a REST interface [7] for managing domain objects and for creating / executing conversion jobs. The following sections describe that interface.
The examples below refer to a sample installation available at http://yourserver.org/rdfizer, so you must replace that dummy address with the address where your server is actually running / listening.
These services allow a client to turn the RDFizer on / off. If the RDFizer is turned off, it keeps running but won't accept any further requests. Both services are idempotent, meaning that several subsequent invocations of the same method won't have any further effect on the system state.
The following table describes these services:
| Attribute | Value | Description |
|---|---|---|
| Method | PUT | N.A. |
| Address | http://yourserver.org/rdfizer/enable | Enable the RDFizer. |
| Address | http://yourserver.org/rdfizer/disable | Disable the RDFizer. |
The following table summarizes the responses of these services:
| Status Code | HTTP Label | Description |
|---|---|---|
| 200 | Ok | The request has been accepted and the RDFizer is now in the requested state. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
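For instance, disabling the RDFizer from a client could look like this (a sketch using Java 11's built-in HTTP client; replace yourserver.org with your actual installation address):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sends a PUT request that disables the RDFizer.
public class ToggleRdfizer {
    public static void main(final String[] args) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://yourserver.org/rdfizer/disable"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        final HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println(response.statusCode()); // 200 when the state changed
    }
}
```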
This service allows a client to create a new conversion job. This is the core service of the module and, although the semantics of the invocation are very simple, it is important to describe what happens behind the scenes.
The following table describes this service:
| Attribute | Value | Description |
|---|---|---|
| Method | PUT | N.A. |
| Address | http://yourserver.org/rdfizer/jobs/{jobId} | The {jobId} is the identifier of the requested job. |
As you can see, the system doesn't expect a lot of parameters: just an identifier that will be associated with a job instance. In order to create a valid job instance, the RDFizer expects a row in the rdfizer_job_instances table having that identifier as primary key. That table contains everything the RDFizer needs for creating and running the new job. Specifically:
| Column | Value / Domain / Example | Description |
|---|---|---|
| job_id | 283929 (example) | A unique INTEGER identifier of the job instance that is going to be created. |
| datafile | /pluto/pippo/paperino.xml (example) | The absolute path of the input datafile. The file must be readable and writable. |
| format | lido, marcxml (domain) | A code indicating the format of the input datafile. Valid values (at the moment) are lido and marcxml. |
| namespace | http://yourorganization.org/rdf# (example) | The prefix that will be used for generating URIs. Note that the system will prepend id/resource to that value. |
| aliada_ontology | http://aliada-project.eu/2014/aliada-ontology/ (example) | The prefix used in the ALIADA ontology. |
So, concretely, when a request is received with identifier 1234, the RDFizer queries the database for a row with 1234 as primary key, retrieves all the configuration above, performs some preliminary validations and starts the job.
It is important to underline that this service is asynchronous, meaning that a 201 Created response doesn't mean the job has been completed but only that it has been created. The effective start of that job is not under the control of the caller: on its side, the client can use the monitoring services (REST or JMX) in order to follow the progress of the job.
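Putting the pieces together, a hedged end-to-end sketch: first insert the configuration row, then ask the RDFizer to create the job. The JDBC URL and credentials are placeholders, and the remaining columns described further below (e.g. graph_name, sparql_endpoint_uri) are omitted for brevity:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Creates the job configuration row and then triggers the newJob service.
public class CreateJobSketch {
    public static void main(final String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost/aliada", "user", "password");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO rdfizer_job_instances "
                   + "(job_id, datafile, format, namespace, aliada_ontology) "
                   + "VALUES (?, ?, ?, ?, ?)")) {
            insert.setInt(1, 1234);
            insert.setString(2, "/pluto/pippo/paperino.xml");
            insert.setString(3, "marcxml");
            insert.setString(4, "http://yourorganization.org/rdf#");
            insert.setString(5, "http://aliada-project.eu/2014/aliada-ontology/");
            insert.executeUpdate();
        }

        final HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://yourserver.org/rdfizer/jobs/1234"))
                        .PUT(HttpRequest.BodyPublishers.noBody())
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 201 means the job was created
    }
}
```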
The following table summarizes all possible responses of this service:
| Status Code | HTTP Label | Description |
|---|---|---|
| 201 | Created | The job has been created, it will be started as soon as possible. |
| 406 | Not Acceptable | The RDFizer cannot accept the request because it has been disabled. |
| 400 | Bad Request | The RDFizer detected a bad request (the job identifier parameter is missing, the datafile is not valid, or the format is unknown). |
| 404 | Not Found | The system cannot find a job configuration associated with the given identifier. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
This service allows a client to get a detailed summary of a given job. It accepts only one parameter: the job identifier.
The following table describes this service:
| Attribute | Value | Description |
|---|---|---|
| Method | GET | N.A. |
| Address | http://yourserver.org/rdfizer/jobs/{jobId} | The {jobId} is the identifier of the requested job. |
The following table summarizes all possible responses of this service:
| Status Code | HTTP Label | Description |
|---|---|---|
| 200 | Ok | The request has been accepted and the job summary is returned in the response body. |
| 400 | Bad Request | The RDFizer detected a bad request (the job identifier parameter is missing). |
| 404 | Not Found | The system cannot find a job configuration associated with the given identifier. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
In case of a 200 OK, the job data is returned, depending on the client's Accept header, in JSON or XML format. The system defaults to XML. This is an example of a response:
```xml
<job>
<id>1</id>
<completed>false</completed>
<running>true</running>
<format>lido</format>
<total-records-count>12992</total-records-count>
<processed-records-count>828</processed-records-count>
<output-statements-count>919228</output-statements-count>
<records-throughput>122.2</records-throughput>
<triples-throughput>2547.91</triples-throughput>
<status-code>-1</status-code>
<validation-message>
<description>
Conflicts
- Warning ("range check"): "Literal value for object property (prop, value)"
Culprit = http://www.szepmuveszeti.hu/id/resource/E19_Physical_Object/szepmuveszeti.hu_object_29
Implicated node: http://erlangen-crm.org/current/P1_is_identified_by
Implicated node: 'PippoPlutoEPaperino'
</description>
<messageType>range check</messageType>
</validation-message>
<validation-message>
<description>
Conflicts
- Warning ("range check"): "Literal value for object property (prop, value)"
...
</description>
<messageType>...</messageType>
</validation-message>
...
</job>
```

The following table describes the attributes shown in the example:
| Attribute | Description |
|---|---|
| id | The job identifier. |
| completed | A value of true means all records belonging to this job have been processed. |
| running | A value of true means that the job is still running. Although similar, this is not the same information as the previous attribute, because a job could be in a running state but not yet completed. |
| format | The format associated with the job. |
| total-records-count | The total number of records belonging to the job. |
| processed-records-count | The total number of records of this job that have been processed. |
| output-statements-count | The total number of statements emitted by this job. |
| records-throughput | Record processing throughput in terms of records / sec. |
| triples-throughput | Triples production throughput in terms of triples / sec. |
| status-code | The status code of the job. 0 if the job status is Ok, -1 if a validation error occurs. |
| validation-message | If a validation error occurs, each validation message is reported. |
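As an example, polling a job summary in JSON instead of the XML default is just a matter of setting the Accept header (again a sketch based on Java 11's HTTP client):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Retrieves the summary of job 1234 in JSON format.
public class GetJobSketch {
    public static void main(final String[] args) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://yourserver.org/rdfizer/jobs/1234"))
                .header("Accept", "application/json")
                .GET()
                .build();
        final HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```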
The module configuration is very simple: the RDFizer has just a few properties.
```properties
marcxml.input.dir=/work/data/aliada/marcxml
lido.input.dir=/work/data/aliada/lido
dc.input.dir=/work/data/aliada/dc
auth.input.dir=/work/data/aliada/auth
ner.classifier=/work/data/aliada/classifiers/english.all.7class.distsim.crf.ser.gz
```
| Property | Description |
|---|---|
| marcxml.input.dir | This is the listen directory for MARCXML input data. NOTE: the input datafile mustn't be placed in this directory manually; it's up to the RDFizer itself to move the data associated with an incoming request into this folder. |
| lido.input.dir | Same meaning of the marcxml.input.dir but this is for LIDO datafiles. |
| dc.input.dir | Same meaning of the marcxml.input.dir but this is for Dublin Core datafiles. |
| auth.input.dir | Same meaning of the marcxml.input.dir but this is for authority records. |
| ner.classifier | The fully qualified (i.e. absolute) path of the classifier that will be used for Named Entity Recognition. You can find a sample file in the repository (https://github.com/ALIADA/aliada-tool/tree/master/aliada/src/site/nlp) or you can download one of the available classifiers here: http://nlp.stanford.edu/software/CRF-NER.shtml |
The system resolves these properties with the following logic (see the sketch after this list):
- First, if a file called pipeline-settings.properties exists on the classpath, it is loaded;
- Then, for those properties not specified in the preceding file, the system loads default values from the bundled default-pipeline-settings.properties. Note that since most of the settings are paths, the default values make sense only for development purposes (unless the production paths happen to match the development paths).
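A sketch of that lookup logic (the file names come from the description above, while the class name is ours); loading the bundled defaults first and then overriding them with the optional classpath file is equivalent to the two steps just described:

```java
import java.io.InputStream;
import java.util.Properties;

// Loads the bundled defaults and lets an optional
// pipeline-settings.properties on the classpath override them.
public final class PipelineSettings {
    public static Properties load() throws Exception {
        final Properties settings = new Properties();
        try (InputStream defaults = PipelineSettings.class
                .getResourceAsStream("/default-pipeline-settings.properties")) {
            settings.load(defaults); // bundled development defaults
        }
        try (InputStream overrides = PipelineSettings.class
                .getResourceAsStream("/pipeline-settings.properties")) {
            if (overrides != null) {
                settings.load(overrides); // user-provided values win
            }
        }
        return settings;
    }
}
```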
The RDFizer internally relies on JMX for managing and monitoring domain objects and their lifecycle. At the time of writing, two main MBean types are available in the module: the first is the MBean associated with the RDFizer itself, the second is a transient entity created for each job. The latter records, while the job is in progress, record and triple stats / throughput.
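If you just want to see what is registered, you can browse the MBeans with jconsole or, from inside the same JVM, with a few lines of code; the wildcard pattern below lists everything, since the actual ALIADA object names are not documented here:

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists the MBeans registered in the current JVM; among them you should
// find the RDFizer MBean and one transient MBean per running job.
public class JmxBrowseSketch {
    public static void main(final String[] args) throws Exception {
        final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (final ObjectName name : server.queryNames(new ObjectName("*:*"), null)) {
            System.out.println(name);
        }
    }
}
```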
The RDFizer needs an RDBMS to store job data. Specifically, it expects two tables:
- rdfizer_job_instances: each row contains all configuration data for a specific job instance;
| Attribute | Description |
|---|---|
| id | The job identifier. |
| datafile | The datafile associated with this job instance. |
| format | The format associated with the job. Valid values at the moment are lido / marcxml |
| namespace | The prefix that will be used when generating URIs (i.e. the owning institution's namespace). |
| graph_name | The name (i.e. URI) of the graph that will be associated with the job. |
| aliada_ontology | The URI of the Aliada ontology used in the translation. |
| sparql_endpoint_uri | The URL of the SPARQL endpoint where triples will be inserted. |
| sparql_endpoint_login | The user that will be used in RDFStore connections. |
| sparql_endpoint_password | The password that will be used in RDFStore connections. |
| start_date | Job start date. |
| end_date | Job end date. |
- rdf_job_stats: once a job is completed, useful stats are collected in this table. Prior to that, runtime stats about a given job are available via JMX. Note that this is an internal detail, because the getJob REST service provides the necessary abstraction (i.e. if the job is completed the stats come from this table; if it isn't, the MBean provides that information).
| Attribute | Description |
|---|---|
| job_id | The job identifier. |
| total_records_count | The total number of records belonging to the job. |
| total_triples_produced | The total number of statements emitted by this job. |
| records_throughput | Record processing throughput in terms of records / sec. |
| triples_throughput | Triples production throughput in terms of triples / sec. |
You can find the DDL of those two tables (beside all the other needed DDL) in the project at the following path:
https://github.com/ALIADA/aliada-tool/tree/master/aliada/src/site/database
As a general note, each script name has a numeric prefix that indicates the execution order.
[1] https://github.com/ALIADA/aliada-tool/tree/master/aliada/aliada-rdfizer/src/test/resources/lidoMFAB.xml
[2] https://github.com/ALIADA/aliada-tool/tree/master/aliada/aliada-rdfizer/src/test/resources/lidoMFAB-CIDOC-CRM.rdf
[3] http://erlangen-crm.org/efrbroo/
[4] http://www.loc.gov/cds/downloads/FRBR.PDF
[5] http://camel.apache.org
[6] http://www.cidoc-crm.org/
[7] http://en.wikipedia.org/wiki/Representational_state_transfer
[8] http://nlp.stanford.edu/software/CRF-NER.shtml