RDFizer - ALIADA/aliada-tool GitHub Wiki
As part of the main goal of the ALIADA tool, the RDFizer is, as the name suggests, a conversion / translation engine. It supports several input formats and translates data in those formats into RDF statements. The main component of the RDFizer is an asynchronous pipeline, built by means of Apache Camel [5], that divides the translation process into several steps (e.g. input validation, format detection, stream splitting, conversion, indexing).
At the moment the asynchronous channels are implemented completely in memory (using seda or direct endpoints). However, thanks to Apache Camel and its high degree of modularity, moving towards a more scalable and reliable implementation based on a message-oriented middleware (e.g. ActiveMQ) is just a matter of configuration.
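To give a concrete idea of how such a pipeline looks, here is a minimal sketch of a Camel route wiring the steps together over in-memory seda channels. The processor and endpoint names (validator, formatDetector, recordSplitter, rdfConverter, tripleStoreIndexer) are hypothetical, not the actual ALIADA ones:

```java
import org.apache.camel.builder.RouteBuilder;

// Minimal sketch of an asynchronous conversion pipeline in Apache Camel.
// All bean names below are illustrative placeholders.
public class RdfizerRouteSketch extends RouteBuilder {
    @Override
    public void configure() {
        from("seda:incoming")                 // in-memory asynchronous channel
            .to("bean:validator")             // input validation
            .to("bean:formatDetector")        // format detection
            .split().method("recordSplitter") // one chunk per record
                .parallelProcessing()
                .to("bean:rdfConverter")      // record -> RDF triples
                .to("seda:indexing");         // hand over to the indexing stage

        from("seda:indexing")
            .to("bean:tripleStoreIndexer");   // send triples to the RDF store
    }
}
```

Switching the seda endpoints to broker-backed ones (e.g. ActiveMQ queues) would move the pipeline onto a MOM without touching the processing logic.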
The module artifact itself is a web application, which manages the pipeline and offers a REST interface (see below) for creating, monitoring and managing jobs. For technical information (e.g. how it is implemented, how to extend the RDFizer with additional formats) please see the Developer Guide.
The overall project started with three kinds of target data: LIDO, MARCXML and DC. For information about their support and the current status of the project, see the Roadmap page. The RDFizer has been built with extensibility in mind, meaning that, for a developer, it is relatively easy to add a new format and to define a new kind of translation based on a different ontology. For this kind of information, see the Developer pages.
The DC format has been used in the ALIADA tool as a reference for converting data coming from one of the project partners, a museum. The DC data has been exported from Drupal, which that institution uses as its Content Management System. In general the DC mapping is easier than that of the other formats, as DC has a simple (more or less flat) structure.
The LIDO format has been used in the ALIADA tool as a reference for converting data coming from museums. Within the project source code we included a sample LIDO record [1] coming from MFAB. The current translation engine makes use of templates that translate LIDO data into RDF using the CIDOC-CRM [6] part of the EFRBROO [3] ontology. Beside that sample LIDO record you can find the corresponding RDF output [2].
MARCXML is the XML version of the standard bibliographic format and is used, as you can imagine, to translate bibliographic records. The chosen ontology, EFRBROO [3], implies a more complex conversion process because of the underlying FRBR specification [4]: before the conversion job starts, a preliminary FRBR entity detection is needed. Specifically, starting from a given MARCXML record, the system tries to detect the following FRBR entities:
- Works, Expressions, Manifestations;
- Persons, Organizations, Groups;
- Themes.
Item entities, which are part of the first group, haven't been considered at the moment, since they represent local data and are therefore of little relevance for a linked data catalog.
At the end of the entity detection process the system assigns a URI to each detected entity. With that set of URIs the conversion process can take place, applying the EFRBROO ontology.
The conversion pipeline is composed of several steps, each of which executes a small piece of the overall translation job.
The input data must be available in a file, and that file is checked in order to make sure a set of formal preconditions is met. Specifically, the file must exist and must have read and write permissions (the system will need to move the file during the conversion process). In addition, once invoked (with an identifier), the newJob service will look for a row in the rdfizer_job_instances table. That row is supposed to contain all the information needed for a proper translation.
The incoming stream needs to be "sized". This is not strictly required for the conversion itself but for monitoring, in order to report the progress percentage of the overall job, so it is the first phase of the conversion process. Without this information the RDFizer cannot track the progress of the overall job and therefore cannot tell when a conversion has completed.
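The wiki doesn't show how the sizing is implemented, but a straightforward way to size an XML stream is a single streaming pass that counts record elements before the conversion starts. A minimal sketch using StAX (the class name and the record element parameter are ours, not ALIADA's):

```java
import java.io.FileInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Counts the records in an XML datafile up front so that progress can later
// be reported as processed / total. For MARCXML the record element is "record".
public final class RecordCounter {
    public static int count(final String datafile, final String recordElement) throws Exception {
        int total = 0;
        final XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(datafile));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && recordElement.equals(reader.getLocalName())) {
                total++;
            }
        }
        reader.close();
        return total;
    }
}
```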
Although the Semantic Web follows the AAA principle (Anyone can say Anything on Any topic), the module offers a complementary validation service that checks the formal adherence of the produced triples to the ALIADA ontology. In case one or more violations are raised, the job is stopped and the validation messages are stored within the job definition. In this way, a requestor can see and analyze those messages using the getJob REST service (described below).
Assuming that an input file, regardless of the format, will include more than one record, this step splits the incoming stream into several chunks; each chunk corresponds to a record and will be processed in parallel.
As briefly explained, this step, valid only for MARCXML records, detects the FRBR entities and assigns each of them an identity (URI). The entity detection rules can be found within the ALIADA configuration files. Each entity has been associated with one or more "extractors" that are in charge of selecting values from the bibliographic record and using them to identify that entity. Keep in mind that if, at the end of this process, no entity of the first FRBR group has been detected for a given record, that record is discarded. This event is reported in the RDFizer log file.
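The wiki doesn't show the extractor contract, but the idea can be sketched as a small interface; the names below are illustrative, not the actual ALIADA types:

```java
import java.util.List;

import org.w3c.dom.Document;

// Hypothetical sketch of the "extractor" concept: each FRBR entity type is
// associated with one or more extractors that pull the identifying values
// (e.g. title and author heading for a Work) out of a bibliographic record.
public interface Extractor {

    // Returns the values that identify one entity within the given MARCXML
    // record; an empty list means the entity could not be detected.
    List<String> extract(Document marcxmlRecord);
}
```

The detection step can then derive a URI for each entity from those values (e.g. by combining the configured namespace, the entity type and the extracted values); if no extractor of the first FRBR group yields values, the record is discarded as described above.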
The conversion itself happens by means of a conversion template that the engine associates with each record. The output of this step is a set of RDF triples in N3 format. ALIADA uses Apache Velocity as the templating language for creating conversion templates; it is easy, fast and it allowed us to create a kind of internal DSL, which makes an extension of the overall templating engine easy (if, for example, you want to manage different MARC dialects).
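A minimal sketch of such template-driven conversion with the Velocity API; the template path and context keys are illustrative, not the real ALIADA ones:

```java
import java.io.StringWriter;

import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

// Renders a (hypothetical) N3 conversion template against the values
// extracted from one record.
public class TemplateConversionSketch {
    public static void main(final String[] args) {
        final VelocityEngine engine = new VelocityEngine();
        engine.setProperty("resource.loader", "class");
        engine.setProperty("class.resource.loader.class",
                "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
        engine.init();

        // Values that an upstream step would have extracted from the record.
        final VelocityContext context = new VelocityContext();
        context.put("workUri", "http://yourorganization.org/rdf/id/resource/Work/1234");
        context.put("title", "On the Origin of Species");

        final StringWriter n3 = new StringWriter();
        engine.getTemplate("templates/work.n3.vm").merge(context, n3);
        System.out.println(n3); // the resulting triples in N3 format
    }
}
```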
As part of the conversion process, we integrated (in release 2.0) a Named Entity Extractor, which detects a set of entities within the unstructured fields of the input records (e.g. a note field of a MARC record). The extracted entities are then converted into RDF following the ALIADA ontology. This is just an example:
```
Darwin rdf:type skos:Concept .
PersonsCollection skos:member Darwin .
E19_Physical_Object1 ecrm:P137_exemplifies Darwin .
Darwin skos:prefLabel "Darwin" .
```
The integrated NER engine comes from the Stanford NLP Project [8]. Depending on the configured classifiers, it can detect several entity types such as Person, Organization, Place, Time, Date, Money and Misc.
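A hedged sketch of how a classifier like the one configured below (see the ner.classifier property) can be used through the Stanford NER API; the input string is made up for illustration:

```java
import java.util.List;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

// Detects named entities within an unstructured field (e.g. a MARC note).
public class NerSketch {
    public static void main(final String[] args) throws Exception {
        final CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "/work/data/aliada/classifiers/english.all.7class.distsim.crf.ser.gz");

        final String note = "Introduction by Charles Darwin, London, 1859.";

        // Each triple is (entity type, begin offset, end offset).
        final List<Triple<String, Integer, Integer>> entities =
                classifier.classifyToCharacterOffsets(note);
        for (final Triple<String, Integer, Integer> entity : entities) {
            System.out.println(entity.first() + ": "
                    + note.substring(entity.second(), entity.third()));
        }
    }
}
```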
The RDF data produced in the previous steps is sent to the configured RDF store.
The RDFizer module offers a REST interface [7] for managing domain objects and for creating / executing conversion jobs. The following sections describe that interface.
The examples below refer to a sample installation available at http://yourserver.org/rdfizer, so you must replace that dummy address with the address where your server is actually running / listening.
These services allow a client to turn the RDFizer on / off. If the RDFizer is turned off, it keeps running but won't accept any further requests. Both services are idempotent, meaning that several subsequent invocations of the same method won't have any further effect on the system state.
The following table describes these services:
| Attribute | Value | Description |
|---|---|---|
| Method | PUT | N.A. |
| Address | http://yourserver.org/rdfizer/enable | Enable the RDFizer. |
| Address | http://yourserver.org/rdfizer/disable | Disable the RDFizer. |
The following table summarizes the responses of these services:
| Status Code | HTTP Label | Description |
|---|---|---|
| 200 | Ok | The request has been accepted and the RDFizer is now in the requested state. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
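For instance, disabling the RDFizer from a client could look like this (a sketch using Java 11's built-in HTTP client; replace yourserver.org with your actual installation address):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sends a PUT request that disables the RDFizer.
public class ToggleRdfizer {
    public static void main(final String[] args) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://yourserver.org/rdfizer/disable"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        final HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println(response.statusCode()); // 200 when the state changed
    }
}
```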
This service allows a client to create a new conversion job. This is the core service of the module and, although the semantics of the invocation are very simple, it is important to describe what happens behind the scenes.
The following table describes this service:
| Attribute | Value | Description |
|---|---|---|
| Method | PUT | N.A. |
| Address | http://yourserver.org/rdfizer/jobs/{jobId} | The {jobId} is the identifier of the requested job. |
As you can see, the system doesn't expect a lot of parameters: just an identifier that will be associated with a job instance. In order to create a valid job instance, the RDFizer expects a row in the rdfizer_job_instances table having that identifier as primary key. That table contains everything the RDFizer needs for creating and running the new job. Specifically:
| Column | Value / Domain / Example | Description |
|---|---|---|
| job_id | 283929 (example) | A unique INTEGER identifier of the job instance that is going to be created. |
| datafile | /pluto/pippo/paperino.xml (example) | The absolute path of the input datafile. The file must be readable and writable. |
| format | lido, marcxml (domain) | A code indicating the format of the input datafile. Valid values (at the moment) are lido and marcxml. |
| namespace | http://yourorganization.org/rdf# (example) | The prefix that will be used for generating URIs. Note that the system will prepend id/resource to that value. |
| aliada_ontology | http://aliada-project.eu/2014/aliada-ontology/ (example) | The prefix used in the ALIADA ontology. |
So, concretely, when a request is received with identifier 1234, the RDFizer queries the database for a row with 1234 as primary key, retrieves all the configuration above, performs some preliminary validations and starts the job.
It is important to underline that this service is asynchronous, meaning that a 201 Created response doesn't mean the job has been completed but only that it has been created. The effective start of that job is not under the control of the caller: on its side, the client can use the monitoring services (REST or JMX) in order to follow the progress of the job.
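Putting the pieces together, a hedged end-to-end sketch: first insert the configuration row, then ask the RDFizer to create the job. The JDBC URL and credentials are placeholders, and the remaining columns described further below (e.g. graph_name, sparql_endpoint_uri) are omitted for brevity:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Creates the job configuration row and then triggers the newJob service.
public class CreateJobSketch {
    public static void main(final String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost/aliada", "user", "password");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO rdfizer_job_instances "
                   + "(job_id, datafile, format, namespace, aliada_ontology) "
                   + "VALUES (?, ?, ?, ?, ?)")) {
            insert.setInt(1, 1234);
            insert.setString(2, "/pluto/pippo/paperino.xml");
            insert.setString(3, "marcxml");
            insert.setString(4, "http://yourorganization.org/rdf#");
            insert.setString(5, "http://aliada-project.eu/2014/aliada-ontology/");
            insert.executeUpdate();
        }

        final HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://yourserver.org/rdfizer/jobs/1234"))
                        .PUT(HttpRequest.BodyPublishers.noBody())
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 201 means the job was created
    }
}
```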
The following table summarizes all possible responses of this service:
| Status Code | HTTP Label | Description |
|---|---|---|
| 201 | Created | The job has been created, it will be started as soon as possible. |
| 406 | Not Acceptable | The RDFizer cannot accept the request because it has been disabled. |
| 400 | Bad Request | The RDFizer detected a bad request (the job identifier parameter is missing, the datafile is not valid, or the format is unknown). |
| 404 | Not Found | The system cannot find a job configuration associated with the given identifier. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
This service allows a client to get a detailed summary of a given job. It accepts only one parameter: the job identifier.
The following table describes this service:
| Attribute | Value | Description |
|---|---|---|
| Method | GET | N.A. |
| Address | http://yourserver.org/rdfizer/jobs/{jobId} | The {jobId} is the identifier of the requested job. |
The following table summarizes all possible responses of this service:
| Status Code | HTTP Label | Description |
|---|---|---|
| 200 | Ok | The request has been accepted and the job summary is returned in the response body. |
| 400 | Bad Request | The RDFizer detected a bad request (the job identifier parameter is missing). |
| 404 | Not Found | The system cannot find a job configuration associated with the given identifier. |
| 500 | Internal Server Error | This is a generic catch-all error. The system administrator can check the log file in order to see exactly what happened. Most probably the client should not retry the request, since this is a permanent failure. |
In case of a 200 OK, the job data is returned, depending on the client's Accept header, in JSON or XML format. The system defaults to XML. This is an example of a response:
```xml
<job>
<id>1</id>
<completed>false</completed>
<running>true</running>
<format>lido</format>
<total-records-count>12992</total-records-count>
<processed-records-count>828</processed-records-count>
<output-statements-count>919228</output-statements-count>
<records-throughput>122.2</records-throughput>
<triples-throughput>2547.91</triples-throughput>
<status-code>-1</status-code>
<validation-message>
<description>
Conflicts
- Warning ("range check"): "Literal value for object property (prop, value)"
Culprit = http://www.szepmuveszeti.hu/id/resource/E19_Physical_Object/szepmuveszeti.hu_object_29
Implicated node: http://erlangen-crm.org/current/P1_is_identified_by
Implicated node: 'PippoPlutoEPaperino'
</description>
<messageType>range check</messageType>
</validation-message>
<validation-message>
<description>
Conflicts
- Warning ("range check"): "Literal value for object property (prop, value)"
...
</description>
<messageType>...</messageType>
</validation-message>
...
</job>
```

The following table describes the attributes shown in the example:
| Attribute | Description |
|---|---|
| id | The job identifier. |
| completed | A value of true means all records belonging to this job have been processed. |
| running | A value of true means that the job is still running. Although similar, this is not the same information as the previous attribute, because a job could be in a running state but not yet completed. |
| format | The format associated with the job. |
| total-records-count | The total number of records belonging to the job. |
| processed-records-count | The total number of records of this job that have been processed. |
| output-statements-count | The total number of statements emitted by this job. |
| records-throughput | Record processing throughput in terms of records / sec. |
| triples-throughput | Triples production throughput in terms of triples / sec. |
| status-code | The status code of the job. 0 if the job status is Ok, -1 if a validation error occurs. |
| validation-message | If a validation error occurs, each validation message is reported. |
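As an example, polling a job summary in JSON instead of the XML default is just a matter of setting the Accept header (again a sketch based on Java 11's HTTP client):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Retrieves the summary of job 1234 in JSON format.
public class GetJobSketch {
    public static void main(final String[] args) throws Exception {
        final HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://yourserver.org/rdfizer/jobs/1234"))
                .header("Accept", "application/json")
                .GET()
                .build();
        final HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```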
The module configuration is very simple: the RDFizer has just a few properties.
```properties
marcxml.input.dir=/work/data/aliada/marcxml
lido.input.dir=/work/data/aliada/lido
dc.input.dir=/work/data/aliada/dc
auth.input.dir=/work/data/aliada/auth
ner.classifier=/work/data/aliada/classifiers/english.all.7class.distsim.crf.ser.gz
```
| Property | Description |
|---|---|
| marcxml.input.dir | This is the listen directory for MARCXML input data. NOTE: the input datafile mustn't be placed in this directory manually; it's up to the RDFizer itself to move the data associated with an incoming request into this folder. |
| lido.input.dir | Same meaning of the marcxml.input.dir but this is for LIDO datafiles. |
| dc.input.dir | Same meaning of the marcxml.input.dir but this is for Dublin Core datafiles. |
| auth.input.dir | Same meaning of the marcxml.input.dir but this is for authority records. |
| ner.classifier | The fully qualified (i.e. absolute) path of the classifier that will be used for Named Entity Recognition. You can find a sample file in the repository (https://github.com/ALIADA/aliada-tool/tree/master/aliada/src/site/nlp) or you can download one of the available classifiers here: http://nlp.stanford.edu/software/CRF-NER.shtml |
The system resolves these properties with the following logic (see the sketch after this list):
- First, if a file called pipeline-settings.properties exists on the classpath, it is loaded;
- Then, for those properties not specified in the preceding file, the system loads default values from the bundled default-pipeline-settings.properties. Note that since most of the settings are paths, the default values make sense only for development purposes (unless the production paths happen to match the development paths).
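A sketch of that lookup logic (the file names come from the description above, while the class name is ours); loading the bundled defaults first and then overriding them with the optional classpath file is equivalent to the two steps just described:

```java
import java.io.InputStream;
import java.util.Properties;

// Loads the bundled defaults and lets an optional
// pipeline-settings.properties on the classpath override them.
public final class PipelineSettings {
    public static Properties load() throws Exception {
        final Properties settings = new Properties();
        try (InputStream defaults = PipelineSettings.class
                .getResourceAsStream("/default-pipeline-settings.properties")) {
            settings.load(defaults); // bundled development defaults
        }
        try (InputStream overrides = PipelineSettings.class
                .getResourceAsStream("/pipeline-settings.properties")) {
            if (overrides != null) {
                settings.load(overrides); // user-provided values win
            }
        }
        return settings;
    }
}
```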
The RDFizer internally relies on JMX for managing and monitoring domain objects and their lifecycle. At the time of writing, two main MBean types are available in the module: the first is the MBean associated with the RDFizer itself, the second is a transient entity created for each job. The latter records, while the job is in progress, record and triple stats / throughput.
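If you just want to see what is registered, you can browse the MBeans with jconsole or, from inside the same JVM, with a few lines of code; the wildcard pattern below lists everything, since the actual ALIADA object names are not documented here:

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists the MBeans registered in the current JVM; among them you should
// find the RDFizer MBean and one transient MBean per running job.
public class JmxBrowseSketch {
    public static void main(final String[] args) throws Exception {
        final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (final ObjectName name : server.queryNames(new ObjectName("*:*"), null)) {
            System.out.println(name);
        }
    }
}
```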
The RDFizer needs an RDBMS to store job data. Specifically, it expects two tables:
- rdfizer_job_instances: each row contains all configuration data for a specific job instance;
| Attribute | Description |
|---|---|
| id | The job identifier. |
| datafile | The datafile associated with this job instance. |
| format | The format associated with the job. Valid values at the moment are lido / marcxml |
| namespace | The prefix that will be used when generating URIs (i.e. the owning institution's namespace). |
| graph_name | The name (i.e. URI) of the graph that will be associated with the job. |
| aliada_ontology | The URI of the Aliada ontology used in the translation. |
| sparql_endpoint_uri | The URL of the SPARQL endpoint where triples will be inserted. |
| sparql_endpoint_login | The user that will be used in RDFStore connections. |
| sparql_endpoint_password | The password that will be used in RDFStore connections. |
| start_date | Job start date. |
| end_date | Job end date. |
- rdf_job_stats: once a job is completed, useful stats are collected in this table. Prior to that, runtime stats about a given job are available via JMX. Note that this is an internal detail, because the getJob REST service provides the necessary abstraction (i.e. if the job is completed the stats come from this table; if it isn't, the MBean provides that information).
| Attribute | Description |
|---|---|
| job_id | The job identifier. |
| total_records_count | The total number of records belonging to the job. |
| total_triples_produced | The total number of statements emitted by this job. |
| records_throughput | Record processing throughput in terms of records / sec. |
| triples_throughput | Triples production throughput in terms of triples / sec. |
You can find the DDL of those two tables (beside all the other needed DDL) in the project at the following path:
https://github.com/ALIADA/aliada-tool/tree/master/aliada/src/site/database
As a general note, each script name has a numeric prefix that indicates the execution order.
[1] https://github.com/ALIADA/aliada-tool/tree/master/aliada/aliada-rdfizer/src/test/resources/lidoMFAB.xml
[2] https://github.com/ALIADA/aliada-tool/tree/master/aliada/aliada-rdfizer/src/test/resources/lidoMFAB-CIDOC-CRM.rdf
[3] http://erlangen-crm.org/efrbroo/
[4] http://www.loc.gov/cds/downloads/FRBR.PDF
[5] http://camel.apache.org
[6] http://www.cidoc-crm.org/
[7] http://en.wikipedia.org/wiki/Representational_state_transfer
[8] http://nlp.stanford.edu/software/CRF-NER.shtml