Persistence - OSLC/ldp-service GitHub Wiki

Current Persistence Approach

ldp-service currently stores triples in a MongoDB database using the native n3.js format. It uses jsonld.js to parse JSON-LD, then converts the jsonld.js internal representation into n3.js triples. This was done to make use of the n3.js parsers. Sam and Steve’s approach was motivated by "Experimenting with MongoDB as an RDF Store" by Rob Vesse. MongoDB offered a simple, document-centered, affordable and accessible DBMS, and Rob's approach to supporting RDF in MongoDB seemed workable for a simple LDP implementation.

Some details on the current implementation:

The ldp-service MongoDB database is a single MongoDB collection of resources.
Since there is only one MongoDB collection, all references are within that same collection, so there is never any need for a join.
Each resource is a separate readable and writable item in the MongoDB collection.
Each LDPC is also a separate entry in the MongoDB collection; it is not a separate MongoDB collection.
Child elements (LDPRs and nested LDPCs) have a JavaScript containedBy property to link to their parent LDPC
Each element (LDPR and LDPC) has a small set of JavaScript properties and an array of triples that are the n3.js internal parsed representation of the RDF resource.
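
As a rough illustration of the layout described above, the following sketch models the single collection as a plain JavaScript array. Only the containedBy property and the array of triples come from the description. The name and interactionModel fields, the example URIs, the string-based triple shape, and the membersOf helper are assumptions made for this illustration and are not taken from the ldp-service source.

```javascript
// Hypothetical in-memory stand-in for the single MongoDB collection.
// Field names other than containedBy and triples are assumptions.
const LDP = 'http://www.w3.org/ns/ldp#';

const resources = [
  {
    name: 'http://example.org/r/',          // root LDPC
    interactionModel: LDP + 'BasicContainer',
    containedBy: null,                       // the root has no parent
    triples: []
  },
  {
    name: 'http://example.org/r/res1',      // LDPR in the root container
    interactionModel: LDP + 'RDFSource',
    containedBy: 'http://example.org/r/',
    triples: [
      {
        subject: 'http://example.org/r/res1',
        predicate: 'http://purl.org/dc/terms/title',
        object: '"First resource"'           // assumed literal encoding
      }
    ]
  },
  {
    name: 'http://example.org/r/c1/',       // nested LDPC, stored as one more entry
    interactionModel: LDP + 'BasicContainer',
    containedBy: 'http://example.org/r/',
    triples: []
  },
  {
    name: 'http://example.org/r/c1/res2',   // LDPR in the nested container
    interactionModel: LDP + 'RDFSource',
    containedBy: 'http://example.org/r/c1/',
    triples: []
  }
];

// Listing the members of a container means scanning the whole collection
// for entries whose containedBy points at it. Against MongoDB this would
// be a query on the containedBy field rather than an array filter.
function membersOf(containerUri) {
  return resources
    .filter((r) => r.containedBy === containerUri)
    .map((r) => r.name);
}

console.log(membersOf('http://example.org/r/'));
```

The sketch shows why containment needs no join, and also why every container listing costs a query over the one shared collection.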

The motivation to change

Although this persistence strategy works fine, there are some reasons to look for alternatives:

Non-RDF apps can't use simple HTTP requests to get JSON that they might find more useful.
The n3.js triples are not a standard format. Indeed, rdflib.js and jsonld.js use similar but different and incompatible internal representations, making it inconvenient to use these packages together.
There are a number of transformations that have to be done to get in-memory JSON.
Simple MongoDB queries can't be done against the LDPC collections to access members because the resource data is stored in triples, not a JSON or compacted JSON-LD tree structure more appropriate for MongoDB queries
Using a single MongoDB collection for all resources in an LDP or OSLC server might result in performance issues, especially since a MongoDB query is required to get the members of an LDPC.
There is no support for a standard query language such as SPARQL, or a query language that understands the RDF semantics of the LDPRs in the repository.

If we wanted to stick with this same architecture, but address some of these issues, we could consider changing the resource stored data format. Some possibilities might be:

Stay with current n3 data format: doesn't address the issues, but that's still an option
Use the jsonld.js data format: this would eliminate the transformation from n3.js format to/from jsonld.js format, but would introduce the need to address the missing jsonld.js parsers (jsonld.js currently only supports n-quads).
Use the rdflib.js IndexedFormula format: adds in-memory basic graph query and provides an RDF/XML parser, but still doesn't support simple MongoDB queries directly on the database. In addition, rdflib.js's use of jsonld.js for handling JSON-LD still results in an internal conversion, and it currently only supports reading, not writing, JSON-LD.
Use compacted JSON-LD source format: this can be parsed by JSON directly, is supported by jsonld.js, and would provide more convenient MongoDB queries. But it doesn't support the other required content types including RDF/XML and Turtle. Another potential issue is that unless expanded JSON-LD is used for query, there could be name collisions when doing MongoDB or sift.js queries since the JSON keys are unqualified in the compacted form. And expanded JSON-LD would not be convenient to query.
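
The name collision concern in the last option can be made concrete with a small sketch. The two documents, their contexts, and the hasProperty helper below are invented for illustration. The helper only understands contexts that map a key directly to an IRI string, which is far simpler than real JSON-LD context processing.

```javascript
// Two compacted JSON-LD documents whose contexts map the same short key
// to different IRIs. All data here is invented for illustration.
const docs = [
  {
    '@context': { title: 'http://purl.org/dc/terms/title' },
    '@id': 'http://example.org/r/req1',
    title: 'Engineer'
  },
  {
    '@context': { title: 'http://xmlns.com/foaf/0.1/title' },
    '@id': 'http://example.org/r/person1',
    title: 'Engineer'
  }
];

// A structural query on the unqualified key, in the spirit of a
// MongoDB or sift.js filter such as { title: 'Engineer' }.
const byShortKey = docs.filter((d) => d.title === 'Engineer');

// Resolving each key through the document's own context avoids the
// collision, at the cost of no longer being a simple structural query.
function hasProperty(doc, iri, value) {
  const context = doc['@context'] || {};
  return Object.keys(doc).some(
    (key) => !key.startsWith('@') && context[key] === iri && doc[key] === value
  );
}

const byIri = docs.filter((d) =>
  hasProperty(d, 'http://purl.org/dc/terms/title', 'Engineer')
);

console.log(byShortKey.length, byIri.length);
```

The unqualified filter matches both documents even though they use different properties, while the context-aware check matches only the one that uses dcterms:title.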

Requirements

The format changes above are all based on continuing to use MongoDB for persistence and only differ in the format of the JSON object that is used to represent the LDPR in the database. Since none of these is entirely satisfactory, we may wish to look for additional alternatives. In order to do so, we should start with a list of the requirements for the OSLC/LDP persistent store.

Be able to do CRUD operations on at least Linked Data resources, but also non-RDF resources
Support (hopefully standard) query on Linked Data resources
Support read and write of Turtle, JSON-LD, RDF/XML and optionally n3 content types
Be able to deploy apps using ldp-service in Bluemix or other Cloud hosting platforms
Have sufficient performance and scalability that at least simple integration apps can be quickly and easily created
Be sufficiently standard or common that swapping out a different persistence technology for production purposes would not be too difficult
Be useful for client and server side application components, i.e., can be accessed from, or resource representations can be easily used in browser-based applications.

There are possibly two key questions:

Should the persistence strategy be document/resource centered, or should it be data element centered?
What should the query interface be?

These questions are related. OSLC, LDP and HTTP are all resource, or document centered RESTful APIs. The unit of access is GET on a URI and an entity response document is returned in a client-preferred content type that can be conveniently consumed. Queries in this case are executed by the server which constructs the entity response body and returns it to the client.

In the element-centered approach, the persistent store provides direct query access on potentially fine-grained elements. For example, an RDF store or knowledge base (KB) would typically be queried using SPARQL to get result sets or to CONSTRUCT RDF graphs. This approach is more flexible because an OSLC or LDP server can be designed to construct the entity response bodies for a client request using SPARQL, while other clients might use the SPARQL endpoint directly for more open queries on the data.

Said another way: it's easier to make RDF/SPARQL support document-centered resource access than it is to make a document-centered repository support fine-grained, standard queries.
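
A minimal sketch of how document-centered access could sit on top of a SPARQL store: the server turns a requested URI into a CONSTRUCT query for the resource, or a SELECT for container members. The function names are hypothetical. The sketch assumes the store holds each resource's triples with the resource URI as subject and materializes ldp:contains triples for containers. Neither assumption comes from ldp-service.

```javascript
// Wrap a URI as a SPARQL IRI reference, rejecting characters that are
// not allowed inside <...> so a URI cannot alter the query structure.
function iriRef(uri) {
  if (/[\u0000-\u0020<>"{}|^`\\]/.test(uri)) {
    throw new Error('Invalid IRI: ' + uri);
  }
  return '<' + uri + '>';
}

// Query whose result graph could serve as the body of a GET on the resource.
function describeResourceQuery(uri) {
  const s = iriRef(uri);
  return `CONSTRUCT { ${s} ?p ?o } WHERE { ${s} ?p ?o }`;
}

// Query for the members of an LDPC, assuming ldp:contains triples exist.
function containerMembersQuery(containerUri) {
  return [
    'PREFIX ldp: <http://www.w3.org/ns/ldp#>',
    `SELECT ?member WHERE { ${iriRef(containerUri)} ldp:contains ?member }`
  ].join('\n');
}

console.log(describeResourceQuery('http://example.org/r/res1'));
console.log(containerMembersQuery('http://example.org/r/'));
```

The same endpoint that answers these server-generated queries would remain available to clients that want to run more open queries directly.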

There are two broad choices depending on what we want to do with the parsed results from an OSLC or LDP request, or from a more fine-grained query:

RDF Focus: In-memory and persistent formats leverage RDF semantics and SPARQL query consistent with OSLC and LDP
JSON Focus: In-memory and persistent formats leverage JSON and MongoDB query and sift.js consistent with JavaScript

If we consider the RDF vs JSON focus, then we're probably talking about SPARQL vs. MongoDB/sift.js. There may be other choices based on different NoSQL options, but the arguments are probably pretty similar.

Some observations on each of the requirements above:

Any database can do CRUD operations on resources. We don't really care what the stored data format is as long as the query interface is usable and the OSLC and LDP content types are supported. Since OSLC is built on LDP which is built on RDF and HTTP, that's a strong argument for an RDF-friendly persistence strategy.
Users tend to struggle with SPARQL, and IBM did its best to use Shapes to turn RDF into a traditional visual structured query language for Jazz Reporting. However, I recall some pretty complex SQL queries and resistance to SQL adoption in the past. And is MongoDB query or sift.js really that much simpler than SPARQL, especially if the LDPRs are stored as n-triples, or expanded JSON-LD in MongoDB? I suspect not when used in production apps.
As the current ldp-service demonstrates, it is possible to use MongoDB to store LDP resources in n3 triples and use n3.js to provide the required content types - except RDF/XML which is required for Jazz products and OSLC2 compatibility. So this is a gap in the current n3/MongoDB implementation.
There is currently no RDF storage service on Bluemix. There are many possible candidates including Apache Jena TDB, 4store, 5store, Stardog, Virtuoso, and many others. One interesting possibility would be to use IBM Graph, which is a Bluemix service for a Graph Data Store built on Apache TinkerPop 3 and including the Gremlin graph-specific query language. This would require some additional work to use an RDF interface to IBM Graph using something like Blazegraph or Stardog. In the meantime, Bluemix apps can be configured to access storage services offered from other Cloud service providers. Starting this way might generate the demand necessary to establish a more direct RDF Bluemix service.
There are certainly production storage services for triple stores and other NoSQL databases. Some applications have noted performance and scalability issues with Jena, and performance problems with complex SPARQL queries. There could be many reasons for this, including perhaps attempting to coerce a closed-world structured view on open-world linked data.
RDF and SPARQL are currently the only standards for a NoSQL database and query language, and SPARQL is native to RDF, the OSLC and LDP resource format.
Clients typically get result sets from data queries. These represent relatively unstructured name/value pairs that, although easy to consume, are often a semantic mismatch with the application's models, views and/or controllers. But both SPARQL and MongoDB/sift.js support additional client-side (sub) queries. For SPARQL, a CONSTRUCT query can be used to get a graph, and then RDF APIs that support in-memory basic graph queries can be used to do additional rich queries on the client side. Similarly, sift.js can be used to query tree-structured JSON.
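
To make the idea of a client-side sub-query on a constructed graph more concrete, here is a small sketch of in-memory triple pattern matching. It is a plain JavaScript stand-in written for this page, not the rdflib.js API. The match and titlesCreatedBy functions, the string-only terms, and the example data are all assumptions.

```javascript
// A small graph, as a client might hold it after a SPARQL CONSTRUCT query.
// Terms are simplified to plain strings; all data is invented.
const DCT = 'http://purl.org/dc/terms/';

const graph = [
  { subject: 'http://example.org/r/req1', predicate: DCT + 'title', object: 'Login page' },
  { subject: 'http://example.org/r/req1', predicate: DCT + 'creator', object: 'http://example.org/u/sam' },
  { subject: 'http://example.org/r/req2', predicate: DCT + 'title', object: 'Logout page' },
  { subject: 'http://example.org/r/req2', predicate: DCT + 'creator', object: 'http://example.org/u/steve' }
];

// Return the triples matching a pattern; null acts as a wildcard.
function match(triples, s, p, o) {
  return triples.filter(
    (t) =>
      (s === null || t.subject === s) &&
      (p === null || t.predicate === p) &&
      (o === null || t.object === o)
  );
}

// Sub-query joining two patterns: titles of everything a creator made.
function titlesCreatedBy(triples, creatorUri) {
  return match(triples, null, DCT + 'creator', creatorUri)
    .flatMap((t) => match(triples, t.subject, DCT + 'title', null))
    .map((t) => t.object);
}

console.log(titlesCreatedBy(graph, 'http://example.org/u/sam'));
```

The query is expressed against properties rather than document structure, so it keeps working if the same triples arrive in a differently shaped serialization.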

These observations seem to lead to the conclusion that OSLC and LDP would benefit greatly from using an RDF triple store for persistence that supported the required content types and standard SPARQL queries. This may not be true for other integration technologies, but it seems to be clearly the case for OSLC and LDP (for better or worse).

Summarizing:

RDF Focus (Leaning here)

Pros:

powerful basic graph query capability for in memory queries along with SPARQL for database queries
useful if app interaction is mostly through SPARQL queries, result sets, or constructed graphs.
rdflib.js has most of the required capabilities and has recently been getting updates and contributions
already parses RDF/XML, Turtle, and JSON-LD
it's possible to support inferencing using a suitable triple store such as Jena TDB.

Cons:

large and may not be a browser-friendly technology for local client development
RDF basic graph queries are cumbersome and unfamiliar to many developers, but are not that difficult
Currently rdflib.js can’t serialize JSON-LD
No support for processing JSON-LD other than parsing into a KB
May not be the simplest fit for OSLC/LDP/HTTP document centered APIs

JSON Focus:

Pros:

more familiar to developers
useful with MongoDB if the app interaction is mostly HTTP/LDP document centered access
could be directly queried using MongoDB if that was the persistence format too
query with sift.js (MongoDB like query language, not SPARQL-like)
Smaller, more browser friendly

Cons:

No support for native RDF queries, so cannot leverage the direct power of RDF and linked-data
Structural queries are subject to no longer working if the structure changes
jsonld.js doesn’t currently parse RDF/XML, Turtle or n-triples
Not possible to ever introduce inferencing.
APIs could not be easily built directly on a triple store

Implementation Strategy

Let's look at possible implementation strategies for the RDF and JSON approaches.

RDF Focus

There are a number of Node modules that support RDF capabilities.

jsonld.js

provides the ability to read and write JSON-LD resources
supports expansion and compaction of JSON-LD as defined in the JSON-LD Specification
is relatively small and can be used in browser-based apps
may be preferred in client-side or browser-based apps because it's JSON
supports extensible parsers, but only provides support for n-quad parsing out of the box

n3.js

Provides a fast, simple and easy to use parser and serializer for Turtle, TriG, N-Triples and N-Quads
Doesn't support any in-memory graph query
Doesn't support RDF/XML

rdflib.js

probably provides the closest RDF API to meet all the requirements
preferred by server side apps that wish to directly leverage RDF semantic and SPARQL query capabilities
supports all the content types (except the ability to write JSON-LD, but that can be added)
has the in-memory basic graph query capability that makes getting information from RDF resources easier
is still being actively developed and led by Tim Berners-Lee
clients can still get JSON-LD and use sift.js to query that if they want. It's much harder to add SPARQL to a non-RDF or pseudo-RDF store to get both options
already includes jsonld.js for parsing JSON-LD.
it's a bit bulky for browser-based clients

So rdflib.js looks like the best starting point for the RDF focus approach, based on the number of supported content types/parsers, and the ability to do in-memory basic graph queries. This reduces the gap between queries made on RDF triple stores using SPARQL, and in-memory queries on the HTTP parsed entity response bodies. Both use similar technology and RDF semantics.

To use rdflib.js for OSLC and LDP the following tasks may need to be addressed:

Migrate jsonld.js to rdflib.js internal representation so that jsonld.js can be used to directly parse and serialize JSON-LD without going through an extra in-memory data transformation of all the triples.
Alternatively (but not ideal) implement the rdflib IndexedFormula to jsonld data set transform to support a JSON-LD serializer.
Explore using rdflib.js in a browser application to ensure there are no show stoppers.
Push my change to rdflib.js to the main development branch that provides proper handling of XMLLiterals.
Migrate ldp-service to use an RDF triple store that supports a SPARQL endpoint for persistence, not MongoDB
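
The triple-to-JSON-LD transform mentioned in the tasks above could start from something like the following sketch, which groups triples by subject into expanded JSON-LD node objects. It works on a simplified, hypothetical triple shape rather than the actual rdflib.js IndexedFormula or jsonld.js data structures. It does not treat blank nodes, lists, or named graphs specially, and the toExpandedJsonLd name is invented here.

```javascript
const RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';

// Assumed triple shape: { subject, predicate, object } where object is
// { type: 'iri', value } or { type: 'literal', value, datatype?, language? }.
function toExpandedJsonLd(triples) {
  const nodes = new Map(); // subject IRI -> node object, in first-seen order
  for (const t of triples) {
    if (!nodes.has(t.subject)) {
      nodes.set(t.subject, { '@id': t.subject });
    }
    const node = nodes.get(t.subject);

    if (t.predicate === RDF_TYPE && t.object.type === 'iri') {
      // rdf:type is expressed with the @type keyword in JSON-LD.
      (node['@type'] = node['@type'] || []).push(t.object.value);
      continue;
    }

    let entry;
    if (t.object.type === 'iri') {
      entry = { '@id': t.object.value };
    } else {
      entry = { '@value': t.object.value };
      if (t.object.language) {
        entry['@language'] = t.object.language;
      } else if (t.object.datatype) {
        entry['@type'] = t.object.datatype;
      }
    }
    // Expanded form always holds property values in arrays.
    (node[t.predicate] = node[t.predicate] || []).push(entry);
  }
  return Array.from(nodes.values());
}

const example = toExpandedJsonLd([
  {
    subject: 'http://example.org/r/req1',
    predicate: RDF_TYPE,
    object: { type: 'iri', value: 'http://open-services.net/ns/rm#Requirement' }
  },
  {
    subject: 'http://example.org/r/req1',
    predicate: 'http://purl.org/dc/terms/title',
    object: { type: 'literal', value: 'Login page', language: 'en' }
  }
]);
console.log(JSON.stringify(example, null, 2));
```

A real serializer would then hand this expanded form to jsonld.js for compaction against a context, which is the step that makes the output convenient for clients.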

JSON Focus

This is perhaps a less expensive, but less capable approach since there's no support for SPARQL queries. ldp-service could continue using MongoDB for persistence, but there would be some work to support more of the OSLC and LDP requirements.

Create jsonld.js parsers for Turtle, N3 and RDF/XML.
For Turtle and N3, use n3.js to parse the source and then convert the n3 internal representation to jsonld.js internal triple data format.
For RDF/XML, fork the RDF/XML parser from rdflib.js and update it to create the jsonld.js internal triple data format.
Change the stored data format in MongoDB for LDP resources to JSON-LD instead of n3 triples. This reduces the data conversions for JSON-LD (the most common content type) and allows MongoDB query to be more easily used to query the database directly. sift.js would be used to do similar queries on in-memory JSON objects.
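
The appeal of the last step is that one filter object could serve both the database and in-memory data. The sketch below illustrates this with a deliberately tiny matcher that supports only equality on dot-separated paths. It is a stand-in written for this page and covers far less than MongoDB query or sift.js. The stored documents, field names, and the matches helper are invented.

```javascript
// Compacted JSON-LD documents as they might be stored (invented data).
const stored = [
  {
    '@id': 'http://example.org/r/req1',
    '@type': 'Requirement',
    title: 'Login page',
    creator: { '@id': 'http://example.org/u/sam', name: 'Sam' }
  },
  {
    '@id': 'http://example.org/r/req2',
    '@type': 'Requirement',
    title: 'Logout page',
    creator: { '@id': 'http://example.org/u/steve', name: 'Steve' }
  }
];

// Follow a dot-separated path into a nested object.
function getPath(doc, path) {
  return path.split('.').reduce(
    (value, key) =>
      value !== null && typeof value === 'object' ? value[key] : undefined,
    doc
  );
}

// Equality-only matcher: every path in the filter must equal its value.
function matches(doc, filter) {
  return Object.keys(filter).every(
    (path) => getPath(doc, path) === filter[path]
  );
}

// The same filter shape could be passed to a MongoDB find when the
// documents are stored this way, or applied to an in-memory array.
const filter = { '@type': 'Requirement', 'creator.name': 'Sam' };
const found = stored.filter((doc) => matches(doc, filter));

console.log(found.map((doc) => doc['@id']));
```

Note that the filter depends on the creator being embedded as a nested object. If the stored structure changes, the query stops matching, which is the structural fragility listed among the cons of the JSON focus.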

Conclusion

The evidence above would seem to clearly lead to the conclusion that an RDF persistent store that supports a SPARQL endpoint is the best fit for OSLC and LDP persistence services. Summarizing the motivation:

OSLC and LDP are built on RDF, so having a persistence architecture that has direct support for RDF and RDF semantics (including inferencing) is useful
rdflib.js supports most of the content types required by OSLC and LDP; only some additional work for a JSON-LD serializer is required.
An RDF KB can be used to support document centered, or fine-grained queries.
RDF and SPARQL are standards supported by a number of storage implementations (including popular property graph databases) with good APIs in most languages.
Using RDF and SPARQL standards will facilitate integration between applications, and provide semantically rich queries on fine-grained resources.

The risk with taking this approach is that RDF triple stores and the SPARQL query language have some issues and barriers to adoption that have possibly led to other, more document-centered or property-graph-centered NoSQL solutions. These more fashionable and popular approaches may limit the relevance of OSLC for future integration initiatives. But these other approaches are non-standard, and may suffer from some of the same QoS, scalability and usability concerns as well as create dependencies on rapidly changing, non-standard solutions. These may create greater integration risks than the performance and usability issues of RDF and SPARQL. See "How RDF Databases Differ from Other NoSQL Solutions". This is a bit dated, but provides some interesting insight.

In any case, OSLC and LDP are built on RDF, and any effort to support RDF resources in a non-RDF database would seem to add unnecessary risk and potential compatibility issues. An OSLC or LDP client should have no idea how a server persists its resources. Persistence should be private, hidden, and substitutable, focusing on delivering the required QoS and not on technologies.