Design decisions - renepickhardt/related-work.net GitHub Wiki

general structure:

  • we need our own URIs (prefix rw:)
  • URIs will be speaking (human-readable) and generated by some function from the "name" property of a node (title for a paper, name for an author, name for a conference)
  • edge URIs are derived from the Neo4j edge type
  • if other URIs (or DOIs) are known, we store them internally as properties, but in the RDF output we of course link the data!
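As a sketch, the URI generation from the "name" property could look like the following; the slug rules and the `UriGenerator`/`uriFor` names are assumptions for illustration, not the project's actual code.

```java
// Hypothetical sketch: generating a speaking rw: URI from a node's
// "name"/"title" property. All names and the slug rules are assumptions.
public class UriGenerator {

    // Lowercase the name, collapse runs of non-alphanumeric characters
    // into a single dash, and trim leading/trailing dashes.
    static String slugify(String name) {
        return name.toLowerCase()
                   .replaceAll("[^a-z0-9]+", "-")
                   .replaceAll("^-+|-+$", "");
    }

    // Build the full URI: rw: prefix, node type, slug of the name.
    static String uriFor(String nodeType, String name) {
        return "rw:" + nodeType + "/" + slugify(name);
    }

    public static void main(String[] args) {
        System.out.println(uriFor("paper", "Design Decisions, revisited!"));
        // rw:paper/design-decisions-revisited
    }
}
```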

transfer of data and queries

when loading a page (which is basically a node in the graph) we will always do one traversal, collect everything we need, put the properties into a transferable container object, and that's it! on the client side this object can sort the results (by different metrics) and render HTML.

Especially when a page is shown we only show top-k results for each box. Even so, all content is transferred via RPC in ONE query! (see more under client-side caching)
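The one-traversal-per-page idea could be sketched like this; `PageContainer`, its fields, and the metric are hypothetical names, not the project's actual classes.

```java
import java.io.Serializable;
import java.util.*;

// Hypothetical sketch of the transferable container object: one server-side
// traversal fills it, the client sorts each box by some metric and keeps
// only the top-k entries for display.
public class PageContainer implements Serializable {

    // One entry per neighbouring node, e.g. a cited paper or a coauthor.
    public static class Item implements Serializable {
        public final String uri;
        public final Map<String, Object> properties;
        public Item(String uri, Map<String, Object> properties) {
            this.uri = uri;
            this.properties = properties;
        }
    }

    // box name (e.g. "citations", "authors") -> all items collected
    // during the single traversal
    public final Map<String, List<Item>> boxes = new HashMap<>();

    public void add(String box, Item item) {
        boxes.computeIfAbsent(box, b -> new ArrayList<>()).add(item);
    }

    // Client-side: sort one box by an arbitrary metric and keep top-k.
    public List<Item> topK(String box, Comparator<Item> metric, int k) {
        List<Item> items = new ArrayList<>(boxes.getOrDefault(box, List.of()));
        items.sort(metric);
        return items.subList(0, Math.min(k, items.size()));
    }
}
```

Since the container is serializable, the same object can travel over RPC and later be written to a cache or to disk.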

Data Mining

to make queries faster we need caching in the graph, e.g. via cron jobs that insert edges which were not there before. while doing so, those edges should have a certain edge type, e.g. "rw:dm:...". this way it is easy to remove and recalculate them. here rw: is our usual prefix and dm: stands for data mining. this can for example be used to precompute the coauthor graph.
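A minimal sketch of that naming convention, assuming a hypothetical `EdgeTypes` helper: cleanup and recalculation can select data-mining edges purely by their prefix.

```java
// Hypothetical sketch of marking data-mining edges with the "rw:dm:"
// prefix so they can be removed and recalculated in bulk.
public class EdgeTypes {
    static final String PREFIX = "rw:";
    static final String DM = PREFIX + "dm:";

    // e.g. dataMining("coauthor") -> "rw:dm:coauthor"
    static String dataMining(String name) {
        return DM + name;
    }

    // A cleanup job only has to test this predicate on each edge.
    static boolean isDataMiningEdge(String edgeType) {
        return edgeType.startsWith(DM);
    }
}
```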

really important: every time a page is requested one can check when the last data mining for this page took place. If it is too far in the past one can push the node to a toDoDataMiningQueue(). The elements of the queue are consumed by a continuously running server-side script that recalculates everything in the neighbourhood of the node. One can add randomness to the process of pushing to the queue, and also use metrics other than time. In any case, since data mining queries are always local, this should work as a strategy.
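The queue strategy above might be sketched as follows; `DataMiningScheduler`, the staleness threshold, and the small random refresh probability are all assumptions for illustration.

```java
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the toDoDataMiningQueue() idea: on every page
// request, decide from the time of the last data-mining run (plus some
// randomness) whether to enqueue the node for recalculation.
public class DataMiningScheduler {
    private final Queue<String> toDoDataMiningQueue = new ConcurrentLinkedQueue<>();
    private final long maxAgeMillis;
    private final Random random;

    public DataMiningScheduler(long maxAgeMillis, Random random) {
        this.maxAgeMillis = maxAgeMillis;
        this.random = random;
    }

    // Called on each page request with the node's URI and the timestamp
    // of its last data-mining run.
    public void onPageRequest(String nodeUri, long lastMinedMillis, long nowMillis) {
        long age = nowMillis - lastMinedMillis;
        // Always enqueue stale nodes; enqueue fresh ones with a small
        // probability so frequently viewed pages still get refreshed.
        if (age > maxAgeMillis || random.nextDouble() < 0.01) {
            toDoDataMiningQueue.add(nodeUri);
        }
    }

    // Consumed by the ever-running server-side worker.
    public String poll() {
        return toDoDataMiningQueue.poll();
    }
}
```

Because the recalculation is always local to the node's neighbourhood, the worker can drain the queue one node at a time without global locks.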

Caching

serverside

for pages with a really high PageRank or which are retrieved often, we put the query results into a query result cache! (this could also be a document store or something similar; since query results are serializable they can even be written to disk!) the same holds true for pages that are loaded often. we need metrics for this, but these kinds of things should be included right away to guarantee speed, a good user experience, and of course web scale.
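A minimal sketch of such a query result cache, here as a simple bounded LRU map keyed by page URI; the capacity and the plain LRU eviction are assumptions — a real version might weight entries by PageRank or request frequency instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a server-side query-result cache: a bounded
// LRU map from page URI to the (serializable) query result.
public class QueryResultCache<V> {
    private final Map<String, V> cache;

    public QueryResultCache(int capacity) {
        // accessOrder=true turns the LinkedHashMap into an LRU structure;
        // removeEldestEntry evicts once the capacity is exceeded.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public synchronized V get(String uri) { return cache.get(uri); }
    public synchronized void put(String uri, V result) { cache.put(uri, result); }
    public synchronized int size() { return cache.size(); }
}
```

Since the cached values are serializable, the same structure could be backed by a document store or spilled to disk.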

clientside

HTML strings of URIs can be put in a client cache, so if the user returns to a page it can be fetched from the browser cache via JavaScript.

in order to save space one could instead cache just the transferable containers.

consider that HTML5 even enables storing those things persistently.

caching strategy

Even though these ideas are nice, we need a more sophisticated caching strategy. In particular we need to know how long to hold things in cache, and also how the server can tell clients that locally cached objects have been updated (server push?).

sameAs question

it is really hard to decide whether sameAs edges should be included (queries would become more complex), or whether nodes should really be merged in the backend, keeping some backup of the old ones or different RDF serializations!

after discussion it will be solved as follows:

  • we create a new node with one of the existing URIs (the ref node!)
  • we attach all edges from the old nodes to the new node
  • old edges are exchanged for edges that are not followed (the old nodes will continue to exist, for history reasons and reverts)
  • properties will be copied to the new node
  • in particular we have a searchable property (which consists of the name (and the names of the sameAs nodes...))
  • old URIs will redirect to the new one
  • original URIs will be stored as properties.
  • the node that corresponds to the ref node will change its URI to outdated and will not be indexable anymore.
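The steps above can be sketched on a toy in-memory graph model; all class and property names here (`Node`, `searchable`, `originalUris`, the `:outdated` suffix, the redirect table) are assumptions for illustration, not the actual backend.

```java
import java.util.*;

// Hypothetical toy sketch of the sameAs merge: a new ref node reuses one
// of the existing URIs, edges are re-attached, and the old nodes are kept
// for history/reverts but taken out of the index.
public class SameAsMerge {

    public static class Node {
        public String uri;
        public final Map<String, Object> properties = new HashMap<>();
        public final List<Node> edges = new ArrayList<>();    // followed edges
        public final List<Node> oldEdges = new ArrayList<>(); // kept, not followed
        public boolean indexable = true;
        public Node(String uri) { this.uri = uri; }
    }

    // redirect table: old URI -> new URI (an assumption for the sketch)
    public static final Map<String, String> redirects = new HashMap<>();

    // Merge the given sameAs nodes into a new ref node that reuses the
    // first node's URI.
    public static Node merge(List<Node> sameAsNodes) {
        Node ref = new Node(sameAsNodes.get(0).uri);
        List<String> names = new ArrayList<>();
        List<String> originalUris = new ArrayList<>();

        for (Node old : sameAsNodes) {
            // attach all edges of the old node to the new node
            ref.edges.addAll(old.edges);
            // exchange the old node's edges for not-followed ones
            old.oldEdges.addAll(old.edges);
            old.edges.clear();
            // copy properties to the new node
            old.properties.forEach(ref.properties::putIfAbsent);
            Object name = old.properties.get("name");
            if (name != null) names.add(name.toString());
            originalUris.add(old.uri);
            // old URIs redirect to the new one; old nodes leave the index
            redirects.put(old.uri, ref.uri);
            old.indexable = false;
        }
        // the node whose URI the ref node reuses changes its URI to outdated
        sameAsNodes.get(0).uri = ref.uri + ":outdated";
        // searchable property: the name plus the names of all sameAs nodes
        ref.properties.put("searchable", String.join(" ", names));
        ref.properties.put("originalUris", originalUris);
        return ref;
    }
}
```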

Database queries and connections

if using Neo4j, first benchmarks show the following: using the Java API is about 25% faster than using the traversal framework and outperforms the Cypher query language by a factor of 8 to 9. We also realized that queries are evaluated fast when using EmbeddedReadOnlyGraphDatabase. Therefore it might make sense to maintain two database connections: one just for reading and one for writing! A question that still needs to be evaluated: will the read-only instance be able to see data that was written by the other connection? My guess would be yes.