Future Work - accandme/openplum GitHub Wiki

We do not consider this work finished. Besides addressing the limitations described in the Limitations section, there are several features that could be added to or optimized in our system. We list the major ones below.

Extending the Query Graph

The query graphs that we currently build are basic: they encode no information about the query other than its equijoin conditions. While such conditions are the most common in everyday database querying, other frequently used conditions are not yet taken into account.

The simplest of these are unary qualifier conditions in the WHERE clause, i.e. conditions pertaining to only one relation. They can greatly reduce the amount of data we ship between nodes, and would not be hard to add to our current system. We envision attaching them to the graph vertices, so that the GraphProcessor algorithm can take advantage of them.
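As an illustration of this idea, the following Python sketch attaches unary predicates to the vertices of a toy query graph and uses them to restrict the rows a vertex would ship. It is only a sketch under assumptions: the `Vertex` and `QueryGraph` classes, their methods, and the example tables are hypothetical and do not reflect the actual openplum data structures or the GraphProcessor interface.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One relation in the query graph (hypothetical structure)."""
    relation: str
    # Unary WHERE predicates that mention only this relation, kept as SQL text.
    filters: list = field(default_factory=list)

    def shipping_query(self) -> str:
        """Build the query selecting the rows this vertex would ship."""
        sql = f"SELECT * FROM {self.relation}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.filters)
        return sql


@dataclass
class QueryGraph:
    """Vertices are relations; edges are equijoin conditions."""
    vertices: dict = field(default_factory=dict)
    # Each edge is (left_relation, left_column, right_relation, right_column).
    edges: list = field(default_factory=list)

    def add_relation(self, relation: str) -> Vertex:
        return self.vertices.setdefault(relation, Vertex(relation))

    def add_equijoin(self, lrel: str, lcol: str, rrel: str, rcol: str) -> None:
        self.add_relation(lrel)
        self.add_relation(rrel)
        self.edges.append((lrel, lcol, rrel, rcol))

    def add_filter(self, relation: str, predicate: str) -> None:
        """Attach a unary predicate to the vertex of its relation."""
        self.add_relation(relation).filters.append(predicate)


# Hypothetical query:
#   SELECT ... FROM orders, customers
#   WHERE orders.customer_id = customers.id AND orders.total > 100
graph = QueryGraph()
graph.add_equijoin("orders", "customer_id", "customers", "id")
graph.add_filter("orders", "total > 100")

filtered = graph.vertices["orders"].shipping_query()
unfiltered = graph.vertices["customers"].shipping_query()
print(filtered)
print(unfiltered)
```

Keeping the predicates on the vertex, rather than on the edges, means a graph-processing step could apply them before any data leaves the node that holds the relation.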

Moreover, there are all the joins that are not equijoins. These, however, would be more challenging to handle.

Integration with MADlib

Our existing system provides a clean way to execute aggregate functions on the database. All that is required is to split each non-distributed aggregate function into two flavors, the intermediate and the final, as mentioned above. MADlib provides a plethora of aggregate functions and includes everything needed to build both flavors: in particular, a merge function that combines two states of an aggregate into one, which serves as the state transition function for the final flavor of the aggregate. For this reason, we believe our current system can easily be extended to support MADlib.
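The split into intermediate and final flavors can be sketched in Python using AVG as the example. This is a minimal illustration under assumptions, not MADlib or openplum code: all function names are hypothetical, and the aggregate state is assumed to be a (sum, count) pair.

```python
from functools import reduce

# Assumed aggregate state for AVG: a (sum, count) pair.
EMPTY_STATE = (0, 0)


def avg_transition(state, value):
    """Ordinary state transition: fold one input value into the state."""
    total, count = state
    return (total + value, count + 1)


def avg_merge(left, right):
    """Merge two partial states into one state."""
    return (left[0] + right[0], left[1] + right[1])


def avg_finalize(state):
    """Turn a state into the aggregate result (None for empty input)."""
    total, count = state
    return total / count if count else None


def intermediate_avg(rows):
    """Intermediate flavor: run on each worker, return the raw state."""
    return reduce(avg_transition, rows, EMPTY_STATE)


def final_avg(states):
    """Final flavor: the merge function acts as the state transition
    over the workers' states, followed by the usual final step."""
    return avg_finalize(reduce(avg_merge, states, EMPTY_STATE))


# Hypothetical data held by three workers (the last one holds no rows).
worker_rows = [[1, 2, 3], [4, 5], []]
partial_states = [intermediate_avg(rows) for rows in worker_rows]
distributed_result = final_avg(partial_states)
print(distributed_result)
```

The result equals the average computed over all rows on a single node, because merging (sum, count) pairs is associative and commutative, so the order in which worker states arrive does not matter.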

Integration with PostgreSQL

Many limitations of our system stem from the fact that it operates the master node remotely, which also results in suboptimal performance. An excellent continuation of this project would be to port it to C and integrate it into PostgreSQL, specifically on the master node. This would be advantageous for many reasons, the most important of which are listed below:

  • the master node will gain access to the catalog of the database, and will be able to make more efficient decisions about graph processing;
  • the system will be able to make use of PostgreSQL's built-in parser;
  • the system will be “closer to the data” and most likely closer to the worker nodes as well;
  • clients currently using a simple non-distributed version of PostgreSQL will be able to seamlessly move to our distributed version of the system.