Introduction - accandme/openplum GitHub Wiki

Large-scale data analytics has been the topic of intense commercial interest as well as academic research for many years. The growth of storage systems beyond a single machine have rendered traditional relational database management systems unusable for data storage for the purpose of easy querying. Moreover, the difficulty of running SQL queries in a distributed fashion has led to the relaxation on the requirements of data management and emergence of the NoSQL paradigm, where data is not stored in relations and where traditional SQL cannot be used for data manipulation. While it is true that NoSQL databases are highly optimized for retrieval and appending operations, they provide little functionality beyond that. The common approach to doing analytics on data stored in a NoSQL database is therefore to extract the necessary portions of the data and manually run them through a statistical package. It is needless to say that this approach is time-consuming and error-prone.

While plain SQL does not provide rich analytical capabilities itself, it does provide useful data manipulation techniques that can be applied to do at least some analytics in-database. Support for user-defined aggregate functions is also on the rise among modern RDBMSs, expanding the boundaries of analytics that can be performed by using SQL queries. Ability to create user-defined aggregate functions is readily supported by PostgreSQL, Oracle, and SQL Server, among possibly others, and PostgreSQL even has a full-fledged open-source data analytics library available, called MADlib. Despite these advances in RDBMS capabilities, relational databases still struggle in growing beyond a single node. In this project, we attempt to build a distributed query engine that can execute any arbitrary SQL query on a database partitioned on a set of worker RDBMS nodes. Our focus is to use each worker node’s capabilities to its fullest, doing as much processing in parallel as possible, while at the same time taking into account the fact that the query may contain aggregates and nested queries, and that the underlying data partitioning may be arbitrary. Our system is implemented as a Java application using PostgreSQL worker nodes.

In the Design section, we discuss how we go about executing an arbitrary query and describe our algorithms. In the Implementation section, we give the implementation details of our ideas and present our pipeline of query execution. In the Evaluation section, we give details on how well our system performs.