# Evaluation

To evaluate our system, we used the TPC-H schema introduced above, where the relations Customer, Orders, and LineItem were partitioned equally across all nodes, and the remaining relations were stored in full on a single node. Our system, however, does not know how the data is partitioned, so execution proceeds exactly as it would if all relations were partitioned arbitrarily. We evaluated our system at scale factors 0.001 and 0.01 of the TPC-H benchmark database, on all benchmark queries currently supported by our system (12 out of 22). Because we did not have access to a networked set of PostgreSQL nodes, we ran the evaluation on a single PostgreSQL server with 8 databases, each storing one partition of the data. The machine is a Toshiba Tecra laptop with an Intel(R) Core(TM)2 Duo CPU T9400 running at 2.53 GHz and 3.8 GiB of RAM. We used PostgreSQL 9.0.

| Query | Scale factor 0.001 | Scale factor 0.01 |
|------:|-------------------:|------------------:|
| 1     | 1.544 s            | 0.928 s           |
| 3     | 7.034 s            | 12.800 s          |
| 5     | 14.795 s           | 18.825 s          |
| 6     | 0.818 s            | 0.326 s           |
| 7     | 15.989 s           | 35.661 s          |
| 8     | 17.593 s           | 39.225 s          |
| 9     | 15.470 s           | 31.959 s          |
| 10    | 9.813 s            | 34.362 s          |
| 11    | 0.772 s            | 1.039 s           |
| 13    | 3.016 s            | 6.161 s           |
| 14    | 2.567 s            | 6.910 s           |
| 18    | 1.988 s            | 7.151 s           |

Table 1. Evaluation results.

As can be seen in the table, for some queries the execution time on the smaller dataset exceeds that on the larger dataset. This happens because of nondeterminism in our code: the GraphProcessor can generate different distributed query plans for the same query, due to the way we currently pick a connected vertex to be the destination of the SuperDuper step. As mentioned earlier, we currently pick the first vertex we find, which is effectively random because the iteration order of Java's HashSet is unspecified. In the future, the picker functions should choose more intelligently (e.g., according to a specific heuristic), which would make the distributed query plan deterministic.
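The difference between the two picking strategies can be sketched as follows. This is a minimal illustration, not the actual GraphProcessor code: the `Vertex` class, its `name` field, and the heuristic (smallest name) are assumptions made for the example. The key point is that a vertex type without a `hashCode` override gets a JVM-assigned identity hash, so `HashSet` iteration order (and hence the "first" vertex) can differ between runs, while a comparator-based pick is always reproducible.

```java
import java.util.*;

public class PickerDemo {
    // Hypothetical stand-in for the GraphProcessor's vertex type.
    // No hashCode override, so HashSet placement depends on the
    // JVM-assigned identity hash, which can vary from run to run.
    static class Vertex {
        final String name;
        Vertex(String name) { this.name = name; }
    }

    // Current behavior as described: take whatever vertex the HashSet yields first.
    static Vertex pickFirst(Set<Vertex> connected) {
        return connected.iterator().next();
    }

    // A deterministic alternative: pick by a fixed heuristic (here, the
    // lexicographically smallest name), so the same query always yields
    // the same distributed plan.
    static Vertex pickByHeuristic(Set<Vertex> connected) {
        return Collections.min(connected, Comparator.comparing((Vertex v) -> v.name));
    }

    public static void main(String[] args) {
        Set<Vertex> connected = new HashSet<>(List.of(
                new Vertex("Orders"), new Vertex("LineItem"), new Vertex("Customer")));
        System.out.println("first in set: " + pickFirst(connected).name);       // may vary per run
        System.out.println("by heuristic: " + pickByHeuristic(connected).name); // always Customer
    }
}
```

With a heuristic picker (for example, preferring the vertex with the smallest estimated intermediate result), repeated runs of the same query would produce identical plans, and the anomalies in the table above would disappear.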