PDQ sub projects - ProofDrivenQuerying/pdq GitHub Wiki

PDQ is composed of 8 sub-projects:

    common           cost       datasources        gui           planner      reasoning     regression      runtime   

Overview

  • the common and datasources sub-projects can only be used as libraries
  • the cost sub-project has functionality for estimating the cost of a plan
  • the gui sub-project runs the PDQ Graphical User Interface
  • the planner sub-project takes as input a schema and a query and produces as output a plan
  • the reasoning sub-project has two modalities:
    • it takes as input a schema and an external list of facts, and produces as output the chase of the input program and data
    • it takes as input a schema, some facts and a query, and produces as output the certain answers of the query
  • the regression sub-project takes as input a folder and a mode (planner, runtime or end-to-end), passed with the -m argument. It executes the corresponding module or modules and produces their output
  • the runtime sub-project takes as input a schema, a plan and executable access method descriptors, and runs the plan, producing as output the answers to the query

Common

The common sub-project features the packages and classes used across all of the other PDQ sub-projects. It features high-level interfaces and basic code for handling FOL formulas, and basic database objects like tuples and relations. It contains the Java representation of PDQ's plan language, an extension of relational algebra with support for access methods.
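To make the idea of an access method concrete, here is a minimal sketch, assuming a relation held in memory as lists of strings. The class `AccessMethodSketch` and its methods are hypothetical and are not PDQ's own classes; the sketch only shows the general notion that an access method fixes which attribute positions must be supplied as inputs before tuples can be retrieved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; PDQ's own plan and access-method classes are not shown here.
class AccessMethodSketch {
    private final List<List<String>> relation; // tuples of the underlying relation
    private final int[] inputPositions;        // attribute positions that must be bound

    AccessMethodSketch(List<List<String>> relation, int... inputPositions) {
        this.relation = relation;
        this.inputPositions = inputPositions;
    }

    // Returns the tuples whose input positions carry exactly the supplied values.
    List<List<String>> access(String... inputValues) {
        if (inputValues.length != inputPositions.length) {
            throw new IllegalArgumentException("expected " + inputPositions.length + " input values");
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> tuple : relation) {
            boolean matches = true;
            for (int i = 0; i < inputPositions.length; i++) {
                if (!tuple.get(inputPositions[i]).equals(inputValues[i])) {
                    matches = false;
                }
            }
            if (matches) {
                result.add(tuple);
            }
        }
        return result;
    }
}
```

An access method with no input positions behaves like a plain scan of the relation, while one with inputs can only be called once those values are known, for instance from the result of an earlier access in the plan.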

The most involved part of the common subproject is the data management infrastructure used internally in PDQ for reasoning. This includes a simple database manager that implements a fragment of SQL, and an interface to an external relational database management system.

Cost

The cost sub-project features the packages and classes for plan cost estimation. Many of the cost functions make use of statistical information, with some simple statistics being recorded in a catalog.properties file. The main entry point takes as input a plan.xml file (in the format described here). The type of estimation algorithm used and the name of the file holding the catalog information are passed within a parameters file.
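As a rough illustration of how recorded statistics can feed a cost function, the sketch below reads cardinalities from properties-style text and sums them. Everything in it is an assumption: the `cardinality.<relation>` key layout is invented for this example and is not PDQ's actual catalog.properties format, and the "cost" is a toy measure (tuples read by scanning each relation once), not one of PDQ's estimators.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: neither PDQ's cost API nor its catalog format.
class CatalogCostSketch {
    private final Properties catalog = new Properties();

    CatalogCostSketch(String catalogText) {
        try {
            catalog.load(new StringReader(catalogText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the recorded cardinality of a relation; unknown relations use the default.
    long cardinality(String relation, long defaultValue) {
        String value = catalog.getProperty("cardinality." + relation);
        return value == null ? defaultValue : Long.parseLong(value.trim());
    }

    // Toy cost: total number of tuples read if every listed relation is scanned once.
    long scanCost(List<String> relations) {
        long total = 0;
        for (String relation : relations) {
            total += cardinality(relation, 1000L);
        }
        return total;
    }
}
```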

Datasources

The datasources sub-project features both metadata discovery tools and classes representing descriptions of access method implementations, via web services and DBMS connector libraries.

One use of the datasources package is to read in metadata from a DBMS, populating a schema object, which can be serialized to a schema.xml file or used within the application for planning, reasoning, or runtime. Only Postgres and MySQL are supported for such extraction. An example of the use of the API to extract schema information is:

public PostgresqlSchemaDiscoveryTest() throws BuilderException {
    // Connection settings for the database whose metadata is read.
    Properties properties = new Properties();
    properties.put("url", "jdbc:postgresql://localhost/");
    properties.put("database", "tpch");
    properties.put("username", "postgres");
    properties.put("password", "root");
    properties.put("driver", "org.postgresql.Driver");
    // Build a schema object from the metadata exposed by the DBMS.
    PostgresqlSchemaDiscoverer disco = new PostgresqlSchemaDiscoverer();
    disco.setProperties(properties);
    this.schema = disco.discover();
}

There is currently no code for building catalog objects with cardinality information (or a catalog.properties file) -- this has to be created offline.

For web services, there is no code that processes standard descriptions of services. However, the datasources sub-project contains code that processes our own metadata about web services and their association with access method identifiers.

A more important piece of functionality in the datasources sub-project concerns storing access information about a DBMS or a web service, which is used by the runtime sub-project to implement primitive access commands. An example of a web service access description can be found here. The datasources sub-project has the code that parses these descriptions, creating corresponding objects, which are used by the runtime.

GUI

The gui sub-project contains the PDQ Graphical User Interface, a Java console app. There is also a web-based GUI, and you can find a demo of it here.

Planner

The planner sub-project features classes and algorithms for planning purposes. The project implements a number of planning algorithms. All of these algorithms make use of the functions from the reasoning sub-project to generate consequences of the input query and to check whether a plan is logically equivalent to the original query. They also make use of the cost estimation functions in the cost sub-project to estimate the cost of a plan.

Reasoning

The reasoning sub-project features classes and algorithms for reasoning purposes, particularly getting the answers to a conjunctive query that are entailed by a dataset and constraints. Currently, the only constraints supported are tuple-generating dependencies, and the only implemented reasoning algorithm is the chase. Reasoning makes use of the common sub-project, and in particular depends on the reasoning database infrastructure. Reasoning can be done on top of either the home-made main-memory database or an external RDBMS.
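For readers unfamiliar with the chase, the following is a minimal sketch of a restricted chase over tuple-generating dependencies whose body and head are each a single atom. It is an illustration under stated assumptions, not PDQ's implementation: the class `ChaseSketch`, the encoding of atoms as string lists (predicate first, variables prefixed with `?`), the `_null` naming of fresh values, and the round limit are all hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names come from the PDQ code base.
class ChaseSketch {
    private int nullCounter = 0;

    // Tries to extend 'binding' so that 'pattern' equals 'fact'; returns null on mismatch.
    static Map<String, String> match(List<String> pattern, List<String> fact, Map<String, String> binding) {
        if (pattern.size() != fact.size() || !pattern.get(0).equals(fact.get(0))) return null;
        Map<String, String> result = new HashMap<>(binding);
        for (int i = 1; i < pattern.size(); i++) {
            String term = pattern.get(i);
            if (term.startsWith("?")) {
                String bound = result.putIfAbsent(term, fact.get(i));
                if (bound != null && !bound.equals(fact.get(i))) return null;
            } else if (!term.equals(fact.get(i))) {
                return null;
            }
        }
        return result;
    }

    // Chases 'facts' with TGDs given as {body, head} pairs, for at most maxRounds rounds.
    Set<List<String>> chase(Set<List<String>> facts, List<List<List<String>>> tgds, int maxRounds) {
        Set<List<String>> result = new LinkedHashSet<>(facts);
        boolean changed = true;
        for (int round = 0; changed && round < maxRounds; round++) {
            changed = false;
            for (List<List<String>> tgd : tgds) {
                List<String> body = tgd.get(0), head = tgd.get(1);
                for (List<String> fact : new ArrayList<>(result)) {
                    Map<String, String> binding = match(body, fact, new HashMap<>());
                    // Fire the rule only if the body matches and the head is not yet satisfied.
                    if (binding == null || headSatisfied(head, binding, result)) continue;
                    result.add(instantiate(head, binding));
                    changed = true;
                }
            }
        }
        return result;
    }

    private static boolean headSatisfied(List<String> head, Map<String, String> binding, Set<List<String>> facts) {
        for (List<String> fact : facts) {
            if (match(head, fact, binding) != null) return true;
        }
        return false;
    }

    // Builds the head atom, inventing a fresh labelled null for each existential variable.
    private List<String> instantiate(List<String> head, Map<String, String> binding) {
        Map<String, String> full = new HashMap<>(binding);
        List<String> atom = new ArrayList<>();
        atom.add(head.get(0));
        for (int i = 1; i < head.size(); i++) {
            String term = head.get(i);
            atom.add(term.startsWith("?") ? full.computeIfAbsent(term, v -> "_null" + (++nullCounter)) : term);
        }
        return atom;
    }
}
```

With the facts Employee(alice, sales) and Employee(bob, sales) and the dependency Employee(?x, ?d) → Manages(?m, ?d), the sketch adds a single Manages fact with a fresh null for the manager: the second employee does not trigger the rule again because the head is already satisfied. The round limit is there because the chase need not terminate for arbitrary dependencies.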

Regression

The regression sub-project features the packages and classes for regression testing of the code. The particular sub-project or sub-projects being tested (planner, runtime, or end-to-end, meaning planner followed by runtime) are specified using a command line argument. The input files will be the inputs appropriate to that sub-project (e.g. a plan XML file for the runtime, which must be called expected-plan.xml). Regression is applied to a folder which can contain multiple test cases. The regression tester will descend into the folder and test all cases. For each case, output files will be generated when appropriate (see below) and information about whether the test passes or fails is sent to standard output.

For planner, in addition to the usual inputs to the planner (e.g. a query) there is also an expected-plan.xml file, which describes the output that is expected. The actual output of the planner will be compared with the expected output via some acceptance criterion. The criteria currently are based on the cost function being used in the planner: the cost function is applied to the output plan and also to the prior plan in expected-plan.xml. The comparison will require either an exact match of the cost, or an "approximate match" (within one order of magnitude). Which comparison is used currently depends on which optimizations are enabled in the PDQ planner.
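The two acceptance criteria can be sketched as follows. This is a hedged illustration rather than the regression code itself: the class and method names are hypothetical, and "within one order of magnitude" is interpreted here as the ratio of the two costs lying between 1/10 and 10, which is an assumption about the intended meaning.

```java
// Hypothetical sketch of cost-based acceptance; not the PDQ regression classes.
class PlanCostAcceptance {

    // Exact criterion: the two costs must be identical.
    static boolean exactMatch(double actualCost, double expectedCost) {
        return Double.compare(actualCost, expectedCost) == 0;
    }

    // Approximate criterion (assumed reading): costs differ by at most a factor of ten.
    static boolean approximateMatch(double actualCost, double expectedCost) {
        if (actualCost <= 0 || expectedCost <= 0) {
            // A ratio is not meaningful for non-positive costs; fall back to equality.
            return actualCost == expectedCost;
        }
        double ratio = actualCost / expectedCost;
        return ratio >= 0.1 && ratio <= 10.0;
    }
}
```

Under this reading, a plan costing 50 is accepted against an expected cost of 10 by the approximate criterion but rejected by the exact one, while a plan costing 200 is rejected by both.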

For runtime, there is no expected output file, only an expected cardinality, which is given in the parameters file (case.properties).

Running regression in planner mode overwrites the expected-plan.xml file. Running regression in runtime mode does not record the actual plan output at all, since the output is not compared exactly. As with planner regression, there is information about whether the tests pass, and this is sent to standard output.

A run of regression produces a summary file which encodes the time of the run, placed in a subdirectory TestResults/current relative to the directory in which the application is run.

Runtime

The runtime sub-project features packages and classes to support runtime execution of plans. This can be used in conjunction with the planner (that is, one creates a plan from the planner and then runs it with the runtime), or standalone. The runtime implements the main operators of the plan language, such as joins. For accessing data it implements the descriptions of access methods, where the specification of those descriptions is in the datasources package. For example, if a datasource has an access method associated with a web service, a datasource description describes how to generate the associated URL; the runtime code will make a call to that URL and process the resulting output, converting it into tuples.
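The web service case can be sketched in two steps: building the URL from the access method's input values, and turning the response into tuples. The example below is only a sketch under assumptions: the `{name}` placeholder syntax, the class `WebAccessSketch`, the example.org address and the comma-separated response format are all hypothetical, and the HTTP call itself is left out so that the example stays self-contained.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; PDQ's actual access descriptions and response parsing differ.
class WebAccessSketch {

    // Fills each {name} placeholder in the template with the URL-encoded input value.
    static String buildUrl(String template, Map<String, String> inputs) {
        String url = template;
        for (Map.Entry<String, String> input : inputs.entrySet()) {
            String encoded = URLEncoder.encode(input.getValue(), StandardCharsets.UTF_8);
            url = url.replace("{" + input.getKey() + "}", encoded);
        }
        return url;
    }

    // Converts a response body with one comma-separated record per line into tuples.
    static List<List<String>> toTuples(String responseBody) {
        List<List<String>> tuples = new ArrayList<>();
        for (String line : responseBody.split("\\R")) {
            if (line.isBlank()) continue;
            List<String> tuple = new ArrayList<>();
            for (String field : line.split(",", -1)) {
                tuple.add(field.trim());
            }
            tuples.add(tuple);
        }
        return tuples;
    }
}
```

In a full runtime the string returned by `buildUrl` would be fetched over HTTP and the response body handed to a parser like `toTuples`; real services typically return XML or JSON rather than comma-separated lines.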