Home - ProofDrivenQuerying/pdq GitHub Wiki

PDQ logo

Proof-Driven Query Planning

What is PDQ?

PDQ (Proof-Driven Querying) is a platform for dealing with a number of problems involving reasoning and data access. For example, it can be used for generating query plans over semantically-interconnected data sources with diverse access interfaces.

What is PDQ for?

PDQ unifies a number of application scenarios involving querying in the presence of semantic information, where the semantic information can be integrity constraints on the data, or information about the relationship of stored data to accessible data.

One application of PDQ is to a class of plan synthesis problems. For example, PDQ can provide a solution for querying data available through Web-based APIs. For a given query there may be many Web-based sources which can be used to answer it, with the sources overlapping in their vocabularies, and differing in their access restrictions (required arguments) and cost. PDQ can determine if the query can be answered using the data sources, and if so can generate the optimal plan.

PDQ can also be applied to more traditional database planning scenarios, such as optimization of constraints in relational databases in the presence of integrity constraints and query optimization using materialized views. PDQ works by generating query plans from proofs that a query is answerable. The PDQ planner performs optimization via exploring a space of proofs, with each proof corresponding to a different plan.

The scenarios above require reasoning with queries and constraints. Included in PDQ is a module for deriving logical consequences of a database under a set of integrity constraints, where the constraints take the form of logical formulas known as dependencies. This module can be used stand-alone, allowing PDQ to be utilized to solve other problems involving reasoning with dependencies, such as data exchange.

How does PDQ work?

PDQ's planning module takes as input a schema S that consists of relation descriptions (attributes and data types) and integrity constraints that are dependencies. Relations are endowed with access methods, each of which associates a relation with a set of required input attributes. Relations and access methods are assumed to carry cost information, such as selectivity and per-tuple access cost. The second input is a Select-Project-Join query Q. Other inputs to PDQ include parameters to our search strategies and heuristics.

The first step of the planning phase is to augment the input schema S with new relations and with rules that tell how these new relations are related to the original ones. For each relation R in the original schema S, we have a relation InferredAccR that informally denotes all tuples which can be inferred to be in R using access methods and constraints. We also have a relation accessible(x) which holds all values that can be retrieved via accesses. In addition to the new relations, our augmented schema will have new integrity constraints, which we refer to as "accessibility axioms", that capture the informal semantics of relations of the form InferredAccR, and the relation accessible.

A proof is a sequence of facts produced by firing either integrity constraint rules or accessibility axioms. The sequence of facts in the proof will be needed to determine whether a proof is successful and, if it is not yet successful, to determine what new rules could be applied to extend it. PDQ's proof-to-plan algorithm is in charge of producing a plan from a proof. This is achieved by looking at the sequence of accessibility axiom rule firings (and corresponding set of facts) that are contained in the proof. The full set of possible proofs forms a tree. Proof search proceeds by exploring this tree, expanding one branch at a time. As a branch is expanded, the proof is checked to see if it is successful, a plan is generated from the proof and the cost of the plan is computed. The exploration of a branch stops when a successful proof is found or when the cost of the associated plans is too high. For a tree node that corresponds to a not-yet-successful proof, the accessibility axioms that can still be fired represent candidates for extending the tree.

For more details, here are some references that explain the details of several aspects of PDQ:

  • Generating Low-cost Plans From Proofs - Michael Benedikt, Balder ten Cate and Efthymia Tsamoura
    PODS 2014
  • PDQ: Proof-driven Query Answering over Web-based Data - Michael Benedikt, Julien Leblay and Efthymia Tsamoura
    VLDB 2014 (demo)
  • Querying with Access Patterns and Integrity Constraints - Michael Benedikt, Julien Leblay and Efthymia Tsamoura
    VLDB 2015
  • Interpolation with decidable fixpoint logics - Michael Benedikt, Balder ten Cate, and Michael Vanden Boom
    ICALP 2015
  • Generating Plans from Proofs: The Interpolation-based Approach to Query Reformulation - Michael Benedikt, Julien Leblay, Efthymia Tsamoura, and Balder ten Cate
    Book Published By Morgan Claypool
  • Biological Web Services: Integration, Optimization, and Reasoning - Michael Benedikt, Rodrigo Lopez-Serrano, and Efthymia Tsamoura
    IJCAI BAI 2016
  • Reformulating Queries: Theory and Practice - Michael Benedikt, Egor Kostylev, Fabio Mogavero, and Efthymia Tsamoura
    IJCAI 2017
  • Characterizing Definability in Decidable Fixpoint Logics - Michael Benedikt and Pierre Bourhis and Michael Vanden Boom
    ICALP 2017
  • When Can We Answer Queries Using Result-bounded Data Interfaces - Antoine Amarilli and Michael Benedikt
    PODS 2018

How did PDQ come about?

The PDQ project has been run by Michael Benedikt. The developers of the first version of PDQ were Julien Leblay and Efthymia Tsamoura.

The principle developers of the current revamped version of PDQ, PDQ 2.0, are Gabor Gyorkei and Efthymia Tsamoura. This version contains:

  • a cleaner plan language
  • a completely differently internal data management architecture for reasoning, including a home-made main memory database so that PDQ can be used without any external DBMS
  • a simplified and more robust planning architecture,
  • support for more cost functions, including those based on histogram statistics
  • a simplified specification language for web service wrappers
  • support for the chasebench format for TGDs and queries
  • a more robust runtime

The Web GUI and Rest interfaces for PDQ 2.0 were developed by Camilo Ortiz.

Tim Hobson helped in the development of the runtime of PDQ 2.0, while Mark Ridler contributed to the web service wrapper infrastructure and console GUI. George Konstantinidis worked on the revision of the data management architecture and planner, while Stefano Germano worked on the testing, refactoring, and packaging of the software.

The management of the release of PDQ 2.0, which included setting up continuous integration infrastructure, testing, debugging, and re-factoring, was done by Fergus Cooper with assistance from Stefano Germano.

Navigation

Please choose your topic from the sidebar on the right.

⚠️ **GitHub.com Fallback** ⚠️