User level Input files - ProofDrivenQuerying/pdq GitHub Wiki
PDQ's executables are run with three types of files.
This portion of the guide explains the format of the user-level input files.
A schema file is always required for each PDQ subproject, while other files such as query or plan may also be required depending on the particular component. For example, planner expects a query and a schema, outputting a plan, while runtime expects a plan and a schema, outputting data.
Schema files have the following structure:
<schema>
<relations>...</relations>
<dependencies>...</dependencies>
</schema>
Schema files contain a sequence of elements that are either <relations> or <dependencies> content.
The <relations> element contains a list of <relation> and <view> elements.
A <relation> element contains a list of <attribute> elements, each giving an attribute name and type. The ordering of the attributes associates each attribute with a position, with the first attribute in the list associated with position 0. A <relation> can also contain a list of access method elements, each having a name and optionally a string listing the input positions.
Consider the following example schema:
<relations>
<relation name="R">
<attribute name="a" type="java.lang.Integer"/>
<attribute name="b" type="java.lang.Integer"/>
<attribute name="c" type="java.lang.Integer"/>
<access-method name="m1"/>
</relation>
<relation name="S">
<attribute name="b" type="java.lang.Integer"/>
<attribute name="c" type="java.lang.Integer"/>
<access-method name="m2" inputs="0"/>
</relation>
</relations>
<dependencies/>
It declares two relations: R, with attributes a, b, and c, and S, with attributes b and c. S has an access method m2 that takes position 0 as input, which corresponds to attribute b. Informally, we can use m2 to do a lookup on relation S with a value for attribute b. Relation R has an access method m1 that requires no input; thus invoking m1 will return the entire content of relation R. Note that the access method descriptions in the schema file are abstract, in that they do not say how the lookup is implemented. For planning purposes, the abstract description is all that is needed. For runtime purposes, we will require an additional data source description file that describes the 'concrete implementation' of each access method.
A <view> element is structured identically to a <relation>. The only difference between a view element and a relation is that a view element will have associated view definition dependencies in the <dependencies> list.
Dependencies describe integrity constraints on the relations. PDQ currently supports Equality-Generating Dependencies (EGDs) and Tuple-Generating Dependencies (TGDs).
A TGD is a logical sentence of the form ∀x1 ... xm [ A1(...) ∧ ... ∧ Aj(...) ➞ ∃ y1 ... yn (B1(...) ∧ ... ∧ Bk(...)) ] where each Ai and Bi is an atom (e.g. R(x1, x2, 'joe')). An example of a TGD is a referential constraint, saying that every entry in one relation matches an entry in another relation. For example, if we have an Employee relation storing ids of employees and a WorksIn relation storing which employees work in which department, then
∀ e [ Employee(e) ➞ ∃ d WorksIn(e,d) ]
is a TGD stating that every employee works in some department. The conjunction of atoms on the left-hand side is called the body of the TGD, while the conjunction of atoms on the right-hand side is called the head of the TGD.
An EGD is a logical sentence of the form ∀x1 ... xm [ A1(...) ∧ ... ∧ Aj(...) ➞ xi=xk ] where each Ai is an atom (e.g. R(x1, x2, 'joe')) and xi, xk are among the variables x1 ... xm. An example of an EGD is a key constraint. For example, if we have an integrity constraint that every employee can work in only one department, we could express it as an EGD:
∀ e d d' [ WorksIn(e, d) ∧ WorksIn(e, d') ➞ d=d' ]
TGDs and EGDs are captured in XML in a straightforward way: we just list the body and the head, since the quantifiers and connectives can be inferred from these. For example, the TGD ∀ e [ Employee(e) ➞ ∃ d WorksIn(e,d) ] would be captured as:
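A sketch of that encoding, assuming the same <dependency>, <body>, <head>, <atom>, and <variable> elements used by the other dependency examples in this guide (the variable names e and d are taken from the formula):

```xml
<!-- Sketch only: Employee(e) implies that some d exists with WorksIn(e, d). -->
<dependency>
  <body>
    <atom name="Employee">
      <variable name="e" />
    </atom>
  </body>
  <head>
    <atom name="WorksIn">
      <variable name="e" />
      <variable name="d" />
    </atom>
  </head>
</dependency>
```

Here d occurs only in the head, so it is read as existentially quantified, while e occurs in the body and is universally quantified.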
Note that as a convenience, we allow the name attribute to be omitted from a variable. This means that the variable is distinct from all other variables in the dependency. For example:
<dependency>
<body>
<atom name="R">
<variable name="x" />
<variable name="y" />
</atom>
<atom name="S">
<variable />
<constant type="java.lang.Integer" value="1"/>
<variable name="y" />
</atom>
</body>
<head>
<atom name="S">
<variable name="y" />
<constant type="java.lang.Integer" value="2"/>
<variable />
</atom>
</head>
</dependency>
expresses the TGD:
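∀ x y z [ R(x, y) ∧ S(z, 1, y) ➞ ∃ w S(y, 2, w) ]
Here z and w stand for the two unnamed variables, which are distinct from each other and from every other variable in the dependency.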
Every <view> element declared in the <relations> content of the schema should have two dependencies corresponding to the view definition. For example, suppose the <relations> element contains:
<view name="ProjectR">
<attribute name="a" type="java.lang.Integer"/>
<attribute name="b" type="java.lang.Integer"/>
</view>
and suppose we intend for ProjectR to be a view representing the projection of relation R on attributes a and b. Then we should have dependencies
<dependency>
<body>
<atom name="R">
<variable name="a" />
<variable name="b" />
<variable name="c" />
</atom>
</body>
<head>
<atom name="ProjectR">
<variable name="a" />
<variable name="b" />
</atom>
</head>
</dependency>
<dependency>
<body>
<atom name="ProjectR">
<variable name="a" />
<variable name="b" />
</atom>
</body>
<head>
<atom name="R">
<variable name="a" />
<variable name="b" />
<variable name="c" />
</atom>
</head>
</dependency>
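Putting the pieces together, a complete schema file for this view could look like the following sketch. It assumes the <schema> wrapper shown at the start of this guide, reuses relation R and its access method m1 from the earlier example, and gives the view no access method of its own. It is an illustration of how the parts nest and has not been validated against PDQ.

```xml
<schema>
  <relations>
    <relation name="R">
      <attribute name="a" type="java.lang.Integer"/>
      <attribute name="b" type="java.lang.Integer"/>
      <attribute name="c" type="java.lang.Integer"/>
      <access-method name="m1"/>
    </relation>
    <view name="ProjectR">
      <attribute name="a" type="java.lang.Integer"/>
      <attribute name="b" type="java.lang.Integer"/>
    </view>
  </relations>
  <dependencies>
    <!-- Every row of R contributes a row to the view. -->
    <dependency>
      <body>
        <atom name="R">
          <variable name="a" />
          <variable name="b" />
          <variable name="c" />
        </atom>
      </body>
      <head>
        <atom name="ProjectR">
          <variable name="a" />
          <variable name="b" />
        </atom>
      </head>
    </dependency>
    <!-- Every row of the view comes from some row of R (c is existential). -->
    <dependency>
      <body>
        <atom name="ProjectR">
          <variable name="a" />
          <variable name="b" />
        </atom>
      </body>
      <head>
        <atom name="R">
          <variable name="a" />
          <variable name="b" />
          <variable name="c" />
        </atom>
      </head>
    </dependency>
  </dependencies>
</schema>
```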
PDQ supports user queries that are Conjunctive Queries (CQs): logical expressions of the form
{ x1... xj | ∃ y1 ... yk A1(....) ∧ ..... ∧ Am(....) }
For example, the query asking for all employees who worked in the same department as the employee with id number 22 would be written as the CQ
{ e | ∃ d WorkedIn(22, d) ∧ WorkedIn(e,d) }
PDQ's XML syntax for defining queries follows that of dependencies, except the head will just consist of a name for the query and a sequence of variables. The example above would be written in XML as:
<query>
<body>
<atom name="WorkedIn">
<constant type="java.lang.Integer" value="22"/>
<variable name="d" />
</atom>
<atom name="WorkedIn">
<variable name="e" />
<variable name="d" />
</atom>
</body>
<head name="Q">
<variable name="e" />
</head>
</query>
As with dependencies, the variable name can be omitted, which means that it is distinct from all others.
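As a small illustration of an omitted variable name, the query { e | ∃ d WorkedIn(e, d) }, asking for every employee who worked in some department, could be sketched as follows. The query itself is a hypothetical example; only the element names come from the example above.

```xml
<query>
  <body>
    <atom name="WorkedIn">
      <variable name="e" />
      <!-- Unnamed variable: the department, distinct from every other variable. -->
      <variable />
    </atom>
  </body>
  <head name="Q">
    <variable name="e" />
  </head>
</query>
```

Since the department is never referred to again, leaving it unnamed avoids inventing a variable name that is used only once.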
The PDQ planner module outputs a plan annotated with an expected cost. The PDQ runtime module takes a plan as input and executes it, assuming that every abstract access method in the plan has a corresponding implementation. Plans are written in a variation of relational algebra, built up from basic accesses, each using one of the access methods, combined via renamings, joins, and projections. The relational algebra terms are encoded in XML in a straightforward way. For example, here is a plan annotated with a cost of 10. The plan accesses relation Udirectory using an access method (m3) that requires no input. The output of the access is a set of rows of Udirectory, whose attributes are renamed to c0 and c2. This output is then used to access relation Profinfo, using an access method (m2) that requires a value for position 0 of Profinfo.
Note that currently each <relation> element occurring within the plan lists all the attributes and all the access methods on that relation.
<RelationalTermWithCost type="DependentJoinTerm">
<RelationalTerm type="RenameTerm">
<renamings name="c0" type="java.lang.String"/>
<renamings name="c2" type="java.lang.String"/>
<RelationalTerm type="AccessTerm">
<accessMethod name="m3"/>
<relation name="Udirectory">
<attribute name="employeeid" type="java.lang.String"/>
<attribute name="lastname" type="java.lang.String"/>
<access-method name="m3"/>
</relation>
</RelationalTerm>
</RelationalTerm>
<RelationalTerm type="RenameTerm">
<renamings name="c0" type="java.lang.String"/>
<renamings name="c1" type="java.lang.String"/>
<renamings name="c2" type="java.lang.String"/>
<RelationalTerm type="AccessTerm">
<accessMethod name="m2" inputs="0"/>
<relation name="Profinfo">
<attribute name="employeeid" type="java.lang.String"/>
<attribute name="officenumber" type="java.lang.String"/>
<attribute name="lastname" type="java.lang.String"/>
<access-method name="m2" inputs="0"/>
</relation>
</RelationalTerm>
</RelationalTerm>
<cost value="10.0" type="DoubleCost"/>
</RelationalTermWithCost>
The cost module estimates the cost of a plan; its estimators are used by the planning module. Estimators can make use of an additional input file, a catalog file, that has information such as the estimated number of entries in a relation, or the number of values that will be returned by a particular access method.
Schema and catalog files provide metadata for data sources. These sources could be virtual, or accessible via web services or an SQL API. But data can also be described directly in PDQ's own input format. PDQ uses a straightforward CSV format for data: each line describes a row, and a row is described by listing the values of each attribute separated by commas. For example, a row of the relation R above with a=1, b=2, and c=3 would be written as the line 1,2,3.
The reasoning module of PDQ also supports a textual format for schemas and queries, arising from the Chasebench project. You can read more about this here.