DDlog Language Features - HazyResearch/ddlog GitHub Wiki
ddlog is a language for writing DeepDive applications in datalog-like syntax. A ddlog program is compiled into the configuration file in deepdive's format, and can be used to run applications.
A ddlog program is composed of statments. The statements are delimited by dots, and the order of the statments is irrelevant. A statement can be a schema declaration, a datalog rule, a function declaration, a function call rule, a supervision rule, or an inference rule.
Schema Declaration
Each schema declaration declares the type of each column of a relation and the order of the columns in a relation. The syntax is like SQL schema:
relation_name(
column1_name column1_type,
column2_name column2_type,
...).
relation_name
is any identifier that denotes a relation name in the database.
Similarly column_name
s and column_type
s are strings that correspond to the
column names and types used in the database.
ddlog uses the same set of types as in SQL.
For example,
sentences(
doc_id text,
sent_id int,
text text,
words text[]).
defines a relation articles
that contains 4 columns: doc_id
, sent_id
, text
,
and words
, and the types of the columns are text
, int
, text
, text[]
, where
text[]
denotes a text
array.
A question mark after the relation name indicates that the relation is a variable relation containing random variables in the factor graph. The columns of the variable relation serve as a key of that relation. For example,
has_spouse?(relation_id text).
defines a variable relation has_spouse
, and each different relation_id
represents
a different variable.
Normal datalog Rules
Typical datalog rules can be used for defining how a relation is derived from other relations. The head is the relation being defined, and the body is a conjunction of conjunctive query bodies, separated by commas. Disjunctive cases are expressed with multiple bodies separated by semicolons.
For example, the following program states that the tuples of Q
are derived
from R
and S
, where the second column of R
is unified with the first column
of S
, i.e., the body is a equi-join between R
and S
.
Q(x, y) :- R(x, y), S(y).
Here, Q(x, y)
is the head, and R(x, y)
and S(y)
are body atoms. x
and y
are variables of the rule.
Expressions
Expressions can be used in the atom columns. Expressions in the head atom will be output to the head relation. Expressions in the body atom will become a condition. Note datalog's semantics is based on unification of variables/expressions: variables with the same name are unified, and an expression in the body atom is unified with its corresponding column. A comprehensive example is shown below.
a(k int).
b(k int, p text, q text, r int).
c(s text, n int, t text).
Q("test", f(123), id) :- a(id), b(id, x,y,z), c(x || y,10,"foo").
Here the ouptut is a tuple of string literal "text", a function f
operating on 123,
and id produced by body.
In the body, the conditions are compiled to
a.k = b.k -- unifying variable id in a(id) and b(id, ...)
AND c.n = 10 -- unifying literal 10 with the column in c(...,10,...)
AND c.t = 'foo' -- unifying literal "foo" with the column in c(...,"foo")
AND b.p || b.q = c.s -- unifying expression x||y with the column in c(x||y,...)
Expressions can be arbitrarily nested.
Conditions
Conditions are put at the same level of body atoms, and collectively with body atoms defined the output to the head. Conjunctive conditions are seperated by comma, and disjunctive conditions are separated by semicolon. Conditions can be arbitrarily nested by putting inside square brackets. For example,
Q(x) :- b(x,y,z,w), [x + w = 100; [!x > 50, w < 10]].
The conditions are (x + w = 100) OR ((NOT x > 50) AND w < 10)
.
Placeholder
A placeholder _
can be used in columns of body atoms, indicating the output is
irrelevant to that column.
Aggregation
Using aggregation functions in the head atom columns indicates doing a group by on the rest of columns, and an aggregate . Supported aggregation functions are SUM, MAX, MIN, ARRAY_AGG, COUNT. For example,
Q(a,b,MAX(c)) :- R(a,b,c).
will group by the first and second columns of R
, and take the max of the third column.
Supervision Rule
A supervision rule is used for scoping and supervising the random variables, i.e.,
defining variable set and training data.
The syntax is the same as a normal datalog rule, with an annotaion
@label(column_variable)
. For example,
@label(is_true)
has_spouse(rid) :- has_spouse_candidates(_, _, _, _, rid, is_true).
This rule supervises random variables in has_spouse
using is_true
column in
has_spouse_candidates
.
Inference Rule
An inference rule is used for expressing how to infer random variables.
An inference rule's head formula defines a factor (hyper-edge) in the factor graph.
Atoms in the head must be variable relations. The weight tying is expressed before
the rule itself using @weight(expression)
syntax, where expression
can be any
expression.
For example,
@weight(3)
smoke(x) => cancer(x) :- person(x).
@weight(feature)
smoke(x) = smoke(y) :- friend(x, y, feature).
The first rule expresses that a person smokes implies he/she has a cancer.
The second rule expresses that if x
and y
are friends, they both smoke or neither
smokes, and the weight is tying to the feature column in friend
relation.
The weight can also be a string constant, meaning the rule's factors all tie to
the same weight.
Supported deepdive factor function and corresponding syntax of head are:
Imply : A1, A2, ... => An
Equal : A1 = A2 = ... = An
And : A1 ^ A2 ^ ... ^ An
Or : A1 v A2 v ... v An
IsTrue : A
Multinomial: Multinomial(A1, A2, ..., An)
where A
s are atoms.
Compiling
Given the ddlog compiler ddlog.jar
, a ddlog program app.ddlog
can be compiled
using
java -jar ddlog.jar compile app.ddlog > deepdive.conf
Extensions
This section describes more expressivity of ddlog's conjunctive query.
Quantified Body
A quantified body indicates a quantified condition. The syntax is
quantifier(atom_or_condition, ...)
Supports quantifier are EXISTS
,
ALL
, OPTIONAL
. The first two resemble corresponding quantifiers in SQL, the third
corresponding to a outer relation in an outer join in SQL. For example,
Q(a) :- R(a,b), EXISTS[S(c, a)].
Q(a) :- R(a,b), EXISTS[S(c, d), b > d].
Q(a) :- R(a,b), OPTIONAL[S(c, d), a > d].
Function Declaration
Each function declaration declares the types of the column of the input and output of a user-defined function (UDF) as well as how its implementation can be executed. Syntax:
function function_name function_input_type function_return_type implemetation
function
is the key word, function_type
and function_return_type
can be
over relation_type_declaration
or
returns relation_type_declaration
relation_type_declaration
can either be a list of column name and types, or
rows like relation_name
to use the same schema as in the reltion. implementation
is
implementation "path_to_udf" handles tsv lines
to indicate a tsv function, or handles tsv lines
replaced by runs as plpy
to
indicate a plpy
extractor function.
For example,
Q(x int).
function ext_people over (a int, b int) returns like Q
implementation "udf/ext_people.py" handles tsv lines.
Function Call Rule
The functions declared as above can be used only in a function call rule. UDF functions can derive a relation from a conjunctive query. The syntax is:
output_relation += user_defined_function(x, y) :- relation1(x, y), relation2(y, z).
The tuples (x, y)
from body relations will be input to user_defined_function
,
and the ouput goes into output_relation
.
For example,
has_spouse_candidates += ext_has_spouse(s, p1_id, p1_text, p2_id, p2_text) :-
people_mentions(s, _, _, p1_text, p1_id),
people_mentions(s, _, _, p2_text, p2_id).