DDlog Language Features - HazyResearch/ddlog GitHub Wiki

ddlog is a language for writing DeepDive applications in datalog-like syntax. A ddlog program is compiled into the configuration file in deepdive's format, and can be used to run applications.

A ddlog program is composed of statments. The statements are delimited by dots, and the order of the statments is irrelevant. A statement can be a schema declaration, a datalog rule, a function declaration, a function call rule, a supervision rule, or an inference rule.

Schema Declaration

Each schema declaration declares the type of each column of a relation and the order of the columns in a relation. The syntax is like SQL schema:

relation_name(
  column1_name  column1_type,
  column2_name  column2_type,
  ...).

relation_name is any identifier that denotes a relation name in the database. Similarly column_names and column_types are strings that correspond to the column names and types used in the database. ddlog uses the same set of types as in SQL.

For example,

sentences(
  doc_id  text,
  sent_id int,
  text    text,
  words   text[]).

defines a relation articles that contains 4 columns: doc_id, sent_id, text, and words, and the types of the columns are text, int, text, text[], where text[] denotes a text array.

A question mark after the relation name indicates that the relation is a variable relation containing random variables in the factor graph. The columns of the variable relation serve as a key of that relation. For example,

has_spouse?(relation_id text).

defines a variable relation has_spouse, and each different relation_id represents a different variable.

Normal datalog Rules

Typical datalog rules can be used for defining how a relation is derived from other relations. The head is the relation being defined, and the body is a conjunction of conjunctive query bodies, separated by commas. Disjunctive cases are expressed with multiple bodies separated by semicolons.

For example, the following program states that the tuples of Q are derived from R and S, where the second column of R is unified with the first column of S, i.e., the body is a equi-join between R and S.

Q(x, y) :- R(x, y), S(y).

Here, Q(x, y) is the head, and R(x, y) and S(y) are body atoms. x and y are variables of the rule.

Expressions

Expressions can be used in the atom columns. Expressions in the head atom will be output to the head relation. Expressions in the body atom will become a condition. Note datalog's semantics is based on unification of variables/expressions: variables with the same name are unified, and an expression in the body atom is unified with its corresponding column. A comprehensive example is shown below.

a(k int).
b(k int, p text, q text, r int).
c(s text, n int, t text).

Q("test", f(123), id) :- a(id), b(id, x,y,z), c(x || y,10,"foo").

Here the ouptut is a tuple of string literal "text", a function f operating on 123, and id produced by body. In the body, the conditions are compiled to

a.k = b.k -- unifying variable id in a(id) and b(id, ...)
AND c.n = 10  -- unifying literal 10 with the column in c(...,10,...)
AND c.t = 'foo' -- unifying literal "foo" with the column in c(...,"foo")
AND b.p || b.q = c.s  -- unifying expression x||y with the column in c(x||y,...)

Expressions can be arbitrarily nested.

Conditions

Conditions are put at the same level of body atoms, and collectively with body atoms defined the output to the head. Conjunctive conditions are seperated by comma, and disjunctive conditions are separated by semicolon. Conditions can be arbitrarily nested by putting inside square brackets. For example,

Q(x) :- b(x,y,z,w), [x + w = 100; [!x > 50, w < 10]].

The conditions are (x + w = 100) OR ((NOT x > 50) AND w < 10).

Placeholder

A placeholder _ can be used in columns of body atoms, indicating the output is irrelevant to that column.

Aggregation

Using aggregation functions in the head atom columns indicates doing a group by on the rest of columns, and an aggregate . Supported aggregation functions are SUM, MAX, MIN, ARRAY_AGG, COUNT. For example,

Q(a,b,MAX(c)) :- R(a,b,c).

will group by the first and second columns of R, and take the max of the third column.

Supervision Rule

A supervision rule is used for scoping and supervising the random variables, i.e., defining variable set and training data. The syntax is the same as a normal datalog rule, with an annotaion @label(column_variable). For example,

@label(is_true)
has_spouse(rid) :- has_spouse_candidates(_, _, _, _, rid, is_true).

This rule supervises random variables in has_spouse using is_true column in has_spouse_candidates.

Inference Rule

An inference rule is used for expressing how to infer random variables. An inference rule's head formula defines a factor (hyper-edge) in the factor graph. Atoms in the head must be variable relations. The weight tying is expressed before the rule itself using @weight(expression) syntax, where expression can be any expression. For example,

@weight(3)
smoke(x) => cancer(x) :- person(x).
@weight(feature)
smoke(x)  =  smoke(y) :- friend(x, y, feature).

The first rule expresses that a person smokes implies he/she has a cancer. The second rule expresses that if x and y are friends, they both smoke or neither smokes, and the weight is tying to the feature column in friend relation. The weight can also be a string constant, meaning the rule's factors all tie to the same weight. Supported deepdive factor function and corresponding syntax of head are:

Imply  : A1, A2, ... => An
Equal  : A1 = A2 = ... = An
And    : A1 ^ A2 ^ ... ^ An
Or     : A1 v A2 v ... v An
IsTrue : A
Multinomial: Multinomial(A1, A2, ..., An)

where As are atoms.

Compiling

Given the ddlog compiler ddlog.jar, a ddlog program app.ddlog can be compiled using

java -jar ddlog.jar compile app.ddlog > deepdive.conf

Extensions

This section describes more expressivity of ddlog's conjunctive query.

Quantified Body

A quantified body indicates a quantified condition. The syntax is

quantifier(atom_or_condition, ...)

Supports quantifier are EXISTS, ALL, OPTIONAL. The first two resemble corresponding quantifiers in SQL, the third corresponding to a outer relation in an outer join in SQL. For example,

Q(a) :- R(a,b), EXISTS[S(c, a)].
Q(a) :- R(a,b), EXISTS[S(c, d), b > d].
Q(a) :- R(a,b), OPTIONAL[S(c, d), a > d].

Function Declaration

Each function declaration declares the types of the column of the input and output of a user-defined function (UDF) as well as how its implementation can be executed. Syntax:

function function_name function_input_type function_return_type implemetation

function is the key word, function_type and function_return_type can be

over relation_type_declaration

returns relation_type_declaration

relation_type_declaration can either be a list of column name and types, or rows like relation_name to use the same schema as in the reltion. implementation is

implementation "path_to_udf" handles tsv lines

to indicate a tsv function, or handles tsv lines replaced by runs as plpy to indicate a plpy extractor function. For example,

Q(x int).

function ext_people over (a int, b int) returns like Q
  implementation "udf/ext_people.py" handles tsv lines.

Function Call Rule

The functions declared as above can be used only in a function call rule. UDF functions can derive a relation from a conjunctive query. The syntax is:

output_relation += user_defined_function(x, y) :- relation1(x, y), relation2(y, z).

The tuples (x, y) from body relations will be input to user_defined_function, and the ouput goes into output_relation. For example,

has_spouse_candidates += ext_has_spouse(s, p1_id, p1_text, p2_id, p2_text) :-
  people_mentions(s, _, _, p1_text, p1_id),
  people_mentions(s, _, _, p2_text, p2_id).