tutorial - IITDBGroup/gprom GitHub Wiki

Quick Tutorial

Installation

The wiki has detailed installation instructions. In a nutshell, GProM can be compiled with support for different database backends and is linked against the C client libraries of these backends. The installation follows the standard procedure using the GNU build tools. If you just want to try out the system, we recommend building with SQLite support only, since this requires the least amount of setup. Check out the git repository, install all dependencies, and run (alternatively, use the docker image iitdbgroup/gprom):

./autogen.sh
./configure --enable-sqlite
make
sudo make install

Your First GProM Session

To use gprom, the interactive shell of GProM, you need one of the supported backend databases installed. For casual use, SQLite is sufficient; however, to fully exploit the features of GProM, you should use Oracle. When starting gprom, you have to specify connection parameters for the database. For example, using one of the convenience wrapper scripts that ship with GProM, you can connect to a test SQLite database included in the repository by running the following command in the main source folder after installation:

./scripts/gprom-sqlite.sh -db ./examples/test.db

This starts the shell connected to the SQLite database ./examples/test.db. If GProM is able to connect to the database, it spawns a shell like this:

GProM Commandline Client
Please input a SQL command, '\q' to exit the program, or '\h' for help
======================================================================

Oracle SQL - SQLite:./examples/test.db$

In this shell you can enter SQL and utility commands. The shell in turn shows you query results (just like your favorite DB shell). However, the main use of GProM is on-demand capture of provenance for database operations. You can access this functionality through several new SQL language constructs supported by GProM. Importantly, these language constructs behave like queries and, thus, can be used as part of more complex queries. Assuming we have a table R(A,B), let us ask our first provenance query.
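If you want to follow along without the shipped example database, a table r matching the data used below can be recreated in SQLite. A minimal sketch using Python's sqlite3 module (an in-memory database is an assumption for illustration; pass a file name instead to get a database you can open with gprom):

```python
# Minimal sketch: recreate the example table r(a, b) from this tutorial
# in SQLite. An in-memory database is used here for illustration; pass a
# file name such as "test.db" instead to persist it for use with gprom.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a INT, b INT)")
conn.executemany(
    "INSERT INTO r VALUES (?, ?)",
    [(1, 1), (2, 3), (4, 1), (1, 1), (2, 3), (4, 1)],  # rows shown below
)
conn.commit()
rows = conn.execute("SELECT a, b FROM r").fetchall()
print(rows)  # [(1, 1), (2, 3), (4, 1), (1, 1), (2, 3), (4, 1)]
```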

Oracle SQL - SQLite:./examples/test.db$SELECT * FROM r;
 A | B |
--------
 1 | 1 |
 2 | 3 |
 4 | 1 |
 1 | 1 |
 2 | 3 |
 4 | 1 |
Oracle SQL - SQLite:./examples/test.db$PROVENANCE OF (SELECT a FROM r);
 A | PROV_R_A | PROV_R_B |
--------------------------
 1 | 1        | 1        |
 2 | 2        | 3        |
 4 | 4        | 1        |
 1 | 1        | 1        |
 2 | 2        | 3        |
 4 | 4        | 1        |
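Because PROVENANCE OF (q) behaves like an ordinary query, it can also appear as a subquery inside a larger statement. A sketch (the provenance attribute names follow the PROV_R_A / PROV_R_B convention visible in the output above; see the wiki's documentation of the language extensions for the exact syntax GProM accepts):

```sql
-- Sketch: filter the provenance of a query on a provenance attribute.
-- Attribute names follow the naming convention shown in the output above.
SELECT *
FROM (PROVENANCE OF (SELECT a FROM r)) p
WHERE p.PROV_R_B = 1;
```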

As you can see, PROVENANCE OF (q) returns the same answer as query q, but adds additional provenance attributes. For each result row of the query, these attributes store the input row(s) that were used to compute that row. For example, the query result (1) was derived from row (1,1) in table R. For now, let us close the current session using the \q utility command:

Oracle SQL - SQLite:./examples/test.db$ \q

What to do next?

Provenance for SQL queries is only one of the features supported by GProM.

  • A full list of provenance SQL language extensions supported by GProM can be found in the wiki.
  • See the man page of gprom for further information on how to use the system's CLI.
  • GProM can also generate provenance graphs for why and why-not questions over Datalog queries. The Datalog dialect supported by GProM is documented here. For some examples of these provenance graphs and the research behind them, see Datalog Provenance.
  • GProM uses relational algebra as an intermediate code representation and features a heuristic and cost-based optimizer for such expressions.
  • GProM can compute provenance for transactions using reenactment and can reenact sequences of updates and DDL commands. The SQL language extensions implementing these features are explained here. For a more scientific explanation of reenactment see here.
  • GProM can evaluate temporal queries under snapshot-reducible semantics. See here.
  • GProM can bound the possible answers of queries over uncertain and dirty data to certify the robustness of query answers and highlight parts of the data that need more attention. See here.
  • For a convenient setup of GProM, have a look at the Docker containers we provide: docker