sql_extensions - IITDBGroup/gprom GitHub Wiki
Users can ask for the provenance of a query by using the new PROVENANCE OF construct. E.g.,
PROVENANCE OF (SELECT * FROM r);Such an expression will be rewritten into a instrumented query that returns the provenance of the input query. The result of a query instrumented for provenance capture is the original query result paired with provenance. Additional attributes are added to the result schema to store the provenance. For example, if SELECT * FROM r returns
a | b | c
---+---+---
1 | 2 | 3
1 | 3 | 4
2 | 5 | 3
2 | 1 | 7
3 | 4 | 8
3 | 5 | 9
then PROVENANCE OF (SELECT * FROM r); will return:
a | b | c | prov_r_a | prov_r_b | prov_r_c
---+---+---+------------------+------------------+------------------
1 | 2 | 3 | 1 | 2 | 3
1 | 3 | 4 | 1 | 3 | 4
2 | 5 | 3 | 2 | 5 | 3
2 | 1 | 7 | 2 | 1 | 7
3 | 4 | 8 | 3 | 4 | 8
3 | 5 | 9 | 3 | 5 | 9
Obviously this is not very useful for SELECT * ... style queries. Consider a slightly more complex query using aggregation over the table from above:
SELECT a, sum(b) AS x
FROM r
GROUP BY a;This query returns:
a | x
---+---
1 | 5
2 | 6
3 | 9
To compute the provenance of this query run:
PROVENANCE OF (
SELECT a, sum(b) AS x
FROM r
GROUP BY a
);which returns:
a | x | prov_r_a | prov_r_b | prov_r_c
---+---+---+------------------+------------------+------------------
1 | 5 | 1 | 2 | 3
1 | 5 | 1 | 3 | 4
2 | 6 | 2 | 5 | 3
2 | 6 | 2 | 1 | 7
3 | 9 | 3 | 4 | 8
3 | 9 | 3 | 5 | 9
A PROVENANCE OF request is treated as a query construct, e.g., it can be used as a subquery.
SELECT *
FROM
PROVENANCE OF (SELECT * FROM r) AS p
WHERE prov_r_a = 3;To have Gprom add an tuple identifier to each row in the original query's result add the WITH RESULTTIDS row.
PROVENANCE WITH RESULTTIDS OF (...)We enable the user to choose how far to trace back provenance and to instruct our system to use certain attributes as provenance.
By using the BASERELATION keyword after a FROM clause item within a provenance statement, the user instructs the system to not trace provenance into the subquery, but instead treat it as an input relation.
PROVENANCE OF (
SELECT month, total
FROM (SELECT month, creditc, SUM(amount) AS total
FROM purchase p
GROUP BY month, creditc) BASERELATION AS monthly
);The query above would return explain an output tuple based on the tuples from subquery monthly that were used to derive it.
By using the HAS PROVENANCE (attrList) expression after a FROM clause item the user instructs the system to consider these attributes as the provenance of the from clause item
PROVENANCE OF (
SELECT sum(a)
FROM R HAS PROVENANCE (myProvAttr1, myProvAttr2)
GROUP BY b
);Similarly, if only certain attributes should be duplicated as provenance this can be specified by using USE PROVENANCE (attrList):
PROVENANCE OF (
SELECT sum(a)
FROM R USE PROVENANCE (a)
GROUP BY b
);The user can also ask the system to show provenance for intermediate results of a query.
PROVENANCE OF (
SELECT count(*) AS highTotals
FROM
(SELECT sum(amount) AS total
FROM accounts
GROUP BY user
HAVING total > 10000) SHOW INTERMEDIATE PROVENANCE AS sub
);In this example, each tuple in the result would be paired with its provenance in the accounts table and with its provenance in the results of the subquery sub. Consider the following accounts table:
usr | amount
--------+--------
Peter | 6000
Peter | 8000
Peter | 50
Astrid | 100
Astrid | 100
Astrid | 9900
The result would be
hightotals | prov_sub_total | prov_accounts_usr | prov_accounts_amount
------------+----------------+--------------------------+-----------------------------
2 | 14050 | Peter | 6000
2 | 14050 | Peter | 8000
2 | 14050 | Peter | 50
2 | 10100 | Astrid | 100
2 | 10100 | Astrid | 100
2 | 10100 | Astrid | 9900
GProM supports evaluating provenance for SPJ queries in several semirings. You should make sure to have a single attribute in the input tables (or in a table created by a subquery) that store the annotations you want to process and that you inform gprom about it using HAS PROVENANCE. You can select one of the supported semirings:
-
N: natural number semiring, -
NX: provenance polynomials, -
TROPICAL: tropical semiring -
VITERBI: Viterbi semiring by adding theWITH SEMIRING COMBINERclause:
PROVENANCE WITH SEMIRING COMBINER NX OF (
SELECT 1
FROM r_nx HAS PROVENANCE (prov),
s_nx HAS PROVENANCE (prov)
);Alternatively, you can specify the addition and multiplication operations. The addition operation should be an expression involving an aggregation function over a single attribute (the naming is irrelevant). The multiplication operation is a scalar expression over two variables. For instance, for the natural numbers semiring:
PROVENANCE WITH SEMIRING COMBINER ADD (sum(k)) MULT (k + l) OF (...)As an example consider this session using the default gprom duckdb database (blackboxtests/testdb/testDB-sqlite.sql):
Oracle SQL - DuckDB:./blackboxtests/test-duckdb.db$SELECT * FROM r_nx;
A | B | PROV |
---------------
1 | 1 | x1 |
1 | 2 | x2 |
2 | 1 | x3 |
2 | 3 | x4 |
Oracle SQL - DuckDB:./blackboxtests/test-duckdb.db$SELECT * FROM s_nx;
C | D | PROV |
---------------
2 | 2 | y1 |
2 | 3 | y2 |
3 | 2 | y3 |
1 | 4 | y4 |
Oracle SQL - DuckDB:./blackboxtests/test-duckdb.db$PROVENANCE WITH SEMIRING COMBINER NX OF (SELECT 1 FROM r_nx HAS PROVENANCE (prov), s_nx HAS PROVENANCE (prov));
1 | PROV |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | (x1 * y1) + (x2 * y1) + (x3 * y1) + (x4 * y1) + (x1 * y2) + (x2 * y2) + (x3 * y2) + (x4 * y2) + (x1 * y3) + (x2 * y3) + (x3 * y3) + (x4 * y3) + (x1 * y4) + (x2 * y4) + (x3 * y4) + (x4 * y4) |
Oracle SQL - DuckDB:./blackboxtests/test-duckdb.db$Using the AS OF clause the user can request the provenance of a query as it would have been if the query would have been executed at some time in the past. AS OF can either be specified as a timestamp TIMESTAMP '2013-12-01 08:00:00' or using an SCN SCN 12345. This is currently only supported for the Oracle backend.
PROVENANCE AS OF TIMESTAMP '2013-12-01 08:00:00' OF (SELECT * FROM r);GProM also can retroactively capture the provenance of a transaction. This is currently only supported for Oracle backends and require Oracles Flashback Archive (FBA) and auditing features to be activated. Auditing is used to access the SQL statements executed by the transaction and FBA is used for time travel queries.
To capture the provenance, you have to know the transaction XID (transaction identifier) of a transaction that was executed at some point in the past. To get the internal identifier of a transaction use Oracle's audit logging facility https://blogs.oracle.com/imc/entry/oracle_database_12c_new_unified: For instance, to compute provenance for transaction 12345 run:
PROVENANCE OF TRANSACTION 12345;The user can also select the table modified by the transaction for which provenance should be tracked. The default is the last table updated by the transaction.
If you want to compute the provenance of a different table modified by the transaction then you have to use the WITH TABLE X option:
PROVENANCE WITH TABLE S OF TRANSACTION 12345;To restrict the results to rows modified by the transaction, add the ONLY UPDATED option:
PROVENANCE WITH ONLY UPDATED OF TRANSACTION 12345;Furthermore, to show intermediate versions of the tracked table as seen by each update of the transaction add the SHOW INTERMEDIATE option:
PROVENANCE WITH SHOW INTERMEDIATE OF TRANSACTION 12345;When the STATEMENT ANNOTATIONS option is given then additional boolean attributes are added to the schema which store for each row which of the transaction's updates did modify this row:
PROVENANCE WITH STATEMENT ANNOTATIONS OF TRANSACTION 12345;GProM also supports reenactment of a list of statements provided by the user. These statements can either be executed over the current database state or over a past state (currently this is only works for Oracle backends if Oracle's time travel features are activated). For example, to get the state of relation R produced by the updates:
UPDATE R SET a = a + 2 WHERE b = 5;
DELETE FROM R WHERE b > 3;)you would have to use the reenactment request:
REENACT (UPDATE R SET a = a + 2 WHERE b = 5; DELETE FROM R WHERE b > 3;);The reenact statement can also be used to track provenance for the reenacted operations:
REENACT WITH PROVENANCE (UPDATE R SET a = a + 2 WHERE b = 5; UPDATE R SET b = 6 WHERE b <= 5;);For reenacting with provenance, we also support the options influencing provenance tracking as described above for provenance capture for transactions, e.g.,
REENACT WITH PROVENANCE ONLY UPDATED SHOW INTERMEDIATE STATEMENT ANNOTATIONS (UPDATE R SET a = 2 WHERE B = 1; INSERT INTO R VALUES (10,20););GProM also has limited support for DDL commands in reenactment, e.g.,
REENACT (CREATE TABLE mytab (a int, b int); INSERT INTO mytab VALUES (1,1); INSERT INTO mytab VALUES (2,3); UPDATE mytab SET a = a + 2 WHERE b = 3;);would yield:
A|B|
----
1|1|
4|3|
Note that provenance tracking for DDL commands is currently not supported yet.
