uncertain - IITDBGroup/gprom GitHub Wiki

Uncertain Data and Querying Dirty Data

GProM supports querying data that is uncertain or dirty producing conservative answers bounding all possible outcomes and providing an under-approximation of what is guaranteed to be true (certain answers). This functionality is based on Uncertainty-annnotated database (UA-DB) model where uncertainty is encoded using intervals instead of concrete values and bounds on the multiplicities (number of duplicates of tuples). This SIGMOD blog article provides a gentle introduction.

Uncertainty-Annotated Databases

Uncertainty-annotated databases (UA-DBs) model a set of possible worlds, each being a deterministic database that is a possible version of reality. The idea is that we do not know for certain the true state of the world, but know that it is one of the possible worlds, we just do not know which one. UA-DBs model large sets of possible worlds compactly using intervals to encode sets of possible values and multiplicity intervals to encode uncertainty about the existence and number of duplicates of a row. For example, consider a patient table where some of the heart rate and blood pressure measurements are uncertain as the measurement device has a tolerance of +-5. So for example, Peter's heart rate is only known to be with [100,110] as the measured value was 105. Furthermore, for some patients there are multiple measurements and we do not know which one is the correct one.

pname	heartrate	bloodpressure
Peter	135 [130,140]	78 [73,83]
Bob	90 [85,95]	90 [85,95]
Bob	95 [90,100]	75 [70,80]
Alice	120 [115,125]	85 [80,90]

This uncertainty in the data should be taken into account when evaluating queries over the data. For example, consider a query that determines which patients should receive heart treatments (everybody with a heartrate over 120):

SELECT pname, heartrate
FROM vitals
WHERE heartrate > 120

The uncertainty in the reported heart rates needs to be taken into account to make sure that (i) patients that need treatment will get treated, (ii) patients that do not need heart treament are not treated, and (iii) for patients for which this cannot be conclusively determined we do follow-up and do additional measurements / diagnosis. UA-DBs enable modeling of data uncertainty and answering such questions by bounding all possible answers to a query evaluated over an uncertain database.

In an uncertainty-annotated table, for each attribute A there are two additional attributes A_lb and A_ub which store an interval [l,u] covering the possible values that A can take, i.e., we are uncertain about the true value of A but know that it lies in between l and u. Furthermore, each table has three additional attributes r_cet, r_bst, and r_pos which record how many copies of the row may exist (lower bound, a guess, and an upper bound). This is a relatively simple model of uncertainty, but is powerful in that it allows for efficient evaluation of expressive queries over uncertain data and can model (conservatively) a wide range of data quality issues and sources of uncertainty. For example, the set of possible worlds for the uncertain vitals table from above can be over-approximated using the following UA-DB:

pname	heartrate	bloodpressure	pname_ub	heartrate_ub	bloodpressure_ub	pname_lb	heartrate_lb	bloodpressure_lb	r_cet	r_bst	r_pos
Peter	135	78	Peter	140	83	Peter	130	73	1	1	1
Bob	90	90	Bob	95	95	Bob	85	85	0	1	1
Bob	90	90	Bob	100	80	Bob	90	70	0	1	1
Alice	120	85	Alice	125	90	Alice	115	80	1	1	1

Note how the uncertainty about which of Bob's measurement records exist is expressed using 0 as the lower bound on the number of copies of this tuple that exist.

Running queries

To evaluate a query over an uncertain database, wrap the query in URANGE(...) and for each table select a method of how to model the uncertainty in the table as a UA-DB relation. GProM then generates a query that returns a UA-DB that is guaranteed to bound all possible query answers. Reconsider the heart treatment query from above. By appending IS RADB we tell the system that the vitals table (as shown above) is already a UA-DB.

URANGE(
SELECT pname, heartrate
FROM vitals IS RADB
WHERE heartrate > 120
)

GProM returns the following UA-DB table:

 pname | heartrate | ub_pname | lb_pname | ub_heartrate | lb_heartrate | cet_r | bst_r | pos_r |
------------------------------------------------------------------------------------------------
 Peter | 135       | Peter    | Peter    | 140          | 130          | 1     | 1     | 1     |
 Alice | 120       | Alice    | Alice    | 125          | 115          | 0     | 0     | 1     |

Only Peter and Alice could need heart treatment. GProM has correctly determined that Bob cannot possibly fulfill the condition of the query. Peter is guaranteed to need treatment (for any of his possible heartrate values) and, thus, is certainly in the query result (cet_r = 1). However, for Alice we cannot say with certainty whether she should get treated and, thus, may or may not be in the result of our query. This is encoded in the multiplicity lower bound for the tuple (cet_r = 0, i.e., this row may exists zero times = this row could not be part of the result). That means for Alice further diagnosis is needed to determine whether she should receive heart treatment. Now consider an administrator that wants to determine how many patients needs heart treatment to schedule operation room allocations. This can be determined using this query:

URANGE (
SELECT count(*) AS numtreat
FROM vitals IS RADB
WHERE heartrate > 120
);

GProM returns:

 numtreat | ub_numtreat | lb_numtreat | cet_r | bst_r | pos_r |
---------------------------------------------------------------
 1        | 2           | 1           | 1     | 1     | 1     |

Thus, there are between 1 and 2 patients that need heart treatment (for sure Peter and potentially also Alice).

From Incomplete, Probabilistic, and dirty data to UA-DBs

GProM provides several options for translating from other uncertain data models into UA-DBs that then can be queried using GProM.

Tuple Independent Databases (TIDB)

In a tuple independent database, each row is associated with a probability encoding the rows marginal probability of existing (rows are considered to be independent probabilistic events). To use a TIDB as an input, use IS TIP (pattr) where pattr stores the tuple probabilities.

Oracle SQL - Postgres:[email protected]:blackboxtest$SELECT * FROM tiptest;
 a | p   |
----------
 1 | 0.1 |
 2 | 0.3 |
 3 | 0.9 |
 4 | 1   |
 1 | 0.1 |
 2 | 0.3 |
 3 | 0.9 |
 4 | 1   |
Oracle SQL - Postgres:[email protected]:blackboxtest$URANGE (SELECT * FROM tiptest IS TIP (p));
 a | ub_a | lb_a | cet_r | bst_r | pos_r |
------------------------------------------
 1 | 1    | 1    | 0     | 0     | 1     |
 2 | 2    | 2    | 0     | 0     | 1     |
 3 | 3    | 3    | 0     | 1     | 1     |
 4 | 4    | 4    | 1     | 1     | 1     |
 1 | 1    | 1    | 0     | 0     | 1     |
 2 | 2    | 2    | 0     | 0     | 1     |
 3 | 3    | 3    | 0     | 1     | 1     |
 4 | 4    | 4    | 1     | 1     | 1     |

Block Independent Databases (BIDB)

In a block independent database (BIDB), for each row there may exist multiple versions of the row, each associated with a probability of being the true version of this rows. Different versions of a row are considered disjoint events (only one version of a row can exist) and rows are again independent probabilistic events. To use a BIDB as an input, use IS XTABLE (tid,pattr) where pattr stores the tuple probabilities and tid identifies rows (all rows with the same tid value are considered to be the alternatives of a row):

Oracle SQL - Postgres:[email protected]:blackboxtest$SELECT * FROM xtabletest;
 a | b | tid | p   |
--------------------
 1 | 1 | 1   | 0.6 |
 1 | 2 | 1   | 0.2 |
 1 | 3 | 1   | 0.1 |
 2 | 1 | 2   | 1   |
 3 | 1 | 3   | 0.5 |
 3 | 2 | 3   | 0.5 |
 4 | 4 | 4   | 0.3 |
Oracle SQL - Postgres:[email protected]:blackboxtest$URANGE (SELECT * FROM xtabletest IS XTABLE (tid,p));
 a | b | ub_a | lb_a | ub_b | lb_b | cet_r | bst_r | pos_r |
------------------------------------------------------------
 1 | 1 | 1    | 1    | 3    | 1    | 0     | 1     | 1     |
 2 | 1 | 2    | 2    | 1    | 1    | 1     | 1     | 1     |
 3 | 1 | 3    | 3    | 2    | 1    | 1     | 1     | 1     |
 4 | 4 | 4    | 4    | 4    | 4    | 0     | 0     | 1     |