Concepts CTables - UBOdin/mimir GitHub Wiki

CTables

Mimir's data model is based on an abstraction by Imielinski and Lipski called C-Tables (Other overviews can be found here, here, and here). The representation has quite a few nice theoretical properties, but the main point of interest for us is that it's closed under relational algebra (the output of an RA expression on a C-Table can also be represented as a C-Table).

The basic challenge that the data model is trying to address is uncertainty about the exact values that the data is trying to encode. Let's say you don't know whether a column called "Evaluation" maps to one called "Rating" or one called "Value". There are two possible values for every row of Evaluation... and that's something that classical databases don't really know how to handle. The goal of C-Tables is to encode this uncertainty in a way that allows us to throw a classical deterministic database at a query and get a reasonable result.

Possible Worlds Semantics

Before defining a data model, we first need to settle on semantics. That is, if we run a query on our above example table with an "Evaluation" column, what do we expect it to return? Most folks working on incomplete/probabilistic databases have now settled on a set of rules called "Possible Worlds Semantics".

Let's be a little more precise: We can think of an ambiguous database **D** as a set of classical deterministic databases { D }, each representing one "possible world" or interpretation.

Let's say we have a normal, ordinary, deterministic query Q that we want to run on **D** (that is, we want Q(**D**)). Under possible worlds semantics, this query produces a set of possible results: { Q(D) | D ∈ **D** }, one from evaluating the query on each and every possible world.

This may be a little abstract, so let's go through an example based on Schrödinger's famous thought experiment. There are two possibilities for the contents of the box:

 Cat     | Status                  Cat     | Status
---------+---------      OR       ---------+---------
 Fluffy  | Alive                   Fluffy  | Dead

These are the possible worlds, the possible states of the database that in this case consists of just one table. Let's say Schrödinger runs a query on this table, like for example SELECT COUNT(*) AS cats FROM box WHERE status = 'Alive'. Under possible worlds semantics, this produces 2 possible results:

 cats              cats
------     OR     ------
 1                 0

Let's expand his thought experiment a little and say that we actually have two (statistically independent) cats. Now, we have four possible worlds:

 Cat     | Status                  Cat     | Status
---------+---------      OR       ---------+---------
 Fluffy  | Alive                   Fluffy  | Dead
 Rover   | Alive                   Rover   | Alive

                         OR

 Cat     | Status                  Cat     | Status
---------+---------      OR       ---------+---------
 Fluffy  | Alive                   Fluffy  | Dead
 Rover   | Dead                    Rover   | Dead

Just like before, we can run the same query, but observe that now we actually end up with three and not four possible results:

 cats              cats              cats
------     OR     ------     OR     ------
 2                 1                 0
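The world enumeration above is easy to sketch in code. The following is an illustrative sketch (not Mimir code) that enumerates the four possible worlds for two statistically independent cats and evaluates the counting query in each, showing that four worlds collapse to three distinct results:

```python
from itertools import product

# Each possible world assigns a status to every cat.
cats = ["Fluffy", "Rover"]
worlds = [dict(zip(cats, statuses))
          for statuses in product(["Alive", "Dead"], repeat=len(cats))]

def count_alive(world):
    # SELECT COUNT(*) ... WHERE status = 'Alive', evaluated in one world
    return sum(1 for status in world.values() if status == "Alive")

distinct_results = {count_alive(w) for w in worlds}
# 4 possible worlds, but only 3 distinct query results: {0, 1, 2}
```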

Possible Worlds with Probabilities

If we have a probability distribution over D, that translates naturally into a probability distribution over query results. For example, the probability that Q(D) is some relation (R) is the sum of the probabilities of all possible worlds (D in D) in which Q(D) evaluates to R.

Let's see how this works on the cat example above:

  • With just the one cat and a 50% chance of either possible world, we have a 50% chance of counting 1 living cat, and a 50% chance of counting 0 living cats.
  • Once we introduce 2 cats, there are now four possibilities and a 25% chance of each possible world. However, note that 2 of the possible worlds produce exactly the same result on the query, so the end result is an uneven distribution. There's a 25% chance of getting a result of 2 or 0, and a 50% chance of getting the result 1. In other words, the result is the distribution:
  p(X = 0) = 0.25
  p(X = 1) = 0.5
  p(X = 2) = 0.25
  • There's another possibility... let's say both cats are in the same cage. In other words, if one cat lives, the other cat lives (and vice versa). Now, we have 2 worlds with probability 50%, and 2 worlds with probability 0%. The resulting distribution is:
  p(X = 0) = 0.5
  p(X = 1) = 0.0
  p(X = 2) = 0.5
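Both distributions above fall out of the same rule: sum the probabilities of the worlds that produce each result. A small sketch (again, not Mimir code) of that lifting:

```python
from itertools import product
from collections import defaultdict

# Two independent cats: 4 equally likely worlds.
worlds = list(product(["Alive", "Dead"], repeat=2))
world_prob = {w: 0.25 for w in worlds}

def count_alive(world):
    return sum(1 for s in world if s == "Alive")

# Lift world probabilities to a distribution over query results.
result_prob = defaultdict(float)
for w, p in world_prob.items():
    result_prob[count_alive(w)] += p
# result_prob: {2: 0.25, 1: 0.5, 0: 0.25}

# The "same cage" variant: perfectly correlated cats, two worlds at 50%.
caged_prob = {("Alive", "Alive"): 0.5, ("Dead", "Dead"): 0.5}
caged_result = defaultdict(float)
for w, p in caged_prob.items():
    caged_result[count_alive(w)] += p
# caged_result: {2: 0.5, 0: 0.5}
```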

Possible worlds semantics are also agnostic to how we model or store our ambiguous database, as long as we can think of it as a set of possible worlds (even an infinite set). They're also agnostic to query semantics, as long as we can define how to evaluate a query on one specific possible world. Note that this doesn't mean these rules are efficient, just that they're a self-consistent way of thinking about queries on ambiguous data.

C-Tables

So let's make things more efficient. Can we store ambiguous data in a way that allows us to re-use an existing database engine? C-Tables are a step in this direction. A C-Table adds two things to a normal relational table: Labeled Nulls and Condition Columns.

First, we're going to add a construct called a labeled null, which looks at first glance a bit like a SQL NULL value, except with an extra 'label' attached to it (e.g., NULL1, NULL2, NULL3, etc...).

NULL values in SQL serve one of two purposes: (1) They indicate that a value simply doesn't exist (e.g., Spock's last name), or (2) They indicate that we don't know what a specific value should be (e.g., the air speed of an unladen swallow). A labeled null is specifically meant to serve in the latter role: It's a placeholder that says that a value does exist, but we don't know specifically what that value is. Each distinct label refers to one unknown value. This already gives us some added versatility. For example, we can state that NULL1 == NULL1, whereas under classical SQL semantics NULL ≠ NULL.

V-Tables

This also gives us a way to start thinking about some types of ambiguity (what we call attribute-level uncertainty). As an example, consider the C-Table (we use {{double curly braces}} for labeled nulls):

  Product   | Name     | Brand    
 -----------+----------+----------
  P123      | Apple 6s | {{NULL1}}
  P124      | Note2    | Samsung  
  P125      | iPhone 7 | Apple

Depending on the value we "plug in" for NULL1, we get a different table. For example, we could plug in "Microsoft" and get the (classical, deterministic) table:

  Product   | Name     | Brand    
 -----------+----------+----------
  P123      | Apple 6s | Microsoft
  P124      | Note2    | Samsung  
  P125      | iPhone 7 | Apple

Or we could plug in "Apple" and get a different (still deterministic, but probably more accurate) table:

  Product   | Name     | Brand    
 -----------+----------+----------
  P123      | Apple 6s | Apple
  P124      | Note2    | Samsung  
  P125      | iPhone 7 | Apple

We're going to leave open exactly how these values get "plugged in" for now (we'll discuss that when we get to Models), but for now you should take away that if I give you a C-Table and values to plug in for all of the labeled nulls, you can give me a normal, deterministic table. In other words, the table with labeled nulls defines an ambiguous relation, and the value assignment defines one specific "possible world".
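The "plugging in" step is mechanical enough to sketch directly. The following is an illustrative sketch (labeled nulls are modeled as placeholder strings like "?1" standing in for {{NULL1}}; this is not Mimir's actual representation):

```python
# A V-Table: cells are either concrete values or labeled-null placeholders.
vtable = [
    ("P123", "Apple 6s", "?1"),
    ("P124", "Note2",    "Samsung"),
    ("P125", "iPhone 7", "Apple"),
]

def plug_in(table, assignment):
    """Substitute labeled nulls; the result is one deterministic world."""
    return [tuple(assignment.get(cell, cell) for cell in row)
            for row in table]

world = plug_in(vtable, {"?1": "Apple"})
# world[0] == ("P123", "Apple 6s", "Apple")
```

Note that one assignment covers every occurrence of a given label, which is exactly why the assignment picks out a single possible world.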

As an aside, if you're following along with the Imielinski/Lipski paper, a table with just labeled nulls is called a V-Table. To make it a full C-Table, we're going to add one more thing: a condition column.

Condition Columns

The condition column is an annotation on each row of the table that tells you under what conditions the row is part of a possible world. For example, let's run the query SELECT * FROM products WHERE Brand = 'Apple' on the above example table. P124 is clearly not in the output, while P125 is clearly in the output. However, we don't really know whether P123 should be part of the output (at least not until we plug a value in for {{NULL1}}). We can capture this uncertainty through the condition column:

  Product   | Name     | Brand     | Condition
 -----------+----------+-----------+---------------------
  P123      | Apple 6s | {{NULL1}} | {{NULL1}} = 'Apple'
  P125      | iPhone 7 | Apple     | True

Remember that any assignment of values to labeled nulls defines a possible world (and a deterministic version of the table). So, if we plug in 'Microsoft', we get

  Product   | Name     | Brand     | Condition
 -----------+----------+-----------+---------------------
  P123      | Apple 6s | Microsoft | False
  P125      | iPhone 7 | Apple     | True

Note that the value being plugged in has to be the same for ALL occurrences of {{NULL1}} in the C-Table. You can't replace one with 'Microsoft' and the other with 'Apple'. You could however, have two different labeled nulls: {{NULL1}} and {{NULL2}}.

To finish converting the C-Table to a deterministic table, we remove any rows for which the condition is false and discard the condition column. In other words, the deterministic table represented by the assignment {{NULL1}} -> 'Microsoft' is:

  Product   | Name     | Brand
 -----------+----------+----------
  P125      | iPhone 7 | Apple   

Similarly, the assignment {{NULL1}} -> 'Apple' would produce the (deterministic) table:

  Product   | Name     | Brand
 -----------+----------+----------
  P123      | Apple 6s | Apple   
  P125      | iPhone 7 | Apple   
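The full C-Table-to-world conversion just described (substitute, evaluate conditions, drop false rows) can be sketched as follows. This is an illustrative sketch, with conditions modeled as Python predicates over the assignment rather than Mimir's actual Expression type:

```python
# Each row of a C-Table is paired with a condition over labeled nulls.
ctable = [
    (("P123", "Apple 6s", "?1"),    lambda a: a["?1"] == "Apple"),
    (("P125", "iPhone 7", "Apple"), lambda a: True),
]

def possible_world(table, assignment):
    # Keep only rows whose condition holds, then substitute the nulls.
    return [tuple(assignment.get(cell, cell) for cell in row)
            for row, cond in table if cond(assignment)]

microsoft_world = possible_world(ctable, {"?1": "Microsoft"})  # drops P123
apple_world = possible_world(ctable, {"?1": "Apple"})          # keeps it
```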

With a few caveats, it's possible to evaluate classical RA (Select, Project, Join, Union) with only slight query rewrites on a C-Table, and amazingly the result is closed. If the input can be represented as a C-Table, so can the output.

Generalized (aka Lazy) C-Tables

One of the big limitations of C-Tables (or technically V-Tables) is that they can't easily capture relationships between variables. Let's say you have the following table:

  Product   | Name     | RatingOutOfFive
 -----------+----------+-----------------
  P123      | Apple 6s | {{NULL1}}
  P124      | iPhone 7 | 4.5
  P125      | Note 2   | {{NULL2}}
  P126      | Surface  | 4.0

You want to add a few more utility columns for different types of data presentation, so you run the query:

SELECT *, 
   RatingOutOfFive / 5.0 AS FractionalRating, 
   RatingOutOfFive * 20 AS PctRating
FROM products

Now, we want to know what should go in the FractionalRating and PctRating columns for P123 and P125. In principle we could create some new labeled nulls:

  Product   | Name     | RatingOutOfFive | FractionalRating | PctRating
 -----------+----------+-----------------+------------------+-----------
  P123      | Apple 6s | {{NULL1}}       | {{NULL3}}        | {{NULL4}}
  P124      | iPhone 7 | 4.5             | 0.9              | 90
  P125      | Note 2   | {{NULL2}}       | {{NULL5}}        | {{NULL6}}
  P126      | Surface  | 4.0             | 0.8              | 80

However, now we lose the relationship between nulls in the same row. In the original Imielinski/Lipski C-Tables paper, the solution was to create something they called a global condition: a boolean expression that all labeled null assignments had to satisfy. In this example, we'd have a global condition of:

    {{NULL3}} = ({{NULL1}} / 5.0) AND {{NULL4}} = ({{NULL1}} * 20)
AND {{NULL5}} = ({{NULL2}} / 5.0) AND {{NULL6}} = ({{NULL2}} * 20)

This is messy because it forces projection to have side effects. Projection classically only cares about one tuple at a time, which makes it embarrassingly parallel, fast, and simple. Now, we're introducing a type of projection that creates new rules that the entire world has to follow. Instead, drawing on a predecessor of Mimir, we simply allow cells of a C-Table to contain arbitrary suspended computations. In other words, in a classical database, all cells have to be PrimitiveValues. In a generalized C-Table, a cell can be any Expression. Continuing the example query, we'd get:

  Product   | Name     | RatingOutOfFive | FractionalRating    | PctRating
 -----------+----------+-----------------+---------------------+--------------------
  P123      | Apple 6s | {{NULL1}}       | [ {{NULL1}} / 5.0 ] | [ {{NULL1}} * 20 ]
  P124      | iPhone 7 | 4.5             | 0.9                 | 90
  P125      | Note 2   | {{NULL2}}       | [ {{NULL2}} / 5.0 ] | [ {{NULL2}} * 20 ]
  P126      | Surface  | 4.0             | 0.8                 | 80

Expressions in square braces above are analogous to what PL calls a Future. We don't know what {{NULL1}} and {{NULL2}} are yet, but once we plug values in, we can reduce those columns to specific primitive values.
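A suspended cell behaves just like a future. The following sketch (illustrative; not Mimir's Expression machinery) models each cell as a closure over the labeled-null assignment, forced once values are plugged in:

```python
# A generalized C-Table row: cells are suspended computations over "?1".
row = {
    "RatingOutOfFive":  lambda a: a["?1"],            # {{NULL1}}
    "FractionalRating": lambda a: a["?1"] / 5.0,      # [ {{NULL1}} / 5.0 ]
    "PctRating":        lambda a: a["?1"] * 20,       # [ {{NULL1}} * 20 ]
}

def force(row, assignment):
    # Reduce every suspended cell to a primitive value.
    return {col: cell(assignment) for col, cell in row.items()}

forced = force(row, {"?1": 4.0})
# forced == {"RatingOutOfFive": 4.0, "FractionalRating": 0.8, "PctRating": 80.0}
```

The key property: both derived columns share the one closure variable, so the relationship between cells in a row is preserved without any global condition.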

If you're being careful, you might notice that this still doesn't say much about how the values are assigned, or how more abstract correlations are captured. We'll get back to this below when we discuss models.

Variable-Generating Relational Algebra

Let's now say a few words about how C-Tables are created, at least in the abstract. For more concrete examples and use cases, feel free to skip ahead and take a look at the discussion of Lenses below.

In principle, we could simply define a function that keeps allocating labeled nulls with fresh labels every time it's called. However, it's useful to have more control: We want to be able to create multiple copies of the same labeled null in different parts of a query.

So, we define a function, or more precisely a special AST node, called a Variable-Generating Term, or VGTerm (defined in CTables.scala). A VGTerm creates new labeled nulls a bit like a hash function: if you pass it the same parameters, you'll get back an identical labeled null. The technical term for this is a Skolem function.

For example, consider the table R:

 A | B
---+---
 1 | 1
 2 | 1
 3 | 2

If you run the (slightly oversimplified) query SELECT A, VGTerm(B) AS B FROM R then you'll get back:

 A | B
---+-----------
 1 | {{NULL1}}
 2 | {{NULL1}}
 3 | {{NULL2}}

In other words, the first two rows are constructed with literally identical labeled nulls, because they both passed the same argument to the VGTerm. Taken to an extreme, you can have the same label for every row. Running the query SELECT A, VGTerm() AS B FROM R produces a table with exactly one, identical labeled null throughout.

 A | B
---+-----------
 1 | {{NULL}}
 2 | {{NULL}}
 3 | {{NULL}}
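The Skolem-function behavior is easy to model: memoize labels by argument tuple. A sketch (illustrative only; Mimir's VGTerm is a Scala AST node, not a Python class):

```python
class VGTerm:
    """Mint labeled nulls deterministically from arguments."""
    def __init__(self):
        self._labels = {}

    def __call__(self, *args):
        # Same arguments -> literally the same label; new args -> fresh label.
        if args not in self._labels:
            self._labels[args] = "NULL{}".format(len(self._labels) + 1)
        return self._labels[args]

vg = VGTerm()
labels = [vg(b) for b in (1, 1, 2)]   # B values from the example table
# labels == ["NULL1", "NULL1", "NULL2"]

constant = VGTerm()
same_everywhere = {constant() for _ in range(3)}   # one label for all rows
```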

Mimir's ROWID

If you really do want to create unique labeled nulls for each and every row, you have two options. First, if there's an evident key (e.g., A in the example above), you can use that. Even if there isn't, Mimir provides an implicit key called ROWID. Note that unlike ROWID in classical relational databases, which is an implicit feature of a table, Mimir allows ROWID to be used anywhere in a query. For example, the following is a valid query in Mimir:

SELECT ROWID FROM R, S

A VGTerm's Static Identity

As it turns out, just having a set of arguments isn't completely sufficient for our purposes. We also need some static properties to distinguish different labeled nulls. Thus, a VGTerm is defined in terms of three parts: a Model, an Index, and a list of Arguments. The Model and Index form the static part of a VGTerm's identity, while the Arguments create a dynamic (runtime) component. We'll get back to discussing the Model and Index later, when we introduce Models. For now, it's simply the case that if you have two VGTerms with the same Model and Index, called on the same primitive values, you get back identical labeled nulls.

Data Dependence

One additional note is that we distinguish between Data-Independent VGTerms (VGTerms with no, or only constant-valued arguments), and Data-Dependent VGTerms (VGTerms that reference variables). The reason for this is that Data-Independent VGTerms are guaranteed to create exactly one labeled null for the entire query output. This allows us to do some nifty things that we'll talk about next.

Generating C-Tables

Earlier, we talked about a database D being non-deterministic and running a deterministic query over it. But where does D come from? VGTerms give us a way to create non-deterministic tables from deterministic data. As a result, we give Relational Algebra with VGTerms a new name: Variable-Generating Relational Algebra (or VG-RA). To create a new (C-)Table, we simply run a VG-RA query over an existing database table. See Lenses below for some more concrete examples of this being used in practice.

Virtual C-Tables

For a variety of reasons I'll discuss in a moment, Mimir never materializes a C-Table. Let me repeat that. C-Tables exist solely as a way to think about what Mimir does, and not as part of the implementation. Remember this, because everything in Mimir's query compilation and evaluation pipeline is about creating the illusion that you're working with C-Tables, without ever actually creating one. We call this illusion Virtual C-Tables.

Why doesn't Mimir actually use C-Tables?

  • Labeled Nulls Don't Exist: No major database vendor has support for labeled nulls. This means that we'd need to create an entirely new database engine, preventing users from leveraging their existing infrastructure and creating more, unnecessary bugs.
  • Approximations Aren't Enough: There are a lot of clever tricks people have played to work around the lack of labeled nulls, but the typical approach is to limit oneself to C-Tables where each labeled null has a discrete, finite set of possible values. This isn't always enough.
  • Aggregation is Hard: Generalized C-Tables can, in principle, encode the result of aggregate queries as a really really really big arithmetic expression. However, such expressions can't really be encoded or analyzed efficiently.
  • They're Expensive: C-Tables have a tendency to do unpleasant things like turning equi-joins into cartesian products. For example, in a join, any labeled null value joins with all values on the opposite side.
  • They're Unnecessary: As we discuss here, here, and in much more depth here, full C-Tables are often overkill. Depending on what the user wants, it may be possible to answer through some much simpler computations.

So what does Mimir do instead?

Mimir is perfectly happy to talk about queries that produce C-Tables, but will not actually try to evaluate such queries on its backend database. Instead, we evaluate a simpler query that accomplishes a specific task that would normally be performed on a C-Table.

Remember that we define C-Tables in terms of VG-RA queries. Those queries are never executed. Rather, given a specific task, we rewrite the query into a classical, deterministic relational algebra (D-RA) query that produces a deterministic answer to that specific query. Mimir is currently set up for three types of tasks:

  1. Create a new, non-deterministic table from existing data (ViewManager)
  2. Obtain a best guess of the query result (Compiler).
  3. Generate an explanation for exactly one cell (CTExplainer).

The rewritten query is then deployed to the backend database (db.backend, which is any subclass of Backend).

Creating New C-Tables

We never materialize a C-Table, so we can never do a SELECT INTO ... or INSERT INTO ... (SELECT ...) operation using VG-RA. All data stored in and manipulated by the backend database is deterministic.

Instead, C-Tables are exclusively stored as views (See ViewManager). Views are then re-assembled into full query trees within Mimir as a query is compiled (ResolveViews).

Although not available yet, we plan to eventually add support for materializing these views to improve performance.

Generating (Annotated) Best Guesses

One of the main goals of Mimir is to make uncertainty accessible, and a large part of this is avoiding overwhelming the user with too much information. The user's first interaction with uncertain data is through query results. We avoid overwhelming them by providing a very simple interface:

  • Labeled nulls are replaced with the system's best guess for what should go there
  • Values based on or derived from labeled nulls are clearly identified as such (by red highlighting). That is, when the user poses a query over uncertain data, the response we give them at first is the system's best guess at a response, with "guesses" highlighted in red.

We'll talk about guessing and Models in greater depth below, but for now, just assume that we have a way of getting a value assignment from the system (that is, a PrimitiveValue for each labeled null) that represents a "Best Guess", and use that to define the "Best" possible world.

In short, we have a VG-RA query, and an assignment of values to each labeled null in the resulting C-Table. There are three challenges that we need to tackle to produce the Best Guess result.

  1. How do we 'plug in' the best guess values into the query without materializing the C-Table first?
  2. How do we identify which cells/rows to highlight as uncertain?
  3. How do we make it possible for the user to get more detail?

Let's look at each challenge in turn.

Plugging in Best Guess Values

Let's say you have a VG-RA query (presented here in SQL-ish). This query, borrowed from our "Missing Value" lens, replaces values of B that are null with a fresh labeled null.

SELECT A, 
       CASE WHEN B IS NULL THEN VGTerm(ROWID) ELSE B END AS B 
FROM R

We'd like to be able to evaluate this query on the backend database, plugging in specific best guess values for each labeled null where necessary. We've considered four approaches for use in Mimir.

The conceptually simplest approach is to try to use a User-Defined Function and run something like the query

SELECT A, 
       CASE WHEN B IS NULL THEN GetBestGuessFor(ROWID) ELSE B END AS B 
FROM R

As of right now, this approach is supported only on SQLite, as it requires the UDF to call back into the Mimir application. This is unsupported in Oracle's SQL+ and pgSQL (as far as we know). This feature is enabled by default in the test cases, and can be enabled in the command line interface with -X INLINE-VG.

The second and third approaches both use a component of Mimir called the BestGuessCache. Before running a query (typically as part of view creation in LensManager), the best guess cache pre-computes and stores all relevant best guess values, indexed by the VGTerm's arguments (i.e., the label).

One way to use the best guess cache is to insert the lookup directly as a nested subquery.

SELECT A, 
       CASE WHEN B IS NULL THEN (SELECT GUESS FROM CACHE WHERE IDX=R.ROWID) ELSE B END AS B 
FROM R

This was initially implemented as a bit of a hack in RAToSQL, since Mimir's VG-RA doesn't (yet) support expression-nested subqueries. It might be nice to get it back at some point, but we found that our backend databases' optimizers weren't great at figuring out how to deal with joins if one side of the join predicate was a nested lookup query. That led us to our third approach: Left-Outer Joins:

SELECT A, 
       CASE WHEN B IS NULL THEN CACHE.GUESS ELSE B END AS B 
FROM R LEFT OUTER JOIN CACHE ON (R.ROWID = CACHE.IDX)

It's not ideal, but this is what happens in Mimir right now.
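The left-outer-join rewrite above can be reproduced end-to-end on SQLite. The following sketch uses a hypothetical schema (table and column names here are illustrative; RID stands in for Mimir's ROWID, which is a reserved name in SQLite):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE R (RID INTEGER, A INTEGER, B INTEGER);
  CREATE TABLE CACHE (IDX INTEGER, GUESS INTEGER);
  INSERT INTO R VALUES (1, 1, NULL), (2, 2, 4);
  INSERT INTO CACHE VALUES (1, 3);   -- precomputed best guess for row 1's B
""")

# Missing values of B fall through to the cached guess via the outer join.
rows = db.execute("""
  SELECT A, CASE WHEN B IS NULL THEN CACHE.GUESS ELSE B END AS B
  FROM R LEFT OUTER JOIN CACHE ON (R.RID = CACHE.IDX)
  ORDER BY A
""").fetchall()
# rows == [(1, 3), (2, 4)]
```

Rows with deterministic B values join against nothing (CACHE.GUESS is NULL there), but the CASE never reads the guess for those rows, so the result is unaffected.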

The last approach only works for some VGTerms. Specifically, if we have a data-independent VGTerm, like for example the following query used by our Schema Matching lens:

SELECT A,
       CASE VGTerm() WHEN 'B' THEN B 
                     WHEN 'C' THEN C 
                     ELSE NULL 
       END AS B
FROM R

The value of the VGTerm is not data-dependent and has the same value throughout the entire query. This means we can evaluate it as part of query compilation. In other words, if the best guess assignment to the single labeled null is 'B', then we can just evaluate the following query:

SELECT A,
       CASE 'B' WHEN 'B' THEN B 
                WHEN 'C' THEN C 
                ELSE NULL 
       END AS B
FROM R

or equivalently, just

SELECT A, B FROM R

This functionality for plugging in and simplifying data-independent VGTerms resides in InlineVGTerms.
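The compile-time collapse just described amounts to constant-folding the CASE once the guess is known. A sketch (the helper name is hypothetical, not Mimir's InlineVGTerms API):

```python
def inline_case(best_guess, branches, default=None):
    """Fold CASE <guess> WHEN m1 THEN c1 ... down to a single column.

    branches: list of (match_value, column_name) pairs.
    """
    for match, column in branches:
        if best_guess == match:
            return column
    return default

chosen = inline_case("B", [("B", "B"), ("C", "C")])
# chosen == "B", i.e. the query simplifies to SELECT A, B FROM R
```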

Uncertainty Annotations

Note: The components of Mimir discussed in this section are in flux. In our ongoing effort to merge Mimir and GProM, portions of the functionality discussed here will be migrated to use existing capabilities of GProM.

Plugging in best guesses is only part of the solution. We still need to be able to communicate to the front-end which cells and rows to highlight. That is, we need to be able to identify which cells have values that are, or are derived from labeled nulls, and which rows have a condition annotation based on a labeled null.

This happens in two components of Mimir. First, there is a function in CTPercolator called percolateLite. The main goal of this function is to add a set of new "annotation" columns that tell us which cells/rows are deterministic. Let's illustrate with an example. Start with some data, in a table we'll call R:

 A | B
---+------
 1 | null
 2 | 4

Now let's have a simple variable-generating query (the Missing Value query from before).

SELECT A, CASE WHEN B IS NULL THEN VGTerm(ROWID) ELSE B END AS B FROM R

Let's say that the guess for the first row is 3. In that case, when we run the best guess query, we'll get:

 A | B
---+----
 1 | 3
 2 | 4

But we also want to track which columns and rows are deterministic. The goal of percolateLite is to create a table that looks more like:

 A | B | A_IS_DET | B_IS_DET | ROW_IS_DET
---+---+----------+----------+------------
 1 | 3 |   YES    |    NO    |    YES
 2 | 4 |   YES    |   YES    |    YES

The added _IS_DET columns carry 1 bit of information: Whether or not each column/row in the original table is deterministic. If NO, the cell/row will be highlighted in red when it is displayed to the user.

To create the _IS_DET columns, percolateLite relies on a helper method in CTAnalyzer called compileDeterministic. This function analyzes an Expression to figure out under what conditions the expression is guaranteed to be deterministic. The basic rules are:

  • If the expression is a PrimitiveValue, it's always deterministic.
  • If the expression is a VGTerm, it's always non-deterministic.
  • If the expression is a Var, it's whatever the referenced variable is.
  • If the expression is anything other than the cases above (setting aside Conditionals and And/Or expressions, which get special treatment below), it's deterministic if all of its children are deterministic.

We special case Conditionals (i.e., If-Then-Else expressions), because they can create situations where the expression is non-deterministic only for certain rows.

  • When the If condition is non-deterministic, everything is non-deterministic
  • When the If condition and Then clause are deterministic, but the Else clause is non-deterministic, then the entire expression is deterministic on rows of data where the If condition is satisfied.
  • When the If condition and Else clause are deterministic, but the Then clause is non-deterministic (as in the missing value example), then the entire expression is deterministic on rows of data where the If condition is not satisfied.

And/Or expressions follow similar rules (e.g., X AND Y is deterministic when X is deterministically false)
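The rules above, including the per-row Conditional case, can be sketched over a tiny tuple-based expression encoding (illustrative only; Mimir's actual AST is the Scala Expression hierarchy):

```python
# Expression encoding: ("prim", v), ("vgterm",), ("var", name),
# ("if", cond, then, else), or ("op", child1, child2, ...).
def eval_expr(expr, row):
    if expr[0] == "prim":
        return expr[1]
    if expr[0] == "var":
        return eval_expr(row[expr[1]], row)
    raise ValueError("cannot evaluate " + expr[0])

def is_deterministic(expr, row):
    kind = expr[0]
    if kind == "prim":
        return True                   # primitives: always deterministic
    if kind == "vgterm":
        return False                  # labeled nulls: never
    if kind == "var":
        return is_deterministic(row[expr[1]], row)
    if kind == "if":
        _, cond, then, els = expr
        if not is_deterministic(cond, row):
            return False              # non-deterministic condition taints all
        taken = then if eval_expr(cond, row) else els
        return is_deterministic(taken, row)   # only the taken branch matters
    return all(is_deterministic(c, row) for c in expr[1:])

# Missing-value example: IF B IS NULL THEN VGTerm(...) ELSE B
null_row = ("if", ("prim", True),  ("vgterm",), ("var", "B"))
ok_row   = ("if", ("prim", False), ("vgterm",), ("var", "B"))
```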

Thus, percolateLite will create a query of the form

SELECT A, CASE WHEN B IS NULL THEN VGTerm(ROWID) ELSE B END AS B,
       'YES' AS A_IS_DET,
       CASE WHEN B IS NULL THEN 'NO' ELSE 'YES' END AS B_IS_DET,
       'YES' AS ROW_IS_DET
FROM R

As an optimization, percolateLite observes that A_IS_DET and ROW_IS_DET are independent of the data and returns a simpler query. This, combined with efforts to avert name collisions, is the source of most, if not all of the complexity in the function.

Before replacing VGTerms with their best guess lookup equivalents, Compiler will call percolateLite to extend the results. Then, after executing the query, but before returning the query results, it wraps the ResultSet in a special ResultSetIterator that hides these special annotation columns from its output schema and instead exposes them through a programmatic API (the deterministicCol() and deterministicRow() methods).

Provenance Tracking

Note: The components of Mimir discussed in this section are in flux. In our ongoing effort to merge Mimir and GProM, portions of the functionality discussed here will be migrated to use existing capabilities of GProM.

The last part of basic query execution is making it possible to get more information about uncertainty in a specific row or cell. To accomplish this, we will need to be able to repeatedly re-derive exactly one row of the data (i.e., the cell's row).

We accomplish this by adding another annotation column (technically several annotation columns) that uniquely identifies each output row. The code responsible for creating this annotation is in the Provenance object. Its compile method adds one or more columns to its input query that can be used to uniquely identify each row of the output.

  • A Table's implicit ROWID attribute uniquely identifies each row.
  • A Project or Select uses the unique identifier of its input query
  • A Join combines unique identifiers from both of its inputs
  • A Union adds another identifier column that identifies whether each row came from the Left- or Right-hand side of the Union.
  • An Aggregate uniquely identifies each output row with the group-by attributes. Incidentally, these identifier columns are what makes it possible to use ROWID anywhere in the query.
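The composition rules above can be sketched as token constructors (illustrative; Mimir's Provenance object adds actual identifier columns to the query rather than building tuples):

```python
def table_token(rowid):
    return (rowid,)                  # a table row's implicit ROWID

def join_token(left, right):
    return left + right              # a join concatenates both sides' ids

def union_token(side, token):
    return (side,) + token           # tag rows as coming from "L" or "R"

# One output row of (R JOIN S) UNION ... is identified unambiguously:
token = union_token("L", join_token(table_token("r1"), table_token("s2")))
# token == ("L", "r1", "s2")
```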

Provenance's counterpart, Tracer, can then reverse-engineer the query and combine it with a provenance token (represented by the RowIdPrimitive class) to reconstruct exactly the one row of data that we're asking for.

Explaining Uncertainty

After we present the user with a best guess query result, we want the user to be able to explore the uncertainty of the result in greater depth. The CTExplainer class provides support for generating more detailed summaries of uncertainty in a given cell or row. The key methods are explainCell, which takes a query, a provenance token, and a column name, and explainRow which only takes the first two.

To generate an explanation, CTExplainer essentially materializes exactly one row of a Generalized C-Table. Let's go through the process step-by-step.

  1. CTExplainer uses Tracer to transform the input query into one that computes exactly the one row for which we have a provenance token.

  2. CTExplainer uses rules best explained here to "percolate" all VGTerms into a single projection at the top of the query, leaving only a deterministic query below.

  3. CTExplainer uses the backend to evaluate everything except that final projection and then replaces all of the Vars with their values from the backend. The result is an Expression that is exactly the future that we'd get in a C-Table.

Using this future, and a capability that Models provide for generating sample values, CTExplainer generates several values, any of the following that are applicable:

  • A set of example values.
  • 95% confidence bounds. (for numerical values)
  • Expectation and Standard Deviation (for numerical values)
  • Approximate Probability (for boolean values)
  • A list of human-readable explanations for the sources of uncertainty affecting the result.

These values are returned encapsulated in one of several subclasses of Explanation, or as a JSON object.