Concepts - UBOdin/mimir GitHub Wiki

Mimir Conceptual Overview

The goal of this document is to provide a more conceptual overview of Mimir than simple code documentation can accomplish. Topics covered include Mimir's internal algebraic query representation, Mimir's C-Tables-based data model for ambiguous, incomplete, and probabilistic data, and the two main constructs in Mimir: Models and Lenses.

You may find it convenient to follow along with the documentation for the class mimir.Database. This class serves as the central exchange for everything that happens in Mimir. Different components of Mimir are modularized and farmed out to different sub-packages, but Database includes references to all of them and convenience methods for interacting with multiple components at once. Database also includes the two main methods for running queries in Mimir:

  • db.query(q): Compile, optimize, and run a query through the Mimir wrapper.

Below, when we refer to components defined in the Database class, we'll mention how they are referenced. By convention, the Database instance appears throughout the Mimir codebase under the name db, so, for example, the view manager would typically be referenced as db.views.

Table of Contents

Relational Algebra and Expressions

Internally, Mimir represents queries in an adapted form of Codd's Relational Algebra. This section gives an overview of the Abstract Syntax Trees (ASTs) used to represent queries and primitive-valued expressions.
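To make the idea of an algebraic AST concrete, here is a minimal sketch of how relational operators and primitive-valued expressions can be modeled as Scala case classes. Every name in it (AlgebraSketch, Expr, Operator, Select, Project, and so on) is hypothetical and chosen for illustration; it is not Mimir's actual class hierarchy.

```scala
// Illustrative sketch only: these are NOT Mimir's real classes.
object AlgebraSketch {
  // Primitive-valued expressions (hypothetical, heavily simplified)
  sealed trait Expr
  case class Var(name: String) extends Expr
  case class IntLit(value: Long) extends Expr
  case class Add(lhs: Expr, rhs: Expr) extends Expr
  case class Gt(lhs: Expr, rhs: Expr) extends Expr

  // Relational operators (hypothetical, heavily simplified)
  sealed trait Operator
  case class Table(name: String, columns: Seq[String]) extends Operator
  case class Select(condition: Expr, source: Operator) extends Operator
  case class Project(columns: Seq[(String, Expr)], source: Operator) extends Operator

  // Compute the output column names of a query by walking the tree.
  def schema(op: Operator): Seq[String] = op match {
    case Table(_, cols)   => cols
    case Select(_, src)   => schema(src)
    case Project(cols, _) => cols.map(_._1)
  }

  // AST for: SELECT A, A + B AS C FROM R WHERE B > 10
  val example: Operator =
    Project(
      Seq("A" -> Var("A"), "C" -> Add(Var("A"), Var("B"))),
      Select(Gt(Var("B"), IntLit(10)), Table("R", Seq("A", "B")))
    )
}
```

Representing queries as immutable trees like this is what makes rewrites and optimizations straightforward to express as pattern matches, as the schema function does on a small scale.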

Database Programming

Mimir's internal code needs to interact with the database in a number of ways, from querying the backend to managing internal Mimir state. This section outlines the tools that Mimir includes for doing so.

SchemaProviders

Tables accessible to Mimir are defined through an abstraction called SchemaProvider. This section discusses querying the SystemCatalog and defining new SchemaProviders.

Editing the Parser

It may occasionally be necessary to edit the parser to add new SQL commands. This section provides a high-level overview of the FastParse library and discusses common gotchas that arise when adding new commands to the parser.

Unified Statistics Tools

Several components of Mimir need the same kinds of statistics or summary structures for tables in the backend database. The functionality needed to assemble these can be found in the mimir.statistics package, including:

  • An (approximate) Functional Dependency graph for a de-normalized table.
  • Detecting columns that are likely to represent sequential identifiers for data (i.e., that are likely to define an ordering over the table).
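As a rough illustration of what an approximate functional dependency means, the sketch below scores a candidate dependency lhs → rhs by the fraction of rows that agree with the most common rhs value in their lhs group. The names FdSketch, fdStrength, and approxFds, the in-memory row representation, and the threshold-based scoring rule are all assumptions made for this example; they do not describe the algorithm or API of the mimir.statistics package.

```scala
// Illustrative sketch only: NOT the mimir.statistics implementation.
object FdSketch {
  type Row = Map[String, String]

  // Fraction of rows that agree with the most common `rhs` value
  // among the rows sharing their `lhs` value. 1.0 means an exact FD.
  def fdStrength(rows: Seq[Row], lhs: String, rhs: String): Double =
    if (rows.isEmpty) 1.0
    else {
      val agreeing = rows
        .groupBy(_(lhs))
        .values
        .map { group => group.groupBy(_(rhs)).values.map(_.size).max }
        .sum
      agreeing.toDouble / rows.size
    }

  // Edges (a, b) of an approximate FD graph: a -> b holds on at least
  // `threshold` of the rows.
  def approxFds(rows: Seq[Row], columns: Seq[String], threshold: Double): Seq[(String, String)] =
    for {
      a <- columns
      b <- columns
      if a != b && fdStrength(rows, a, b) >= threshold
    } yield (a, b)

  // Sample data with one misspelled city value.
  val sample: Seq[Row] = Seq(
    Map("zip" -> "14260", "city" -> "Buffalo"),
    Map("zip" -> "14260", "city" -> "Buffalo"),
    Map("zip" -> "14260", "city" -> "Bufalo"),
    Map("zip" -> "10001", "city" -> "New York")
  )
}
```

On the sample data, zip → city holds for three of the four rows, so it appears in the graph at a threshold of 0.7 but not at 0.9; tolerating a few violations like this is what makes the dependency graph useful on dirty, de-normalized tables.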

C-Tables and Incomplete Databases

To capture ambiguity and uncertainty in data, Mimir uses an encoding strategy called Virtual C-Tables. This section begins by introducing principles of incomplete databases, starting with the high-level conceptual Possible Worlds Semantics, before introducing successively more refined and practical representations (V-Tables, C-Tables, and Virtual C-Tables).
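The relationship between C-Tables and possible worlds can be sketched in a few lines. In the example below, a cell is either a constant or a named variable, each row carries a local condition over the variables, and fixing a value for every variable (a valuation) yields one possible world. The names CTableSketch, CRow, VarRef, and world are hypothetical; the sketch illustrates the general C-Tables idea and is not Mimir's Virtual C-Tables encoding.

```scala
// Illustrative sketch only: NOT Mimir's Virtual C-Tables encoding.
object CTableSketch {
  // An assignment of a concrete value to every variable.
  type Valuation = Map[String, Int]

  // A cell is either a known constant or a variable (a labeled null).
  sealed trait Value
  case class Const(v: Int) extends Value
  case class VarRef(name: String) extends Value

  // A row exists only in the worlds where its condition holds.
  case class CRow(values: Seq[Value], condition: Valuation => Boolean)

  // Instantiate the table under one valuation to get one possible world.
  def world(table: Seq[CRow], valuation: Valuation): Seq[Seq[Int]] =
    table.filter(_.condition(valuation)).map { row =>
      row.values.map {
        case Const(v)  => v
        case VarRef(x) => valuation(x)
      }
    }

  // Row (1, x) is always present; row (2, 5) is present only if x > 3.
  val example: Seq[CRow] = Seq(
    CRow(Seq(Const(1), VarRef("x")), _ => true),
    CRow(Seq(Const(2), Const(5)), v => v("x") > 3)
  )
}
```

Under the valuation x = 4 the example table contains both rows; under x = 2 only the first row survives. The set of all such instantiations is the set of possible worlds the table represents.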

Wrapping ML Tools

This section brings everything together, introducing the two key components of Mimir: (1) Models, wrappers around existing ML tools, frameworks, and techniques, and (2) Lenses, structural wrappers that allow Models to dictate how data should be transformed.
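One way to picture the split between Models and Lenses is shown below: a model answers "what is the best guess for this value?", and a lens decides where in the data those guesses are applied. The trait Model, the class MeanModel, and the function missingValueLens are hypothetical simplifications written for this overview, with a column mean standing in for a real ML tool; they are not Mimir's actual interfaces.

```scala
// Illustrative sketch only: NOT Mimir's Model or Lens interfaces.
object LensSketch {
  // A row maps column names to values; None marks a missing value.
  type Row = Map[String, Option[Double]]

  // A model wraps some estimation technique behind a uniform interface.
  trait Model {
    def bestGuess(row: Row): Double
  }

  // Stand-in for an ML tool: predicts the mean of the observed values.
  class MeanModel(column: String, training: Seq[Row]) extends Model {
    private val observed: Seq[Double] = training.flatMap(_.get(column).flatten)
    private val mean: Double =
      if (observed.isEmpty) 0.0 else observed.sum / observed.size
    def bestGuess(row: Row): Double = mean
  }

  // A lens is the structural part: it rewrites the data, deferring to
  // the model wherever a value in `column` is missing.
  def missingValueLens(column: String, model: Model)(rows: Seq[Row]): Seq[Row] =
    rows.map { row =>
      row.get(column).flatten match {
        case Some(_) => row
        case None    => row.updated(column, Some(model.bestGuess(row)))
      }
    }
}
```

Because the lens only depends on the Model trait, the mean-based stand-in could be swapped for a wrapper around any other estimator without changing the structural transformation.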