Concepts - UBOdin/mimir GitHub Wiki
Mimir Conceptual Overview
The goal of this document is to provide a more conceptual overview of Mimir than simple code documentation can accomplish. Topics covered include Mimir's internal algebraic query representation, Mimir's C-Tables-based data model for ambiguous, incomplete, and probabilistic data, and the two main constructs in Mimir: Models and Lenses.
You may find it convenient to follow along with the documentation for the class mimir.Database. This class serves as the central exchange for everything that happens in Mimir. Different components of Mimir are modularized and farmed out to different sub-packages, but Database includes references to all of them and convenience methods for interacting with multiple components at once. Database also includes the two main methods for running queries in Mimir:
db.query(q): Compile, optimize, and run a query through the Mimir wrapper.
Below, when we refer to components defined in the Database class, we'll mention how they are referenced. By convention, the Database class appears throughout the Mimir codebase with the name db, so, for example, the view manager would typically be referenced as db.views.
Table of Contents
Relational Algebra and Expressions
Internally, Mimir represents queries in an adapted form of Codd's Relational Algebra. This section gives an overview of the Abstract Syntax Trees (ASTs) used to represent queries and primitive-valued expressions.
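To make the idea of an algebraic AST concrete, here is a minimal, self-contained sketch in Scala. Every name in it (AlgebraSketch, IntValue, GreaterThan, run, and so on) is chosen for illustration and is an assumption; Mimir's actual operator and expression classes are richer and may be named and structured differently.

```scala
// Illustrative only: a toy relational-algebra AST, not Mimir's real classes.
object AlgebraSketch {
  // Primitive-valued expressions
  sealed trait Expression
  sealed trait PrimitiveValue extends Expression
  case class IntValue(v: Long) extends PrimitiveValue
  case class BoolValue(v: Boolean) extends PrimitiveValue
  case class Var(name: String) extends Expression
  case class GreaterThan(lhs: Expression, rhs: Expression) extends Expression

  // Relational operators; each node wraps its source operator
  sealed trait Operator
  case class Table(name: String) extends Operator
  case class Select(condition: Expression, source: Operator) extends Operator
  case class Project(columns: Seq[String], source: Operator) extends Operator

  type Row = Map[String, PrimitiveValue]

  // Evaluate an expression against a single row
  def eval(e: Expression, row: Row): PrimitiveValue = e match {
    case p: PrimitiveValue => p
    case Var(name)         => row(name)
    case GreaterThan(l, r) =>
      (eval(l, row), eval(r, row)) match {
        case (IntValue(a), IntValue(b)) => BoolValue(a > b)
        case (a, b) => sys.error(s"Cannot compare $a and $b")
      }
  }

  // Interpret an operator tree over in-memory tables
  def run(op: Operator, tables: Map[String, Seq[Row]]): Seq[Row] = op match {
    case Table(name)        => tables(name)
    case Select(cond, src)  => run(src, tables).filter(row => eval(cond, row) == BoolValue(true))
    case Project(cols, src) => run(src, tables).map(row => cols.map(c => c -> row(c)).toMap)
  }

  def main(args: Array[String]): Unit = {
    val tables: Map[String, Seq[Row]] = Map(
      "R" -> Seq(
        Map("A" -> IntValue(1), "B" -> IntValue(10)),
        Map("A" -> IntValue(5), "B" -> IntValue(20))
      )
    )
    // Roughly: SELECT B FROM R WHERE A > 2
    val query = Project(Seq("B"), Select(GreaterThan(Var("A"), IntValue(2)), Table("R")))
    run(query, tables).foreach(println)
  }
}
```

Representing queries as immutable case classes means an optimizer can be written as pattern-matching rewrites over the tree, which is one common reason to prefer an AST over raw SQL strings.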
Database Programming
Mimir internal code needs to be able to interact with the database in a number of ways, from querying the backend, to managing internal Mimir state. This section outlines the tools that Mimir includes for doing so.
SchemaProviders
Tables accessible to Mimir are defined through an abstraction called SchemaProvider. This section discusses querying the SystemCatalog and defining new SchemaProviders.
Editing the Parser
It may occasionally be necessary to edit the parser to add new SQL commands. This section provides a high-level overview of the FastParse library and discusses common gotchas that arise when adding new commands to the parser.
Unified Statistics Tools
A number of components of Mimir need certain kinds of statistics or summary structures for tables in the backend database. The functionality needed to assemble these can be found in the mimir.statistics package, including:
- An (approximate) Functional Dependency graph for a de-normalized table.
- Detecting columns that are likely to represent sequential identifiers for data (i.e., that are likely to define an ordering over the table).
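The two kinds of statistics listed above can be sketched with simple heuristics. The code below is an assumption-laden illustration, not the algorithm used in mimir.statistics: StatisticsSketch, fdStrength, and looksSequential are hypothetical names, and both measures are deliberately naive.

```scala
// Illustrative only: naive stand-ins for the statistics described above.
object StatisticsSketch {
  type Row = Map[String, String]

  // Approximate strength of the dependency lhs -> rhs: the fraction of rows
  // that survive if every lhs value is mapped to its most common rhs value.
  // 1.0 means the dependency holds exactly on this data.
  def fdStrength(rows: Seq[Row], lhs: String, rhs: String): Double =
    if (rows.isEmpty) 1.0
    else {
      val consistent = rows
        .groupBy(row => row(lhs))
        .values
        .map(group => group.groupBy(row => row(rhs)).values.map(_.size).max)
        .sum
      consistent.toDouble / rows.size
    }

  // Heuristic: a column looks like a sequential identifier if every value
  // parses as an integer, values are distinct, and they fill a contiguous range.
  def looksSequential(rows: Seq[Row], column: String): Boolean = {
    val parsed = rows.map(row => scala.util.Try(row(column).toLong).toOption)
    if (parsed.isEmpty || parsed.exists(_.isEmpty)) false
    else {
      val values = parsed.flatten
      values.distinct.size == values.size &&
        (values.max - values.min + 1 == values.size)
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map("id" -> "1", "zip" -> "14260", "city" -> "Buffalo"),
      Map("id" -> "2", "zip" -> "14260", "city" -> "Buffalo"),
      Map("id" -> "3", "zip" -> "14260", "city" -> "Amherst"),
      Map("id" -> "4", "zip" -> "10001", "city" -> "New York")
    )
    println(fdStrength(rows, "zip", "city"))
    println(looksSequential(rows, "id"))
  }
}
```

An approximate functional dependency graph could then be assembled by keeping an edge for every column pair whose strength exceeds some threshold; the threshold itself would be a tuning choice.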
C-Tables and Incomplete Databases
To capture ambiguity and uncertainty in data, Mimir uses an encoding strategy called Virtual C-Tables. This section begins by introducing principles of incomplete databases, starting with the high-level conceptual Possible Worlds Semantics, before introducing successively more refined and practical representations (V-Tables, C-Tables, and Virtual C-Tables).
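As a rough illustration of how these representations relate, the sketch below models a C-Table as rows whose cells may contain variables and whose existence depends on a condition, then enumerates the possible worlds it encodes. All names (CTableSketch, CRow, worlds, instantiate) are hypothetical, and the integer-only domain is a simplifying assumption; this is not how Mimir stores Virtual C-Tables.

```scala
// Illustrative only: a toy C-Table and its possible worlds.
object CTableSketch {
  // One value per variable picks out one possible world
  type Assignment = Map[String, Int]

  sealed trait Cell
  case class Const(value: Int) extends Cell
  case class Variable(name: String) extends Cell

  // A C-Table row: cells that may reference variables, plus a local
  // condition deciding whether the row exists in a given world.
  case class CRow(cells: Seq[Cell], condition: Assignment => Boolean)

  // Enumerate every assignment of variables to values from their domains
  def worlds(domains: Map[String, Seq[Int]]): Seq[Assignment] =
    domains.foldLeft(Seq(Map.empty[String, Int])) {
      case (partial, (name, values)) =>
        for (w <- partial; v <- values) yield w + (name -> v)
    }

  // Produce the ordinary, deterministic table for one world
  def instantiate(table: Seq[CRow], world: Assignment): Seq[Seq[Int]] =
    table.filter(_.condition(world)).map(_.cells.map {
      case Const(v)       => v
      case Variable(name) => world(name)
    })

  def main(args: Array[String]): Unit = {
    val table = Seq(
      CRow(Seq(Const(10), Variable("x")), _ => true),    // always present, second cell unknown
      CRow(Seq(Const(20), Const(0)), w => w("x") == 2)   // present only when x = 2
    )
    for (w <- worlds(Map("x" -> Seq(1, 2))))
      println(s"$w -> ${instantiate(table, w)}")
  }
}
```

In this toy encoding, a V-Table corresponds to the special case where every row's condition is always true, so uncertainty lives only in the cell values.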
Wrapping ML Tools
This section brings everything together, introducing the two key components of Mimir: (1) Models, wrappers around existing ML tools, frameworks, and techniques, and (2) Lenses, structural wrappers that allow Models to dictate how data should be transformed.
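The division of labor between the two can be sketched as follows. This is a minimal illustration under stated assumptions: the Model trait, MeanModel, and missingValueLens are hypothetical names invented here, and a column mean stands in for a real ML tool. Mimir's actual Model and Lens interfaces are not reproduced.

```scala
// Illustrative only: a toy Model and a toy Lens that uses it.
object LensSketch {
  // None marks a missing value
  type Row = Map[String, Option[Double]]

  // A Model wraps some estimation technique and answers
  // "what should this value be?" for a given row.
  trait Model {
    def bestGuess(row: Row): Double
  }

  // Trivial stand-in for an ML tool: always predict the mean of the
  // observed values in the training data (0.0 if nothing was observed).
  class MeanModel(column: String, training: Seq[Row]) extends Model {
    private val observed = training.flatMap(row => row.get(column).flatten)
    private val mean = if (observed.isEmpty) 0.0 else observed.sum / observed.size
    def bestGuess(row: Row): Double = mean
  }

  // A Lens is the structural part: it decides where the model's output is
  // applied. Here, missing values in one column are replaced by guesses,
  // and each output row carries a flag saying whether it was guessed.
  def missingValueLens(column: String, model: Model)(rows: Seq[Row]): Seq[(Row, Boolean)] =
    rows.map { row =>
      row.get(column).flatten match {
        case Some(_) => (row, false)
        case None    => (row + (column -> Some(model.bestGuess(row))), true)
      }
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Map("temp" -> Some(10.0)),
      Map("temp" -> Some(20.0)),
      Map("temp" -> None)
    )
    val model = new MeanModel("temp", rows)
    missingValueLens("temp", model)(rows).foreach(println)
  }
}
```

Keeping the flag alongside each row mirrors the general idea that guessed values should remain distinguishable from known ones; swapping MeanModel for another Model would change the guesses without touching the lens.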