
Interfacing With ML Tools: Models and Lenses

In the previous section we talked about incomplete databases: databases that contain placeholders. Now let's look at how those placeholders get filled in. There are two main parts. First, a Model abstractly defines how to "plug in" values for one or more placeholders. Second, Mimir provides a set of task-specific rules called Lenses, each of which instantiates both a C-Table View and the Model(s) necessary to plug in values for it.

Models

Parts of a Model

A model defines a set of rules for how to plug values into a given expression. This may seem abstract, and it is. To maximize generality, Mimir requires only 5 basic pieces of functionality from a Model for each placeholder created:

Make a "best-effort" guess about the value that should be plugged in (bestGuess(...)).
Generate a repeatable sample of a value that could be plugged in (sample(...)).
Provide a human-readable explanation about the underlying decision that the model is making and why it chose the way it did (reason(...)).
Provide type information about the model's output (getType(...)).
Support Java serialization via Serializable or provide a serializer and de-serializer.
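As a rough illustration of these five requirements, a model interface might look like the Scala sketch below. The names SketchModel and ConstantModel, the parameter lists, and the use of plain strings for types are all assumptions made for this example; the actual signatures are defined in Mimir's source.

```scala
import java.util.Random

object ModelSketch {
  // Hypothetical stand-in for Mimir's Model; all signatures are assumptions.
  trait SketchModel extends Serializable {
    def name: String
    def bestGuess(idx: Int, args: Seq[Any]): Any
    def sample(idx: Int, randomness: Random, args: Seq[Any]): Any
    def reason(idx: Int, args: Seq[Any]): String
    def getType(idx: Int, argTypes: Seq[String]): String
  }

  // A toy model that guesses a fixed default for every placeholder.
  class ConstantModel(val name: String, default: Double) extends SketchModel {
    def bestGuess(idx: Int, args: Seq[Any]): Any = default
    // Samples are the default plus noise drawn from the supplied PRNG.
    def sample(idx: Int, randomness: Random, args: Seq[Any]): Any =
      default + randomness.nextGaussian()
    def reason(idx: Int, args: Seq[Any]): String =
      s"I assumed the default value $default for (${args.mkString(", ")})"
    def getType(idx: Int, argTypes: Seq[String]): String = "float"
  }
}
```

Because the toy model draws its noise only from the PRNG it is handed, two calls with identically seeded generators return the same sample.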

A single model may define values for multiple labeled nulls, which we sometimes call variables (hence Variable-Generating Terms, or VGTerms). This is useful for (at least) two reasons. First, especially in cases involving correlated variables, it may be helpful to have a single Model object that defines multiple, entirely different types of variable. Second, it may be necessary to define different placeholder variables for each row of data (e.g., when repairing specific cells of data).

The idx (index) parameter addresses the first concern. Throughout Mimir you'll see variables (or rather, categories of variable) identified by a 2-tuple (Model, Int). The main advantage of using idx as a variable selector is that it's static (i.e., available at compile-time). This has a few implications:

Because it's available at compile-time, variables distinguished by idx can have different types.
The idx field is ignored when seeding the random generator for sampling, making it possible to use idx to differentiate between different correlated random variables.

Note that a common case is a Model that defines only one (category of) variable; a SingleVarModel wrapper class is available for this case.

The args (arguments) parameter addresses the second concern, allowing an arbitrary number of variables to be instantiated at runtime. The details of these arguments are up to the model: there may be any number of them, and they're usually given as Expressions to be evaluated at runtime (see VGTerm). This has one main implication:

Because the argument values are not available until runtime, compile-time operations (e.g., getType) receive only the arguments' types rather than their actual values.
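The split between idx and args can be sketched as follows. This is a toy under stated assumptions: the object IdxArgsSketch, its lookup tables, and the convention that args(0) is a row identifier are all hypothetical, and types are represented as plain strings.

```scala
object IdxArgsSketch {
  // Hypothetical model with two categories of variable, selected by idx:
  //   idx 0: a repaired integer cell, idx 1: a repaired string cell.
  // args(0) is assumed to be a row identifier, known only at runtime.
  val knownInts: Map[Long, Int] = Map(1L -> 42)
  val knownStrings: Map[Long, String] = Map(1L -> "alice")

  // Compile-time: only idx is needed, so each category can have its own type.
  def getType(idx: Int): String = idx match {
    case 0 => "int"
    case 1 => "string"
    case _ => throw new IllegalArgumentException(s"unknown idx $idx")
  }

  // Runtime: the evaluated args pick the specific variable within a category.
  def bestGuess(idx: Int, args: Seq[Any]): Any = {
    val rowId = args.head.asInstanceOf[Long]
    idx match {
      case 0 => knownInts.getOrElse(rowId, 0)
      case 1 => knownStrings.getOrElse(rowId, "")
      case _ => throw new IllegalArgumentException(s"unknown idx $idx")
    }
  }
}
```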

Model Names

The name field of a Model is particularly important, as it is used to distinguish between multiple occurrences of the same model. The ModelManager, discussed below, relies on it as an identifier. Be careful not to assign the same name to multiple distinct models. Bad things will happen.

Generating Repeatable Samples

The Model.sample(...) method deserves a little more discussion, as implementing it correctly can be subtle. In general, Mimir expects the sample method to be deterministic. Let me repeat that, because it's really important. If you call Model.sample twice with exactly the same arguments, you must get back the same value.

To pull this off, while still allowing for stochasticity, Mimir provides a pre-seeded PRNG via the randomness parameter. The seeding process incorporates:

A global seed value (i.e., the current possible world)
A hash of the model's name
A hash of the args

Note: The seeding process does not incorporate the idx parameter, making it possible to control the correlation between variables. In general, if two categories of variable are not correlated, they should be defined via separate Model objects with independent names.
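The seeding scheme described above can be sketched roughly as follows. The mixing formula, the object name SeedSketch, and the anti-correlated second variable are assumptions chosen for illustration; Mimir's actual seeding code may combine the ingredients differently.

```scala
import java.util.Random

object SeedSketch {
  // Combine the three seed ingredients: world seed, model name, args.
  // The mixing formula is an assumption, not Mimir's actual one.
  def seededRandom(worldSeed: Long, modelName: String, args: Seq[Any]): Random = {
    val seed = (worldSeed * 31L + modelName.hashCode) * 31L + args.hashCode
    new Random(seed)
  }

  // idx selects which correlated variable to return, but never affects the seed.
  def sample(worldSeed: Long, modelName: String, idx: Int, args: Seq[Any]): Double = {
    val randomness = seededRandom(worldSeed, modelName, args)
    val shared = randomness.nextGaussian()
    if (idx == 0) shared else -shared // idx 1 is perfectly anti-correlated with idx 0
  }
}
```

Since both values of idx see the same random stream, the model decides how strongly its variables are correlated. Two models with different names get different seeds and therefore independent streams.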

There are a number of good resources for how to translate an arbitrary source of randomness into any sort of random expression you're interested in. I recommend a book called Numerical Recipes.
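To give a flavor of such translations, the sketch below turns uniform draws from a PRNG into an exponentially distributed value (via inverse-transform sampling) and into a weighted categorical choice. These are standard textbook techniques; the object and method names are hypothetical and not part of Mimir.

```scala
import java.util.Random

object TransformSketch {
  // Inverse-transform sampling: Exponential(rate) from one uniform draw.
  // nextDouble() lies in [0, 1), so the logarithm's argument lies in (0, 1].
  def exponential(randomness: Random, rate: Double): Double =
    -math.log(1.0 - randomness.nextDouble()) / rate

  // Pick an index with probability proportional to its (non-negative) weight.
  def categorical(randomness: Random, weights: Seq[Double]): Int = {
    val target = randomness.nextDouble() * weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val i = cumulative.indexWhere(target < _)
    if (i >= 0) i else weights.length - 1 // guard against rounding error
  }
}
```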

Traits

Additional functionality in a Model may be exposed through traits. None of these are implemented or used yet, but some ideas include:

FiniteDomain: The model has a finite set of allowable values.
HardBounds: The model has hard upper and/or lower bounds.
Confidence: The model can assess how reliable its guesses are, given the available data.

Basic Models

Mimir defines a few models for common cases in BasicModels.scala:

SingleVarModel: A utility wrapper that ignores the idx field.
NoOpModel: A model that does nothing except pass through its (one) argument. Generally only really useful to attach a "note" to a value, as the reason will continue to show up.
UniformDistribution: Primarily used for testing purposes, this model generates a random value in the range 0.0 to 1.0.

Persistence

The ModelManager class is responsible for persisting models between runs of a database. Models are stored as Base64 CLOBs in the MIMIR_MODELS table of the backend database. The main methods of ModelManager are self-explanatory:

persistModel(...) stores a model in the backend database (using Model.name as an identifier)
dropModel(...) removes a model from the backend database (equivalent to DELETE FROM MIMIR_MODELS WHERE name=?)
getModel(...) recovers a model from the backend database (or the cached instance if available)
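The storage format can be illustrated with a small round trip through Java serialization and Base64. This is only a sketch: the in-memory map stands in for the MIMIR_MODELS table, ToyModel is a hypothetical model class, and the method bodies are assumptions rather than ModelManager's real implementation.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64

object PersistSketch {
  // Hypothetical model; Scala case classes are Serializable by default.
  case class ToyModel(name: String, guess: Double)

  // In-memory stand-in for the MIMIR_MODELS table: name -> Base64 text.
  private val table = scala.collection.mutable.Map[String, String]()

  def persistModel(model: ToyModel): Unit = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(model)
    out.close()
    table(model.name) = Base64.getEncoder.encodeToString(bytes.toByteArray)
  }

  def getModel(name: String): Option[ToyModel] =
    table.get(name).map { encoded =>
      val in = new ObjectInputStream(
        new ByteArrayInputStream(Base64.getDecoder.decode(encoded)))
      try in.readObject().asInstanceOf[ToyModel] finally in.close()
    }

  def dropModel(name: String): Unit = { table.remove(name); () }
}
```

Keying the table by model name is what makes unique names essential: in this sketch, persisting a second model under an existing name silently overwrites the first.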

Reference Counting

A simple form of reference-counted garbage collection is also available. Models may be associated with an owner (identified by a string) using either the extended form of persistModel(...) or with associateOwner(...). The disassociateOwner(...) method removes one such association, while dropOwner(...) removes all associations for one owner. If the last association for a model is dropped, the model itself is dropped as well. Associations are stored in the MIMIR_MODEL_OWNERS table in the backend database.
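The ownership rules can be sketched with two in-memory sets standing in for the MIMIR_MODELS and MIMIR_MODEL_OWNERS tables. The method names follow the text above, but their signatures and bodies are assumptions for illustration.

```scala
import scala.collection.mutable

object OwnerSketch {
  // In-memory stand-ins for MIMIR_MODELS and MIMIR_MODEL_OWNERS.
  val models = mutable.Set[String]()
  val owners = mutable.Set[(String, String)]() // (model name, owner)

  def associateOwner(model: String, owner: String): Unit = {
    models += model
    owners += ((model, owner))
  }

  def disassociateOwner(model: String, owner: String): Unit = {
    owners -= ((model, owner))
    // Drop the model itself once no owner refers to it.
    if (!owners.exists(_._1 == model)) models -= model
  }

  // Remove every association held by one owner.
  def dropOwner(owner: String): Unit =
    owners.filter(_._2 == owner).toList.foreach {
      case (m, o) => disassociateOwner(m, o)
    }
}
```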

Serialization

By default, ModelManager uses Java serialization (i.e., Serializable) to encode models for storage on the backend. However, this doesn't work for all objects. Some objects, like Weka classifiers, have custom serialization code. Others, like Database, simply can't or shouldn't be serialized. There are two ways around this: @transient and writing a custom serializer.

The first trick is to mark fields of the model as @transient. These fields will be ignored by Java's serializer. If your model implements the NeedsReconnectToDatabase trait, the deserializer will pass you a pointer to the Database object when the model is woken up from serialization. If there are fields that you need to serialize manually, you can do so by hijacking the serialize method in Model and reconnectToDatabase in NeedsReconnectToDatabase, and using them to encode/decode the data. An example of doing this can be found in WekaModel.scala.
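A minimal sketch of the @transient pattern follows. FakeDatabase and Reconnectable are hypothetical stand-ins for Mimir's Database and NeedsReconnectToDatabase, and the reconnectToDatabase signature is assumed; only the @transient behavior itself is standard Java serialization.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientSketch {
  // Stand-in for Mimir's Database, which must not be serialized.
  class FakeDatabase(val label: String)

  // Mirrors the idea of NeedsReconnectToDatabase; the signature is assumed.
  trait Reconnectable { def reconnectToDatabase(db: FakeDatabase): Unit }

  class CellModel(val name: String) extends Serializable with Reconnectable {
    @transient var db: FakeDatabase = null // skipped by Java serialization
    def reconnectToDatabase(d: FakeDatabase): Unit = { db = d }
  }

  // Serialize and immediately deserialize, as a persistence layer would.
  def roundTrip(m: CellModel): CellModel = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(m)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[CellModel] finally in.close()
  }
}
```

After the round trip the transient field is null, so the model is unusable until reconnectToDatabase is called again.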

The second approach is to write a custom serializer. The serialize() method of Model can be overridden. It is expected to return two things: an Array[Byte] (what Java calls byte[]) containing the serialized model, and the name of a de-serializer. De-serializers are identified by a string and are currently hard-coded in ModelManager.scala.
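The shape of a custom serializer might look like the sketch below. ThresholdModel, the "THRESHOLD_V1" identifier, the pipe-separated encoding, and the decoders map are all hypothetical; in Mimir the de-serializers live in ModelManager.scala.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object CustomSerializerSketch {
  class ThresholdModel(val name: String, val threshold: Double) {
    // Returns the encoded model plus the name of the de-serializer to use.
    // The toy encoding assumes the name contains no '|' character.
    def serialize(): (Array[Byte], String) =
      (s"$name|$threshold".getBytes(UTF_8), "THRESHOLD_V1")
  }

  // Hypothetical registry keyed by de-serializer name.
  val decoders: Map[String, Array[Byte] => ThresholdModel] = Map(
    "THRESHOLD_V1" -> { (bytes: Array[Byte]) =>
      val parts = new String(bytes, UTF_8).split('|')
      new ThresholdModel(parts(0), parts(1).toDouble)
    }
  )

  def deserialize(bytes: Array[Byte], decoder: String): ThresholdModel =
    decoders(decoder)(bytes)
}
```

Storing the de-serializer name next to the bytes lets the storage layer decode a model without knowing its class in advance.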

Existing Models

The set of models supported by Mimir is growing. Some basic example models include:

WekaModel: An example of an Imputation model, using the Weka library's classifiers to predict missing values. The WekaModel needs to be provided a query and a column to train on.
EditDistanceMatch: An example of a Schema Matching model. It selects the pair of columns to associate based on the string edit distance between the column names, computed using Apache Lucene.
TypeInference: Uses majority voting to assign a type to one column of data.
DefaultMetaModel: An example of a "Meta Model" that serves as a selector between several different types of models. See LensUtils.scala for more details.
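Two of the ideas above, majority voting over types and edit-distance matching of column names, can be sketched in plain Scala. This is not the code Mimir uses: the type-guessing rules are simplified assumptions, and the edit distance is a hand-written Levenshtein implementation rather than Apache Lucene's.

```scala
import scala.util.Try

object ExistingModelSketch {
  // Guess a type for one value (simplified rules, assumed for illustration).
  def guessType(value: String): String =
    if (Try(value.trim.toLong).isSuccess) "int"
    else if (Try(value.trim.toDouble).isSuccess) "float"
    else "string"

  // Majority vote over the per-value guesses; expects a non-empty column.
  def inferType(column: Seq[String]): String =
    column.map(guessType).groupBy(identity).maxBy(_._2.size)._1

  // Levenshtein edit distance, computed one row at a time.
  def editDistance(a: String, b: String): Int = {
    var prev = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Pick the candidate column whose name is closest to the target name.
  def bestMatch(target: String, candidates: Seq[String]): String =
    candidates.minBy(c => editDistance(target.toLowerCase, c.toLowerCase))
}
```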

Lenses

Mimir presently does not expose Models directly to the user. Rather, Mimir provides users with higher-level primitives that accomplish specific goals like repairing (imputing) missing values or creating the UNION of two tables with different schemas. Concretely, these primitives, called Lenses, define a procedure for constructing: (1) a VG-RA query that transforms the data in the desired way, and (2) a set of (already trained) models for filling in placeholders in the query.

The most common way for a Lens to be instantiated is with a CREATE LENS statement:

CREATE LENS name
AS select_query
WITH lens_type(arg1, arg2, ...)

CREATE LENS behaves like CREATE VIEW. The user defines a query in the AS clause and a single transformation using the WITH clause. The statement essentially defines a view based on the non-deterministically transformed data.

Defining/Instantiating Lenses

Note that unlike Models, there is no explicit Lens class to inherit from. Instead, types of Lenses are defined by constructors with the signature:

(Database,String,Operator,List[Expression]) => (Operator,List[Model])

Apart from a pointer to the database itself, these arguments are drawn directly from the CREATE LENS statement (name, select_query, argN). The return value is the view query (as an Operator) and the models that the operator relies on (as a List[Model]).

The class LensManager, defined in LensManager.scala, handles CREATE LENS statements. It looks up and invokes the appropriate Lens constructor, creates a view around the returned operator, and persists all of the returned models (associating them with the Lens as an owner). Please note once again that LensManager handles persistence, and that the constructors should not try to interact directly with ModelManager or ViewManager.

As of right now, lens categories are hard-coded into LensManager.scala. To define a new lens, define an appropriate constructor method and add it to the lensTypes variable.
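A lens constructor and its registration might look roughly like this. The classes Database, Operator, Expression, and Model here are empty stand-ins for Mimir's own, toyLens and "TOY_LENS" are hypothetical, and the shape of lensTypes as a Map is an assumption; only the constructor signature is taken from the text above.

```scala
object LensSketch {
  // Hypothetical stand-ins for Mimir's Database, Operator, Expression, Model.
  class Database
  case class Operator(description: String)
  case class Expression(text: String)
  case class Model(name: String)

  // The constructor signature given above.
  type LensConstructor =
    (Database, String, Operator, List[Expression]) => (Operator, List[Model])

  // A toy lens: wraps the query and creates one model per argument.
  // It returns the models without persisting them; that is the manager's job.
  val toyLens: LensConstructor = (db, name, query, args) => {
    val models = args.map(arg => Model(s"$name:${arg.text}"))
    (Operator(s"REPAIR(${query.description})"), models)
  }

  // Registry in the spirit of lensTypes, keyed by the lens type name.
  val lensTypes: Map[String, LensConstructor] = Map("TOY_LENS" -> toyLens)
}
```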

Meta-Model Lenses and the Model Registry

Some lenses admit a variety of different approaches to solving the problem they're trying to attack. For example, schema matching can occur through edit-distance, by matching histograms, or in a variety of other ways. Concretely, lenses including MISSING_VALUE (MissingValueLens.scala) and SCHEMA_MATCHING (SchemaMatchingLens.scala) use what we call a Meta Model.

The idea is to define multiple, context-driven models for value repair. In any given situation, not all of those models are going to be useful. Each of those lenses looks through the set of candidates and creates a new model to choose between them. As of right now, the DefaultMetaModel isn't particularly intelligent about how it does this (it picks the first model on the list), but there's space for defining a more comprehensive approach.
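The first-on-the-list selection strategy can be sketched as follows. Candidate and FirstChoiceMetaModel are hypothetical names, and representing a candidate model as a name plus a best-guess function is an assumption made to keep the example small.

```scala
object MetaModelSketch {
  // Hypothetical candidate: a name plus a best-guess function.
  case class Candidate(name: String, bestGuess: Seq[Any] => Any)

  // Chooses among the candidates. Matching the behavior described above,
  // it simply picks the first one on the list.
  class FirstChoiceMetaModel(candidates: List[Candidate]) {
    require(candidates.nonEmpty, "need at least one candidate model")
    def choose(): Candidate = candidates.head
    def bestGuess(args: Seq[Any]): Any = choose().bestGuess(args)
  }
}
```

A smarter strategy would only need to replace choose(), for example by ranking candidates on some measure of their reliability.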