Concepts Models - UBOdin/mimir GitHub Wiki
Interfacing With ML Tools: Models and Lenses
In the previous section we talked about incomplete databases: databases that contain placeholders. Now let's look at how those placeholders get filled in. There are two main parts. First, a Model abstractly defines how to "plug in" values for one or more placeholders. Second, Mimir provides a set of task-specific rules called Lenses that instantiate both a C-Table View and the Model(s) necessary to plug in values for it.
Models
Parts of a Model
A model defines a set of rules for how to plug values into a given expression. This may seem abstract, and it is. To maximize generality, Mimir requires only 5 basic pieces of functionality from a Model for each placeholder created:
- Make a "best-effort" guess about the value that should be plugged in (`bestGuess(...)`).
- Generate a repeatable sample of a value that could be plugged in (`sample(...)`).
- Provide a human-readable explanation about the underlying decision that the model is making and why it chose the way it did (`reason(...)`).
- Provide type information about the model's output (`getType(...)`).
- Support Java serialization via Serializable or provide a serializer and de-serializer.
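To make the list above concrete, here is a minimal sketch of what such an interface could look like. The class names (`SketchModel`, `UniformSketch`, `SketchType`) and the exact parameter lists are assumptions made for illustration; Mimir's real `Model` class may well differ.

```scala
import scala.util.Random

// Hypothetical, simplified stand-in for Mimir's type system.
sealed trait SketchType
case object TFloat extends SketchType

// The five pieces of functionality: four methods plus Java serialization.
// These signatures are assumptions, not Mimir's actual ones.
abstract class SketchModel(val name: String) extends java.io.Serializable {
  def bestGuess(idx: Int, args: List[Any]): Any
  def sample(idx: Int, randomness: Random, args: List[Any]): Any
  def reason(idx: Int, args: List[Any]): String
  def getType(idx: Int, argTypes: List[SketchType]): SketchType
}

// A toy model: always guess 0.5, and sample uniformly from [0.0, 1.0).
class UniformSketch(modelName: String) extends SketchModel(modelName) {
  def bestGuess(idx: Int, args: List[Any]): Any = 0.5
  def sample(idx: Int, randomness: Random, args: List[Any]): Any =
    randomness.nextDouble() // all randomness comes from the supplied PRNG
  def reason(idx: Int, args: List[Any]): String =
    s"I picked a value uniformly at random for $name"
  def getType(idx: Int, argTypes: List[SketchType]): SketchType = TFloat
}

object ModelSketchDemo {
  def main(cmdArgs: Array[String]): Unit = {
    val m = new UniformSketch("DEMO")
    println(m.bestGuess(0, Nil))
    println(m.reason(0, Nil))
  }
}
```

Note that `sample` never creates its own random generator. That is what lets the caller make it repeatable, as discussed further down.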
A single model may define values for multiple labeled nulls, which we sometimes call variables (hence Variable-Generating Terms, or VGTerms). This is useful for (at least) two reasons. First, especially in cases involving correlated variables, it may be helpful to have a single Model object that defines multiple, entirely different types of variable. Second, it may be necessary to define different placeholder variables for each row of data (e.g., when repairing specific cells of data).
The `idx`, or index parameter addresses the first concern. Throughout Mimir you'll see variables (or rather categories of variable) identified by a 2-tuple `(Model, Int)`. The main advantage of using `idx` as a variable selector is that it's static (i.e., available at compile-time). This has a few implications:
- Because it's available at compile-time, variables distinguished by `idx` can have different types.
- The `idx` field is ignored when seeding the random generator for sampling, making it possible to use `idx` to differentiate between different correlated random variables.
A common case is a Model that defines only one category of variable. A SingleVarModel wrapper class is available for this case.
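The role of the index can be sketched as follows. Everything here (`IndexedModel`, `PersonModel`, `SingleVarSketch`, `ConstantModel`, and the simplified type objects) is hypothetical and only meant to show the idea of one model serving several differently typed categories of variable, plus a wrapper that throws the index away.

```scala
// Hypothetical, simplified types; not Mimir's actual classes.
sealed trait SketchType
case object TInt extends SketchType
case object TString extends SketchType

trait IndexedModel {
  def name: String
  def getType(idx: Int): SketchType
  def bestGuess(idx: Int, args: List[Any]): Any
}

// One model, two categories of variable: idx 0 is an age, idx 1 is a city.
// Because idx is known statically, each category can have its own type.
// Any other idx is a programming error and fails with a MatchError.
class PersonModel(val name: String) extends IndexedModel {
  def getType(idx: Int): SketchType = idx match {
    case 0 => TInt
    case 1 => TString
  }
  def bestGuess(idx: Int, args: List[Any]): Any = idx match {
    case 0 => 30
    case 1 => "Buffalo"
  }
}

// A single-variable model does not care about idx, so a wrapper can
// accept the index and ignore it.
abstract class SingleVarSketch(val name: String) extends IndexedModel {
  def getType(): SketchType
  def bestGuess(args: List[Any]): Any
  final def getType(idx: Int): SketchType = getType()
  final def bestGuess(idx: Int, args: List[Any]): Any = bestGuess(args)
}

class ConstantModel(modelName: String) extends SingleVarSketch(modelName) {
  def getType(): SketchType = TInt
  def bestGuess(args: List[Any]): Any = 0
}
```

With these definitions, a category of variable is just a pair such as `(new PersonModel("PERSON"), 1)`.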
The `args`, or arguments parameter addresses the second concern, allowing for instantiation of an arbitrary number of variables at runtime. The details of these arguments are up to the model: there may be any number of them, and they're usually given as Expressions to be evaluated at runtime (see VGTerm). This also has implications:
- Because the arguments are not available until runtime, compile-time operations (i.e., `getType`) will only get type information rather than actual values.
Model Names
The `name` field of a Model is used to distinguish between multiple occurrences of the same model. This matters most for ModelManager, discussed below, but it bears repeating: be careful not to assign the same name to multiple distinct models. The name identifies a model in the backend database and is hashed into the seed used for sampling, so distinct models that share a name can be mistaken for one another.
Generating Repeatable Samples
The `Model.sample(...)` method needs to be discussed a little more, as implementing it correctly can be subtle. In general, Mimir expects the sample method to be deterministic. Let me repeat that, because it's really important. If you call `Model.sample` twice with exactly the same arguments, you must get back the same value.
To pull this off, while still allowing for stochasticity, Mimir provides a pre-seeded PRNG via the `randomness` parameter. The seeding process incorporates:
- A global seed value (i.e., the current possible world)
- A hash of the model's name
- A hash of the `args`
Note: The seeding process does not incorporate the `idx` parameter, making it possible to control the correlation between variables. In general, if two categories of variable are not correlated, they should be defined via separate Model objects with independent names.
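The seeding scheme can be sketched roughly like this. The object name `SeedSketch`, its method names, and in particular the way the three ingredients are mixed together are assumptions for illustration; the real mixing function in Mimir may be different.

```scala
import scala.util.Random

object SeedSketch {
  // Combine the three ingredients into one seed. The XOR-and-shift mixing
  // used here is an assumption, chosen only to keep the example short.
  def seedFor(worldSeed: Long, modelName: String, args: List[Any]): Long =
    worldSeed ^ (modelName.hashCode.toLong << 32) ^ args.hashCode.toLong

  // idx is accepted but deliberately left out of the seed.
  def randomnessFor(worldSeed: Long, modelName: String, idx: Int,
                    args: List[Any]): Random =
    new Random(seedFor(worldSeed, modelName, args))

  // A deterministic sample: every random choice is drawn from the PRNG
  // that the caller hands in.
  def sample(randomness: Random): Double = randomness.nextDouble()

  def main(cmdArgs: Array[String]): Unit = {
    val a = sample(randomnessFor(1L, "M", 0, List("row1")))
    val b = sample(randomnessFor(1L, "M", 0, List("row1")))
    println(a == b) // prints true: same world, name, and args
  }
}
```

Because `idx` does not feed the seed, two categories of variable in the same model see the same random stream for the same `args`, which is what allows a model to produce correlated values for them.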
There are a number of good resources for how to translate an arbitrary source of randomness into any sort of random expression you're interested in. I recommend a book called Numerical Recipes.
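As one small illustration of turning a uniform source of randomness into something else, the sketch below maps a uniform draw onto a weighted list of options by walking the cumulative weights. The object `WeightedPick` and its methods are hypothetical and not part of Mimir.

```scala
import scala.util.Random

object WeightedPick {
  // Map a uniform draw u in [0, 1) onto a weighted list of options.
  def pick[A](u: Double, options: List[(A, Double)]): A = {
    require(options.nonEmpty, "need at least one option")
    val weights = options.map(_._2)
    val target = u * weights.sum
    // Running totals of the weights, one upper bound per option.
    val upperBounds = weights.scanLeft(0.0)(_ + _).tail
    options.zip(upperBounds)
      .collectFirst { case ((value, _), upper) if target < upper => value }
      .getOrElse(options.last._1) // guards against floating-point round-off
  }

  // Draw u from the PRNG supplied by the caller, then reuse pick.
  def sample[A](randomness: Random, options: List[(A, Double)]): A =
    pick(randomness.nextDouble(), options)
}
```

Keeping `pick` a pure function of `u` makes it easy to test, and `sample` stays deterministic as long as the PRNG it is given was seeded deterministically.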
Traits
Additional functionality in a Model may be exposed through traits. These are not yet implemented or used, but some ideas include:
- FiniteDomain: The model has a finite set of allowable values.
- HardBounds: The model has hard upper and/or lower bounds.
- Confidence: The model can assess how reliable its output is, given the available data.
Basic Models
Mimir defines a few models for common cases in BasicModels.scala:
- SingleVarModel: A utility wrapper that ignores the `idx` field.
- NoOpModel: A model that does nothing except pass through its (one) argument. Generally only really useful to attach a "note" to a value, as the reason will continue to show up.
- UniformDistribution: Primarily used for testing purposes, this model generates a random value in the range 0.0 to 1.0.
Persistence
The ModelManager class is responsible for persisting models between runs of a database. Models are stored as Base64 CLOBs in the `MIMIR_MODELS` table of the backend database. The main methods of ModelManager are self-explanatory:
- `persistModel(...)` stores a model in the backend database (using `Model.name` as an identifier)
- `dropModel(...)` removes a model from the backend database (equivalent to `DELETE FROM MIMIR_MODELS WHERE name=?`)
- `getModel(...)` recovers a model from the backend database (or the cached instance if available)
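The storage format can be pictured with the following sketch, which Java-serializes an object and Base64-encodes the bytes. `StoredModel` and `Base64Sketch` are hypothetical, and an in-memory map stands in for the `MIMIR_MODELS` table; the real ModelManager talks to the backend database instead.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
                ObjectInputStream, ObjectOutputStream}
import java.util.Base64
import scala.collection.mutable

// Hypothetical stand-in for a Mimir model.
class StoredModel(val name: String, val guess: Double) extends java.io.Serializable

object Base64Sketch {
  // An in-memory map standing in for the MIMIR_MODELS table.
  private val table = mutable.Map[String, String]()

  // Java-serialize the model, then Base64-encode the bytes into text
  // that would fit in a CLOB column.
  def encode(model: StoredModel): String = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(model)
    out.close()
    Base64.getEncoder.encodeToString(buffer.toByteArray)
  }

  def decode(clob: String): StoredModel = {
    val in = new ObjectInputStream(
      new ByteArrayInputStream(Base64.getDecoder.decode(clob)))
    try in.readObject().asInstanceOf[StoredModel] finally in.close()
  }

  def persistModel(m: StoredModel): Unit = { table(m.name) = encode(m) }
  def getModel(name: String): Option[StoredModel] = table.get(name).map(decode)
  def dropModel(name: String): Unit = { table.remove(name); () }
}
```

Since the model's name is the key, persisting a second model under the same name would overwrite the first in this sketch.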
Reference Counting
A simple form of reference-counted garbage collection is also available. Models may be associated with an owner (identified by a string) using either the extended form of `persistModel(...)` or with `associateOwner(...)`. The `disassociateOwner(...)` method removes one such association, while `dropOwner(...)` removes all associations for one owner. If the last association for a model is dropped, the model itself is dropped as well. Associations are stored in the `MIMIR_MODEL_OWNERS` table in the backend database.
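The bookkeeping described above can be sketched with two in-memory sets standing in for the two backend tables. The object `OwnerSketch`, the `isStored` helper, and the use of plain strings for models are simplifications for illustration, not the real ModelManager.

```scala
import scala.collection.mutable

object OwnerSketch {
  private val models = mutable.Set[String]()           // stands in for MIMIR_MODELS
  private val owners = mutable.Set[(String, String)]() // (model, owner) pairs

  // The "extended form": store the model and record its first owner.
  def persistModel(model: String, owner: String): Unit = {
    models += model
    associateOwner(model, owner)
  }

  def associateOwner(model: String, owner: String): Unit = {
    owners += ((model, owner))
  }

  def disassociateOwner(model: String, owner: String): Unit = {
    owners -= ((model, owner))
    // Drop the model once no association refers to it any more.
    if (!owners.exists(_._1 == model)) models -= model
  }

  // Remove every association held by one owner, one at a time.
  def dropOwner(owner: String): Unit =
    owners.filter(_._2 == owner).map(_._1).toList
      .foreach(disassociateOwner(_, owner))

  def isStored(model: String): Boolean = models.contains(model)
}
```

For example, a model owned by two lenses survives the first `disassociateOwner` call and disappears only when the second owner lets go of it.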
Serialization
By default, ModelManager uses Java serialization (i.e., Serializable) to encode models for storage on the backend. However, this doesn't work for all objects. Some objects, like Weka classifiers, have custom serialization code. Others, like `Database`, simply can't or shouldn't be serialized. There are two ways around this: `@transient` and writing a custom serializer.
The first trick is to mark fields of the model as `@transient`. These fields will be ignored by Java's serializer. If your model implements the `NeedsReconnectToDatabase` trait, the deserializer will pass you a pointer to the `Database` object when the model is woken up from serialization. If there are fields that you need to serialize manually, you can do so by hijacking the `serialize` method in `Model` and `reconnectToDatabase` in `NeedsReconnectToDatabase`, and using them to encode/decode the data. An example of doing this can be found in WekaModel.scala.
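Here is a small sketch of the `@transient` trick in isolation. `FakeDatabase`, `ReconnectSketch`, `DbBackedModel`, and `TransientSketch` are hypothetical stand-ins; in particular, the reconnect hook is a simplified guess at the shape of `NeedsReconnectToDatabase`, not its actual definition.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
                ObjectInputStream, ObjectOutputStream}

// A database handle that is not Serializable, and a simplified reconnect hook.
class FakeDatabase(val label: String)
trait ReconnectSketch { def reconnectToDatabase(db: FakeDatabase): Unit }

class DbBackedModel(val name: String)
    extends java.io.Serializable with ReconnectSketch {
  // Skipped by Java serialization; it comes back as null after decoding.
  @transient var db: FakeDatabase = null
  def reconnectToDatabase(newDb: FakeDatabase): Unit = { db = newDb }
}

object TransientSketch {
  // Serialize to bytes and read the object straight back.
  def roundTrip(m: DbBackedModel): DbBackedModel = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(m)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[DbBackedModel] finally in.close()
  }
}
```

After a round trip the `name` survives but `db` is `null` until `reconnectToDatabase` is called, which is why the deserializer needs to hand the model a fresh database pointer.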
The second approach is to write a custom serializer. The `serialize()` method of `Model` can be overridden. It is expected to return two things: an `Array[Byte]` (what Java calls `byte[]`) containing the serialized model, and the name of a de-serializer. De-serializers are currently hard-coded in ModelManager.scala and are identified by a string.
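The pairing of serialized bytes with a de-serializer name could look something like the sketch below. `ThresholdModel`, `DeserializerSketch`, the text encoding, and the key `"THRESHOLD_V1"` are all made up for this example; only the idea of returning bytes plus a string that selects the decoder comes from the description above.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical model that encodes itself as plain text instead of relying
// on Java serialization. Names containing '|' are not supported here.
class ThresholdModel(val name: String, val threshold: Double) {
  // Returns the encoded bytes plus the name of the decoder that reads them.
  def serialize(): (Array[Byte], String) =
    (s"$name|$threshold".getBytes(UTF_8), "THRESHOLD_V1")
}

object DeserializerSketch {
  def decodeThreshold(bytes: Array[Byte]): ThresholdModel = {
    val Array(name, threshold) = new String(bytes, UTF_8).split('|')
    new ThresholdModel(name, threshold.toDouble)
  }

  // A string-keyed table of decoders, in the spirit of a hard-coded list.
  val decoders: Map[String, Array[Byte] => ThresholdModel] = Map(
    "THRESHOLD_V1" -> ((bytes: Array[Byte]) => decodeThreshold(bytes))
  )

  def deserialize(data: Array[Byte], decoder: String): ThresholdModel =
    decoders(decoder)(data)
}
```

Storing the decoder name next to the bytes means the encoding can change later by adding a new key while old rows remain readable.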
Existing Models
The set of models supported by Mimir is growing. Some basic example models include:
- WekaModel: An example of an Imputation model, using the Weka library's classifiers to predict missing values. The WekaModel needs to be provided a query and a column to train on.
- EditDistanceMatch: An example of a Schema Matching model, selecting the pair of columns to associate based on the string edit distance between the column names, computed using Apache Lucene.
- TypeInference: Uses majority voting to assign a type to one column of data.
- DefaultMetaModel: An example of a "Meta Model" that serves as a selector between several different types of models. See LensUtils.scala for more details.
Lenses
Mimir presently does not expose Models directly to the user. Rather, Mimir provides users with higher-level primitives that accomplish specific goals like repairing (imputing) missing values or creating the UNION of two tables with different schemas. Concretely, these primitives, called Lenses, define a procedure for constructing: (1) a VG-RA query that transforms the data in the desired way, and (2) a set of (already trained) models for filling in placeholders in the query.
The most common way for a Lens to be instantiated is with a `CREATE LENS` statement:

    CREATE LENS name
      AS select_query
      WITH lens_type(arg1, arg2, ...)

`CREATE LENS` behaves like `CREATE VIEW`. The user defines a query in the `AS` clause and a single transformation using the `WITH` clause. The statement essentially defines a view based on the non-deterministically transformed data.
Defining/Instantiating Lenses
Note that unlike Models, there is no explicit Lens class to inherit from. Instead, types of Lenses are defined by constructors with the signature:

    (Database, String, Operator, List[Expression]) => (Operator, List[Model])

Apart from a pointer to the database itself, these arguments are drawn directly from the `CREATE LENS` statement (`name`, `select_query`, `argN`). The return value is the view query (as an `Operator`) and the models that the operator relies on (as a `List[Model]`).
The class LensManager, defined in LensManager.scala, handles `CREATE LENS` statements. It looks up and invokes the appropriate Lens constructor, creates a view around the returned operator, and persists all of the returned models (associating them with the Lens as an owner). Please note once again that LensManager handles persistence and that the constructors should not try to interact directly with ModelManager or ViewManager.
As of right now, lens categories are hard-coded into LensManager.scala. To define a new lens, define an appropriate constructor method and add it to the `lensTypes` variable.
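A lens constructor and its registration might be sketched as follows. The empty `Database`, `Operator`, `Expression`, and `Model` definitions are placeholders for the real Mimir classes, and `passThroughLens`, `Table`, `Project`, `NamedModel`, and the key `"PASS_THROUGH"` are invented for this example. Treating `lensTypes` as a string-keyed map is also an assumption.

```scala
// Placeholders for the Mimir classes named in the signature.
class Database
trait Operator
trait Expression
trait Model { def name: String }

// Invented operators and model, just enough to build a return value.
case class Table(name: String) extends Operator
case class Project(source: Operator, note: String) extends Operator
case class NamedModel(name: String) extends Model

object LensSketch {
  type LensConstructor =
    (Database, String, Operator, List[Expression]) => (Operator, List[Model])

  // A do-nothing lens: wrap the query and return one model named after the lens.
  // Note that it only builds values; it does not persist anything itself.
  def passThroughLens(db: Database, name: String, query: Operator,
                      args: List[Expression]): (Operator, List[Model]) =
    (Project(query, s"lens $name"), List(NamedModel(name + ":PASS")))

  // The lookup table that a new lens type would be added to.
  val lensTypes: Map[String, LensConstructor] = Map(
    "PASS_THROUGH" -> (passThroughLens _)
  )
}
```

Looking up `LensSketch.lensTypes("PASS_THROUGH")` and applying it yields the view operator and the list of models, which the caller would then be responsible for persisting.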
Meta-Model Lenses and the Model Registry
Some lenses admit a variety of different approaches to solving the problem they're trying to attack. For example, schema matching can occur through edit-distance, by matching histograms, or in a variety of other ways. Concretely, lenses including `MISSING_VALUE` (MissingValueLens.scala) and `SCHEMA_MATCHING` (SchemaMatchingLens.scala) use what we call a Meta Model.
The idea is to define multiple, context-driven models for value repair. In any given situation, not all of those models are going to be useful. Each of those lenses looks through the set of candidates and creates a new model to choose between them. As of right now, the DefaultMetaModel isn't particularly intelligent about how it does this (it picks the first model on the list), but there's space for defining a more comprehensive approach.
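The "pick the first model on the list" behavior is simple enough to sketch in a few lines. `MetaModelSketch` is hypothetical and represents candidate models by name only; it is not the actual DefaultMetaModel code.

```scala
// Hypothetical sketch of a selector that chooses among candidate models.
object MetaModelSketch {
  // Given candidate model names in priority order, pick the first one.
  def bestGuess(candidates: List[String]): String = {
    require(candidates.nonEmpty, "a meta-model needs at least one candidate")
    candidates.head
  }

  // Explain the choice in human-readable form.
  def reason(candidates: List[String]): String =
    s"I defaulted to '${bestGuess(candidates)}', " +
      s"the first of ${candidates.size} candidate model(s)"
}
```

A smarter selector would replace `candidates.head` with some scoring of the candidates against the data, which is the space for improvement mentioned above.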