Concepts SchemaProvider - UBOdin/mimir GitHub Wiki
Multiple components of Mimir can define tables that can be queried in Mimir. As of this writing the complete list includes:
- mimir.data.LoadedTables: Defines all tables instantiated through the LOAD command.
- mimir.data.SparkSchemaProvider: Defines all tables present in the Hive/Derby store of the attached Spark cluster.
- mimir.views.ViewManager: Defines views instantiated through CREATE VIEW and CREATE LENS.
- mimir.data.SystemCatalog: Defines two tables, TABLES and ATTRIBUTES.
- mimir.adaptive.AdaptiveSchemaManager: Defines all tables defined by adaptive schemas (actually provides a list of AdaptiveSchemas wrapped by mimir.adaptive.AdaptiveSchemaProvider).
Accessing SchemaProviders
SchemaProviders are generally managed by mimir.data.SystemCatalog, which provides convenience methods for accessing tables, normalizing case-sensitive table names, querying schemas, and so forth. Basic access methods include:
tableOperator(table: ID|Name): Operator
tableOperator(provider: ID|Name, table: ID|Name): Operator
tableExists(table: ID|Name): Boolean
tableSchema(table: ID|Name): Seq[(ID, Type)]
provider(providerName: ID|Name): (ID, SchemaProvider)
These methods can be used with mimir.algebra.ID (always case-sensitive) or sparsity.Name (optionally case-sensitive if quoted = true).
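The difference between the two identifier kinds can be sketched with a toy model. The ID and Name classes below are stand-ins written for this example, not the real mimir.algebra.ID or sparsity.Name, and the resolution rule (exact match when quoted, case-insensitive match otherwise) is an assumption about what "optionally case-sensitive" means rather than a description of SystemCatalog's actual normalization logic.

```scala
// Stand-in types for illustration only; NOT Mimir's or sparsity's real classes.
final case class ID(id: String)                               // always case-sensitive
final case class Name(name: String, quoted: Boolean = false)  // case-sensitive only when quoted

object NameResolution {
  // Resolve a Name against the known table IDs (assumed rule):
  // quoted names must match exactly, unquoted names match ignoring case.
  def resolve(name: Name, tables: Seq[ID]): Option[ID] =
    if (name.quoted) tables.find(_.id == name.name)
    else tables.find(_.id.equalsIgnoreCase(name.name))

  // An ID is looked up verbatim, with no case folding.
  def resolve(id: ID, tables: Seq[ID]): Option[ID] =
    tables.find(_ == id)
}
```

Under this model, an unquoted Name("SALES") would find a table stored as ID("Sales"), while ID("sales") or a quoted Name("SALES", quoted = true) would not.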
MaterializedTableProvider
Some SchemaProviders can also be used to persist tables for later use (e.g., for View Materialization or CREATE TABLE AS SELECT). As of this writing, that includes SparkSchemaProvider and LoadedTables. MaterializedTableProviders are SchemaProviders that also implement the optional MaterializedTableProvider trait. Generally, one MaterializedTableProvider is marked as "preferred" (SparkSchemaProvider if available and LoadedTables otherwise). This provider can be obtained through SystemCatalog's materializedTableProvider(): (SchemaProvider with MaterializedTableProvider) method.
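The "preferred provider" idea can be illustrated with a small self-contained sketch. The traits and classes below are simplified stand-ins defined only for this example (the real traits have more members), and picking the first persisting provider from a priority-ordered list is an assumption about how the preference might be expressed, not SystemCatalog's actual implementation.

```scala
// Stand-in traits for illustration only; not Mimir's real definitions.
trait SchemaProvider { def listTables(): Seq[String] }
trait MaterializedTableProvider // marks providers that can persist tables

// Hypothetical providers used to exercise the selection logic.
class ReadOnlyProvider extends SchemaProvider {
  def listTables(): Seq[String] = Seq("TABLES", "ATTRIBUTES")
}
class PersistingProvider(val label: String)
    extends SchemaProvider with MaterializedTableProvider {
  def listTables(): Seq[String] = Seq.empty
}

object PreferredProvider {
  // Return the first candidate, in priority order, that can persist tables.
  def preferred(candidates: Seq[SchemaProvider])
      : Option[SchemaProvider with MaterializedTableProvider] =
    candidates.collectFirst {
      case p: SchemaProvider with MaterializedTableProvider => p
    }
}
```

Listing a Spark-like provider ahead of a LoadedTables-like one would reproduce the "SparkSchemaProvider if available and LoadedTables otherwise" behavior described above.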
Defining a SchemaProvider
The SchemaProvider trait provides an abstraction for components that need to define queryable tables. A SchemaProvider must define three methods. The first two provide basic information:
- def listTables(): Seq[ID] lists all tables provided by the SchemaProvider.
- def tableSchema(table: ID): Option[Seq[(ID, Type)]] provides the typed attributes of each column in the specified table, or None if the specified table does not exist. As with other IDs, table names are case sensitive.
The remaining method instantiates the table. Sub-traits of SchemaProvider provide different ways of instantiating the table.
- ViewSchemaProvider defines tables as views. The view(table: ID): Operator method should expand the view into a table.
- DataFrameSchemaProvider defines tables as Spark DataFrames. The dataframe(table: ID): DataFrame method should return the DataFrame for the table.
- LogicalPlanSchemaProvider defines tables as Spark Logical Plans. The logicalplan(table: ID): LogicalPlan method should return the logical plan for the table. Note that the LogicalPlan's output is expected to include a column named ROWID that contains a unique, stable identifier for each row. RowIndexPlan is a convenience operator that attaches this identifier.
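As a rough illustration of the three-method contract, the sketch below implements a view-style provider over an in-memory map. Everything here is a simplified stand-in: ID, Type, Operator, and both traits are redefined locally so the example is self-contained, and InMemoryViews is a hypothetical class, not part of Mimir. Only the method names and signatures follow the descriptions above.

```scala
// Stand-in types for illustration only; Mimir's real ID, Type, and Operator differ.
final case class ID(id: String)
sealed trait Type
case object TInt extends Type
case object TString extends Type
final case class Operator(description: String) // placeholder for a query plan

// Simplified versions of the traits described above.
trait SchemaProvider {
  def listTables(): Seq[ID]
  def tableSchema(table: ID): Option[Seq[(ID, Type)]]
}
trait ViewSchemaProvider extends SchemaProvider {
  def view(table: ID): Operator
}

// Hypothetical provider: each table maps to its schema and its defining query.
class InMemoryViews(views: Map[ID, (Seq[(ID, Type)], Operator)])
    extends ViewSchemaProvider {
  // All tables this provider defines.
  def listTables(): Seq[ID] = views.keys.toSeq

  // Schema of the table, or None if this provider does not define it.
  def tableSchema(table: ID): Option[Seq[(ID, Type)]] =
    views.get(table).map(_._1)

  // Expand the view; throws NoSuchElementException for an unknown table.
  def view(table: ID): Operator = views(table)._2
}
```

Because IDs are case sensitive, a provider built with ID("PEOPLE") would return None from tableSchema(ID("people")).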
Additional Reference
- Ticket #321: Migrate table definitions from Hive to Mimir Metadata: A journal-style record of the design process of the SchemaProvider infrastructure.