Concepts SchemaProvider - UBOdin/mimir GitHub Wiki
Multiple components of Mimir can define tables that can be queried in Mimir. As of this writing the complete list includes:
- `mimir.data.LoadedTables`: Defines all tables instantiated through the `LOAD` command.
- `mimir.data.SparkSchemaProvider`: Defines all tables present in the Hive/Derby store of the attached Spark cluster.
- `mimir.views.ViewManager`: Defines views instantiated through `CREATE VIEW` and `CREATE LENS`.
- `mimir.data.SystemCatalog`: Defines two tables, `TABLES` and `ATTRIBUTES`.
- `mimir.adaptive.AdaptiveSchemaManager`: Defines all tables defined by adaptive schemas (it actually provides a list of AdaptiveSchemas wrapped by `mimir.adaptive.AdaptiveSchemaProvider`).
Accessing SchemaProviders
SchemaProviders are generally managed by mimir.data.SystemCatalog, which provides convenience methods for accessing tables, normalizing case-sensitive table names, querying schemas, and so forth. Basic access methods include:
- `tableOperator(table: ID|Name): Operator`
- `tableOperator(provider: ID|Name, table: ID|Name): Operator`
- `tableExists(table: ID|Name): Boolean`
- `tableSchema(table: ID|Name): Seq[(ID, Type)]`
- `provider(providerName: ID|Name): (ID, SchemaProvider)`

Methods can be used with `mimir.algebra.ID` (always case-sensitive) or `sparsity.Name` (optionally case-sensitive if `quoted = true`).
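The difference between `ID` and `Name` lookups can be illustrated with a small standalone model. This is only a sketch: every type below (`ID`, `Name`, `Type`, `SchemaProvider`, `MapProvider`, `Catalog`) is a simplified stand-in written for this example, not Mimir's actual class. The method `lookupSchema` is a hypothetical name, and the rule that unquoted names match case-insensitively is an assumption drawn from the description above.

```scala
// Standalone model of catalog-style lookups. None of these types are Mimir's own.
object CatalogSketch {
  // Stand-in for mimir.algebra.ID: always compared case-sensitively.
  final case class ID(id: String)
  // Stand-in for sparsity.Name: case-sensitive only when quoted = true.
  final case class Name(name: String, quoted: Boolean = false)

  // Stand-in for a column type.
  sealed trait Type
  case object TInt extends Type
  case object TString extends Type

  // Stand-in for the two informational SchemaProvider methods.
  trait SchemaProvider {
    def listTables(): Seq[ID]
    def tableSchema(table: ID): Option[Seq[(ID, Type)]]
  }

  // Hypothetical provider backed by an in-memory map.
  final class MapProvider(tables: Map[ID, Seq[(ID, Type)]]) extends SchemaProvider {
    def listTables(): Seq[ID] = tables.keys.toSeq
    def tableSchema(table: ID): Option[Seq[(ID, Type)]] = tables.get(table)
  }

  // Toy catalog that searches its providers in registration order.
  final class Catalog(providers: Seq[(ID, SchemaProvider)]) {
    // Resolve a Name to the ID of a known table, if any.
    // Assumption: an unquoted Name matches the first table whose name is
    // equal ignoring case; a quoted Name must match exactly.
    def resolve(table: Name): Option[ID] = {
      val all = providers.flatMap { case (_, p) => p.listTables() }
      if (table.quoted) all.find(_.id == table.name)
      else all.find(_.id.equalsIgnoreCase(table.name))
    }

    // Schema of the first provider that knows the table (hypothetical helper).
    def lookupSchema(table: ID): Option[Seq[(ID, Type)]] =
      providers.iterator
        .map { case (_, p) => p.tableSchema(table) }
        .collectFirst { case Some(schema) => schema }

    def lookupSchema(table: Name): Option[Seq[(ID, Type)]] =
      resolve(table).flatMap(id => lookupSchema(id))

    def tableExists(table: ID): Boolean = lookupSchema(table).isDefined
    def tableExists(table: Name): Boolean = resolve(table).isDefined
  }
}
```

In this model, a catalog holding a single table `Sales` answers `tableExists(ID("sales"))` with `false` but `tableExists(Name("sales"))` with `true`, because only the unquoted `Name` is matched without regard to case.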
MaterializedTableProvider
Some SchemaProviders can also be used to persist tables for later use (e.g., for View Materialization or CREATE TABLE AS SELECT). As of this writing, that includes SparkSchemaProvider and LoadedTables. These are SchemaProviders that also implement the optional MaterializedTableProvider trait. Generally, one MaterializedTableProvider is marked as "preferred" (SparkSchemaProvider if available, and LoadedTables otherwise). The preferred provider can be obtained through SystemCatalog's materializedTableProvider(): (SchemaProvider with MaterializedTableProvider) method.
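The "preferred provider" rule can be sketched as a selection over the registered providers. The code below is a standalone illustration, not Mimir's implementation: the traits are empty stand-ins, the classes `SparkLike`, `LoadedLike`, and `ViewsLike` are hypothetical, and identifying providers by a `name` string is an assumption made to keep the example small.

```scala
// Standalone model of choosing a preferred materialization target.
object MaterializationSketch {
  // Stand-ins for the real traits. The `name` member is an assumption of this sketch.
  trait SchemaProvider { def name: String }
  trait MaterializedTableProvider // marker only; the real trait's methods are not modeled

  // Hypothetical providers: two can persist tables, one cannot.
  final class SparkLike extends SchemaProvider with MaterializedTableProvider {
    val name: String = "spark"
  }
  final class LoadedLike extends SchemaProvider with MaterializedTableProvider {
    val name: String = "loaded"
  }
  final class ViewsLike extends SchemaProvider {
    val name: String = "views"
  }

  // Return the first provider in `preference` order that is both registered
  // and able to materialize tables.
  def preferred(
    providers: Seq[SchemaProvider],
    preference: Seq[String] = Seq("spark", "loaded")
  ): Option[SchemaProvider with MaterializedTableProvider] = {
    // Keep only providers that also implement the materialization trait.
    val capable = providers.collect {
      case p: SchemaProvider with MaterializedTableProvider => p
    }
    preference.iterator
      .map(wanted => capable.find(_.name == wanted))
      .collectFirst { case Some(p) => p }
  }
}
```

With both capable providers registered, `preferred` returns the Spark-like one. If only the loaded-tables stand-in is present, it falls back to that, mirroring the rule described above.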
Defining a SchemaProvider
The SchemaProvider trait provides an abstraction for components that need to define queriable tables. A SchemaProvider must define three methods. The first two provide basic information:
- `def listTables(): Seq[ID]` lists all tables provided by the SchemaProvider.
- `def tableSchema(table: ID): Option[Seq[(ID, Type)]]` provides the typed attributes of each column in the specified table, or `None` if the specified table does not exist. As with other `ID`s, table names are case-sensitive.
The remaining method instantiates the table. Sub-traits of SchemaProvider provide different ways of instantiating the table.
- `ViewSchemaProvider` defines tables as views. The `view(table: ID): Operator` method should expand the view into a table.
- `DataFrameSchemaProvider` defines tables as Spark DataFrames. The `dataframe(table: ID): DataFrame` method should return the DataFrame for the table.
- `LogicalPlanSchemaProvider` defines tables as Spark Logical Plans. The `logicalplan(table: ID): LogicalPlan` method should return the logical plan for the table. Note that the LogicalPlan's output is expected to include a column named `ROWID` that contains a unique, stable identifier for each row. `RowIndexPlan` is a convenience operator that attaches this identifier.
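As a rough illustration of the three-method contract, the sketch below implements a view-style provider against simplified stand-in types. Nothing here is Mimir's real code: `Operator`, `Table`, and `Project` are a toy algebra, `ViewDef` and `StaticViews` are hypothetical names, and throwing an exception from `view` for an unknown table is an assumption, since the expected behavior in that case is not specified above.

```scala
// Standalone model of a ViewSchemaProvider-style component.
object ViewProviderSketch {
  // Stand-ins for Mimir's ID and Type.
  final case class ID(id: String)
  sealed trait Type
  case object TInt extends Type
  case object TString extends Type

  // Toy stand-in for an operator tree: just enough structure to expand a view.
  sealed trait Operator
  final case class Table(name: ID) extends Operator
  final case class Project(columns: Seq[ID], source: Operator) extends Operator

  // The two informational methods described above.
  trait SchemaProvider {
    def listTables(): Seq[ID]
    def tableSchema(table: ID): Option[Seq[(ID, Type)]]
  }

  // The third method: instantiate the table by expanding a view.
  trait ViewSchemaProvider extends SchemaProvider {
    def view(table: ID): Operator
  }

  // Hypothetical record pairing a view's schema with its defining query.
  final case class ViewDef(schema: Seq[(ID, Type)], query: Operator)

  // Hypothetical provider serving a fixed set of views.
  final class StaticViews(views: Map[ID, ViewDef]) extends ViewSchemaProvider {
    def listTables(): Seq[ID] = views.keys.toSeq

    // Returns None for unknown tables; lookups are case-sensitive because
    // ID equality compares the underlying strings exactly.
    def tableSchema(table: ID): Option[Seq[(ID, Type)]] =
      views.get(table).map(_.schema)

    // Assumption: asking for an unknown view is an error in this sketch.
    def view(table: ID): Operator =
      views.getOrElse(
        table,
        throw new NoSuchElementException(s"No such view: ${table.id}")
      ).query
  }
}
```

A DataFrame- or LogicalPlan-based provider would follow the same shape, replacing `view` with `dataframe` or `logicalplan` and returning the corresponding Spark object.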
Additional Reference
- Ticket #321: Migrate table definitions from Hive to Mimir Metadata: A journal-style record of the design process of the SchemaProvider infrastructure.