Concepts SchemaProvider - UBOdin/mimir GitHub Wiki

Several components of Mimir can define tables that can be queried. As of this writing, the complete list is:

  • mimir.data.LoadedTables: Defines all tables instantiated through the LOAD command.
  • mimir.data.SparkSchemaProvider: Defines all tables present in the Hive/Derby store of the attached Spark cluster.
  • mimir.views.ViewManager: Defines views instantiated through CREATE VIEW and CREATE LENS.
  • mimir.data.SystemCatalog: Defines two tables, TABLES and ATTRIBUTES.
  • mimir.adaptive.AdaptiveSchemaManager: Defines all tables defined by adaptive schemas. Strictly speaking, it provides a list of AdaptiveSchemas, each wrapped by mimir.adaptive.AdaptiveSchemaProvider.

Accessing SchemaProviders

SchemaProviders are generally managed by mimir.data.SystemCatalog, which provides convenience methods for accessing tables, normalizing case-sensitive table names, querying schemas, and so forth. Basic access methods include:

  • tableOperator(table: ID|Name): Operator
  • tableOperator(provider: ID|Name, table: ID|Name): Operator
  • tableExists(table: ID|Name): Boolean
  • tableSchema(table: ID|Name): Seq[(ID, Type)]
  • provider(providerName: ID|Name): (ID, SchemaProvider)

Each method accepts either a mimir.algebra.ID (always case-sensitive) or a sparsity.Name (case-sensitive only if quoted = true).
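The difference between the two identifier kinds can be illustrated with a small, self-contained sketch. The ID and Name classes below are simplified stand-ins, not the real mimir.algebra.ID or sparsity.Name, and the resolve helper is hypothetical: it shows one plausible way an unquoted Name could be matched against case-sensitive IDs, not how SystemCatalog actually does it.

```scala
// Simplified stand-ins for illustration only; the real classes live in
// mimir.algebra and sparsity and have more members.
final case class ID(id: String)                               // always case-sensitive
final case class Name(name: String, quoted: Boolean = false)  // case-sensitive only when quoted

object NameResolution {
  // Hypothetical helper: find the table ID that a Name refers to.
  // A quoted Name must match exactly; an unquoted Name ignores case.
  def resolve(name: Name, tables: Seq[ID]): Option[ID] =
    if (name.quoted) tables.find(_.id == name.name)
    else tables.find(_.id.equalsIgnoreCase(name.name))
}
```

Under these assumptions, an unquoted Name("sales") would resolve to a table with ID "Sales", while the quoted form Name("sales", quoted = true) would not.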

MaterializedTableProvider

Some SchemaProviders can also be used to persist tables for later use (e.g., for View Materialization or CREATE TABLE AS SELECT). As of this writing, that includes SparkSchemaProvider and LoadedTables. These are SchemaProviders that also implement the optional MaterializedTableProvider trait. Generally, one MaterializedTableProvider is marked as "preferred": SparkSchemaProvider if available, and LoadedTables otherwise. The preferred provider can be obtained through SystemCatalog's materializedTableProvider(): (SchemaProvider with MaterializedTableProvider) method.
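The "preferred provider" rule can be sketched as picking the first provider, in priority order, that also implements the materialization trait. Everything below is a stand-in written for this illustration: the two traits are reduced to almost nothing, and PreferredProvider, DemoSpark, DemoLoaded, and DemoViews are hypothetical names that do not appear in Mimir.

```scala
// Stand-in traits; the real Mimir traits declare more members.
trait SchemaProvider { def name: String }
trait MaterializedTableProvider { this: SchemaProvider => }  // marker only, in this sketch

object PreferredProvider {
  type Materializing = SchemaProvider with MaterializedTableProvider

  // Return the first provider in the given priority order that can
  // persist tables, or None if no provider supports materialization.
  def choose(providersInPriorityOrder: Seq[SchemaProvider]): Option[Materializing] =
    providersInPriorityOrder.collectFirst {
      case p: SchemaProvider with MaterializedTableProvider => p
    }
}

// Hypothetical providers used only to exercise the sketch.
object DemoSpark extends SchemaProvider with MaterializedTableProvider { val name = "SPARK" }
object DemoLoaded extends SchemaProvider with MaterializedTableProvider { val name = "LOADED" }
object DemoViews extends SchemaProvider { val name = "VIEWS" }
```

With a priority order of Seq(DemoSpark, DemoLoaded), the Spark-like provider wins when present and the loaded-tables provider is the fallback, which mirrors the rule described above.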

Defining a SchemaProvider

The SchemaProvider trait provides an abstraction for components that need to define queryable tables. A SchemaProvider must define three methods. The first two provide basic information:

  • def listTables(): Seq[ID] lists all tables provided by the SchemaProvider
  • def tableSchema(table: ID): Option[Seq[(ID, Type)]] provides the typed attributes of each column in the specified table, or None if the specified table does not exist. As with other IDs, table names are case sensitive.

The remaining method instantiates the table. Sub-traits of SchemaProvider provide different ways of instantiating the table.

  • ViewSchemaProvider defines tables as views. The view(table: ID): Operator method should expand the view into a table.
  • DataFrameSchemaProvider defines tables as Spark DataFrames. The dataframe(table: ID): DataFrame method should return the DataFrame for the table.
  • LogicalPlanSchemaProvider defines tables as Spark Logical Plans. The logicalplan(table: ID): LogicalPlan method should return the logical plan for the table. Note that the LogicalPlan's output is expected to include a column named ROWID that contains a unique, stable identifier for each row. RowIndexPlan is a convenience operator that attaches this identifier.
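As a rough illustration of the three-method contract, the sketch below implements a view-style provider over a fixed in-memory map. It does not depend on Mimir: ID, Type, Operator, and the two traits are simplified stand-ins that copy only the signatures listed above, while TInt, TString, TableRef, and StaticViewProvider are hypothetical names introduced for the example.

```scala
// Stand-ins for Mimir types, reduced to what this sketch needs.
final case class ID(id: String)
sealed trait Type
case object TInt extends Type
case object TString extends Type
sealed trait Operator
final case class TableRef(name: ID) extends Operator  // hypothetical operator

// Signatures follow the descriptions above; the real traits may differ.
trait SchemaProvider {
  def listTables(): Seq[ID]
  def tableSchema(table: ID): Option[Seq[(ID, Type)]]
}
trait ViewSchemaProvider extends SchemaProvider {
  def view(table: ID): Operator
}

// A hypothetical provider serving a fixed set of views, keyed by
// case-sensitive ID. Each entry holds the view's schema and its query.
final class StaticViewProvider(
  views: Map[ID, (Seq[(ID, Type)], Operator)]
) extends ViewSchemaProvider {
  def listTables(): Seq[ID] = views.keys.toSeq

  // None signals that the table does not exist in this provider.
  def tableSchema(table: ID): Option[Seq[(ID, Type)]] =
    views.get(table).map(_._1)

  // Throws NoSuchElementException for unknown tables; callers are
  // assumed to check tableSchema or listTables first.
  def view(table: ID): Operator = views(table)._2
}
```

A DataFrameSchemaProvider or LogicalPlanSchemaProvider would follow the same shape, replacing view with dataframe or logicalplan respectively.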

Additional Reference