Lenses and Adaptive Schemas - UBOdin/mimir GitHub Wiki
Mimir uses data cleaning operators called Lenses and Adaptive Schemas to automate the preliminary steps of data curation. To do this, Mimir needs to make guesses about how to transform your data. Mimir will make the best guesses that it can, but it can't always be right.
Instead, Mimir will communicate with you about what it's doing to your data: you're not locked into the choices it makes. Mimir will warns you whenever you run a query that it's making guesses about (with attributes in red or rows in grey). You can use ANALYZE
to identify which guesses it made, and FEEDBACK
to confirm or correct any of these guesses.
Lenses
Lenses transform one table at a time, typically to enforce some sort of constraint or cleaning process on the table. Lenses behave like views, but produce outputs that have been cleaned in one of several ways.
TYPE_INFERENCE
The Type Inference lens detects and assigns stronger types to the output of a query. A typical application is to newly imported CSV or JSON data: The type inference lens detects, for example, columns that are mostly integers or mostly floats, and typecasts the entire column accordingly.
'CREATE' 'LENS' name 'AS' query 'WITH' 'TYPE_INFERENCE' '(' minSupport ')'
- minSupport: The percent of data that Mimir needs to see conforming to a particular type before it accepts the default (string) type.
MISSING_VALUE
The missing value lens detects NULL values in specified columns, and comes up with a way to replace them. A typical application is to correct errors in the Type Inference lens, or impute alternatives.
'CREATE' 'LENS' name 'AS' query 'WITH' 'MISSING_VALUE' '(' attrName[, attrName[, ...]] ')'
- attrName: A list of attributes to remove NULL values from
SCHEMA_MATCH
The schema matching lens forces its input to conform to the specified schema, remapping or replacing columns as needed.
'CREATE' 'LENS' name 'AS' query 'WITH' 'SCHEMA_MATCH' '(' '\'' attrName attrType '\''[, '\'' attrName attrType '\''[, ...]] ')'
- ''' attrName attrType '''*: Quoted strings containing the name of an attribute and the type. This will be the schema of the generated lens
KEY_REPAIR
The key repair lens is analogous to SELECT DISTINCT
, but more like a group-by aggregate. You identify one or more columns when defining the lens. The lens will contain one row for each distinct value of those columns. If there is no duplication in those columns, the output will be identical to the input. If there is duplication, duplicates will be folded together like a group-by aggregate according to the following rules:
- If there are multiple distinct values being folded together, Mimir will pick one arbitrarily (e.g., as in
FIRST
). As with any other lens, you can override this decision withANALYZE
andFEEDBACK
. - NULL values will be discarded
- Multiple identical values will be treated as one.
'CREATE' 'LENS' name 'AS' query 'WITH' 'KEY_REPAIR' '(' keyName[, keyName[, ...]][, option[, option[,...]]] ')'
- keyName: Columns to enforce a key constraint on.
- option: One of the following options may be specified as arguments
SCORE_BY(scoreCol)
: scoreCol identifies one of the input columns to use to assign a ranking to conflicted rows. Instead of picking arbitrarily, Mimir will pick values from the row with the highest score when resolving conflicts.ENABLE(FAST_PATH)
: an experimental optimization that precaches a list of already unique keys. When the data is already mostly clean, this can significantly improve query performance.
GEOCODING
The geocoding lens geocodes addresses to gps coordinates.
'CREATE' 'LENS' name 'AS' query 'WITH' 'GEOCODING' '(' 'HOUSE_NUMBER('STRNUMBER_COL')',
'STREET('STRNAME_COL')',
'CITY('CITY_COL')',
'STATE('STATE_COL')',
'GEOCODER('GEOCODER_NAME')' ')'
HOUSE_NUMBER(STRNUMBER_COL)
: STRNUMBER_COL is the name of an attribute that contains the house number.STREET(STRNAME_COL)
: STRNAME_COL is the name of an attribute that contains the street name.CITY(CITY_COL)
: CITY_COL is the name of an attribute that contains the city name.STATE(STATE_COL)
: STATE_COL is the name of an attribute that contains the state name.GEOCODER(GEOCODER_NAME)
: GEOCODER_NAME is a string specifying the geocodes to use: (GOOGLE, OSM)
Adaptive Schemas
Adaptive schemas dynamically shred some input data, dynamically creating new relational tables that provide an easier-to-digest, more representative view on your data. Like Lenses, the shredding process relies on guesses. To explore and interact with the views, you can use Mimir's system catalog tables.
To see all the available tables:
SELECT * FROM mimir_sys_tables
To see all the attributes for table 'BOB', use
SELECT * FROM mimir_sys_attrs WHERE table_name = 'BOB'
Like queries over lenses, attributes and tables may be listed as uncertain. Like lenses, you can use ANALYZE
and FEEDBACK
to assess, confirm, and/or correct Mimir's guesses about how to shred the schema.
Tables in a given adaptive schema can be referenced as schemaName.tablename
DISCALA_ABADI
The DiScala/Abadi adaptive schema generator implements a schema normalization technique proposed by Michael DiScala and Daniel Abadi. This technique detects weak functional dependencies in its input, grouping attributes by shared dependencies, and creating master-detail views for each group. Common use cases include exploring CSV files with significant denormalized data.
'CREATE' 'ADAPTIVE' 'SCHEMA' schemaName 'AS' query 'WITH' 'DISCALA_ABADI' '(' ')'
DETECT_HEADER
The Detect Header adaptive schema generator detects if a loaded CSV file has a header or not. The table is initially loaded into the database assuming that it has no header. The DETECT_HEADER adaptive schema takes a sample of size N from the data and checks the similarity of tuples(2 to N) with the first tuple to determine if the first tuple is a header row or not.
'CREATE' 'ADAPTIVE' 'SCHEMA' schemaName 'AS' query 'WITH' 'DETECT_HEADER' '(' ')'