Delimited Identifiers - trinodb/trino GitHub Wiki
This document captures the design for adding support for case-sensitive identifiers in Trino. Trino currently does case-insensitive matching of identifiers by lowercasing aggressively across the engine and the SPI. The goal is to bring identifier handling closer to the SQL standard while preserving interoperability with non-SQL-compliant storage engines (Hive, PostgreSQL, MariaDB, etc.).
The SQL standard defines two forms of identifier:
```
<delimited identifier> ::=
    <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::= <delimited identifier part>...

<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
    U <ampersand> <double quote> <Unicode delimiter body> <double quote>
    <Unicode escape specifier>
```
The standard defines a case-normal form for non-delimited identifiers, derived by replacing every lower-case (or title-case) character with its Unicode-defined upper-case sequence; all other characters pass through unchanged. The case-normal form is used for determination of identifier equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.
Two regular (non-delimited) identifiers are equivalent if their case-normal forms compare equally under an implementation-defined case-sensitive collation. A regular identifier and a delimited identifier are equivalent if the case-normal form of the regular identifier and the body of the delimited identifier compare equally under the same collation. Two delimited identifiers are equivalent if their bodies compare equally; case-normal form is not applied to delimited identifiers. The asymmetry between regular and delimited identifiers is by design.
In effect, regular identifiers are normalized at declaration time. Once an identifier is declared and stored (in column descriptors, in catalog metadata, in the Information Schema), it is a string in its canonical form. Equivalence between two stored names reduces to plain string comparison under the implementation-defined case-sensitive collation. Trino's choice of that collation is binary equality.
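Under these rules, equivalence of any two identifiers reduces to comparing their stored forms. A minimal sketch of the standard's model (class and method names here are illustrative, not Trino SPI; `ENGLISH` is assumed as the implementation-defined collation locale):

```java
import java.util.Locale;

// Illustrative sketch of SQL-standard identifier equivalence. Regular
// identifiers fold to case-normal (upper-case) form at declaration time;
// delimited identifiers keep their body verbatim. Equivalence then
// reduces to binary string equality of the stored forms.
class SqlStandardEquivalence
{
    // Case-normal form for regular identifiers; ENGLISH avoids locale
    // surprises such as the Turkish dotless i.
    static String storedForm(String value, boolean delimited)
    {
        return delimited ? value : value.toUpperCase(Locale.ENGLISH);
    }

    static boolean equivalent(String a, boolean aDelimited, String b, boolean bDelimited)
    {
        return storedForm(a, aDelimited).equals(storedForm(b, bDelimited));
    }
}
```

Under this sketch, regular `Foo` is equivalent to delimited `"FOO"` (both store as `FOO`), while delimited `"Foo"` and `"foo"` remain distinct.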
Trino brings identifier handling closer to the SQL standard wherever possible, while accepting that the underlying connectors may not be SQL-compliant. The design accommodates four categories of connector:
- SQL-compliant (Oracle, SQL Server, Druid, etc.): delimited identifiers preserve case; non-delimited identifiers normalize to upper case.
- Case-sensitive but not SQL-compliant (PostgreSQL, etc.): delimited identifiers preserve case; non-delimited identifiers normalize to lower case.
- Case-insensitive (Hive): delimited and non-delimited identifiers behave the same; identifiers are normalized to a fixed case (lower).
- Case-preserving and case-insensitive (MariaDB / MySQL columns): names are stored with their original case but compared case-insensitively, regardless of whether the reference is delimited. `ColumnA`, `columna`, `"ColumnA"`, and `"columna"` all resolve to the same column.
A `Canonicalizer` is supplied by each catalog. It exposes a single method:

```java
String canonicalize(String value, IdentifierKind kind, boolean delimited);
```

`IdentifierKind` is an SPI enum identifying the kind of object the value is being canonicalized for:

```java
enum IdentifierKind
{
    SCHEMA,
    TABLE,  // includes views and materialized views (same namespace)
    COLUMN,
}
```

The `delimited` argument captures whether the identifier appeared between double quotes in the user's SQL text. For stored names (those returned by the connector and held in `RelationType`, `ColumnMetadata`, etc.) the engine passes `delimited = true` by convention, since stored names are concrete strings rather than parsed input.
The default for `ConnectorMetadata.canonicalize(value, kind, delimited)` follows the SQL standard regardless of `kind`: `delimited ? value : value.toUpperCase(ENGLISH)`. Non-delimited identifiers fold to case-normal form (upper case); delimited identifiers pass through verbatim. Connectors that do not follow standard semantics override.
The `IdentifierKind` parameter accommodates connectors whose canonicalization rules differ per identifier kind. The motivating case is MariaDB / MySQL, where column names are always case-insensitive while table and schema name behavior is governed by `lower_case_table_names` and may be case-sensitive.
The `delimited` parameter accommodates the case-preserving / case-insensitive category, primarily MariaDB / MySQL column names. In those catalogs, columns are stored with their original case (`CREATE TABLE t (ColumnA INT)` stores `ColumnA`) but compared case-insensitively, regardless of whether the reference is delimited. To express that, the catalog's override needs to fold even when the user wrote a delimited identifier:
```
// MariaDB / MySQL override for COLUMN: ignores the delimited bit
canonicalize(value, COLUMN, _) = value.toLowerCase(ENGLISH);
```

Without this bit, the only ways to make case-preserving / case-insensitive catalogs work would be:

- Lowercase only non-delimited references, breaking the catalog's actual semantics for delimited references.
- Require the connector to lowercase stored column names on the way out, sacrificing display fidelity: `SHOW COLUMNS` would show `columna` instead of the user's `ColumnA`.

Both are observable departures from the database's native behavior. The `delimited` bit lets a catalog opt out of the SQL-standard "delimited is verbatim" rule on a per-`IdentifierKind` basis without losing original case in display.
Most connectors implement the same rule across all kinds and the same rule for delimited and non-delimited input; they ignore one or both parameters.
| Connector | SCHEMA | TABLE | COLUMN |
|---|---|---|---|
| Oracle, SQL Server, Druid (SQL-standard default) | `delimited ? value : upper(value)` | `delimited ? value : upper(value)` | `delimited ? value : upper(value)` |
| Hive, Lakehouse, Iceberg over Hive metastore | `delimited ? value : lower(value)` | `delimited ? value : lower(value)` | `delimited ? value : lower(value)` |
| PostgreSQL, Redshift | `delimited ? value : lower(value)` | `delimited ? value : lower(value)` | `delimited ? value : lower(value)` |
| MariaDB, MySQL | `delimited ? value : value` | `delimited ? value : value` | `lower(value)` (bit ignored) |
The MariaDB / MySQL COLUMN row is the load-bearing case for the delimited bit. The other rows preserve the SQL-standard property that delimited and non-delimited references behave differently.
For MariaDB / MySQL SCHEMA and TABLE, the override is identity in both bit positions: case is preserved on case-sensitive filesystems (Linux); on case-insensitive filesystems (Windows, macOS) the database itself enforces uniqueness, so Trino doesn't need to.
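The rows in the table above can be sketched as per-catalog implementations of the single `canonicalize` method. The constants and lambda shapes below are illustrative, not Trino code:

```java
import java.util.Locale;

// Hedged sketch of the per-catalog canonicalizers described above. The
// interface and enum mirror the SPI shapes in this document; the concrete
// constants are illustrative names.
class Canonicalizers
{
    enum IdentifierKind { SCHEMA, TABLE, COLUMN }

    interface Canonicalizer
    {
        String canonicalize(String value, IdentifierKind kind, boolean delimited);
    }

    // SQL-standard default: fold non-delimited input to upper case,
    // pass delimited input through verbatim.
    static final Canonicalizer SQL_STANDARD =
            (value, kind, delimited) -> delimited ? value : value.toUpperCase(Locale.ENGLISH);

    // Hive / PostgreSQL style: fold non-delimited input to lower case.
    static final Canonicalizer LOWER_CASE =
            (value, kind, delimited) -> delimited ? value : value.toLowerCase(Locale.ENGLISH);

    // MariaDB / MySQL style: identity for SCHEMA and TABLE in both bit
    // positions; COLUMN folds to lower case and ignores the delimited bit.
    static final Canonicalizer MARIADB =
            (value, kind, delimited) -> kind == IdentifierKind.COLUMN
                    ? value.toLowerCase(Locale.ENGLISH)
                    : value;
}
```

Note how only the MariaDB COLUMN rule ignores `delimited`; the other rules keep the SQL-standard distinction between quoted and unquoted references.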
Catalog names are not canonicalized through this API. Roles, users, and session properties are also not canonicalized through this API; see those sections below.
The engine reaches a per-catalog canonicalizer through Metadata.getCanonicalizer(session, catalogName). Metadata does not have a default implementation of this method; only MetadataManager and the testing-focused implementations of Metadata provide it.
Canonicalization is a one-shot operation performed at the boundary where a syntactic `QualifiedName` from the AST is converted into a `QualifiedObjectName` for use in metadata APIs, typically inside `MetadataUtil.createQualifiedObjectName` and `MetadataUtil.createCatalogSchemaName`. After that boundary, names are `String`s already canonical for their catalog.
AST nodes are immutable and never carry resolution-time metadata. `Identifier` exposes only `(value, delimited)`. `Identifier.getCanonicalValue()` returns the case-normal form for non-delimited identifiers and the body for delimited ones; that is, the SQL-standard rule, used for engine-internal identifiers (see below) and unaffected by any catalog's `Canonicalizer`.
The query planner is not aware of canonicalization. Any necessary canonicalization is completed by the analyzer before plan construction.
QualifiedName is a syntactic-only representation. It is constructed from Identifier AST nodes and stores parts verbatim; it does not normalize. Code that needs canonical names goes through MetadataUtil.createQualifiedObjectName and uses QualifiedObjectName.
`SchemaTableName`, `CatalogSchemaName`, `ColumnMetadata`, and `ColumnSchema` preserve their input verbatim, with no `toLowerCase` in their constructors. Each catalog produces stored names per its own canonicalization rules at the point columns / tables / schemas are declared, and the engine stores those names as-is.
This is consistent with the SQL standard's model: identifiers are normalized at declaration time, and stored names are already canonical. The engine does not re-canonicalize them on the way in.
Identifiers that never cross the connector boundary follow the SQL standard, independent of any catalog. This applies to:
- Built-in function names and pattern-recognition function names (`PREV`, `NEXT`, `CLASSIFIER`, `MATCH_NUMBER`, `FIRST`, `LAST`)
- Pattern recognition labels
- `LANGUAGE` values (e.g., `LANGUAGE PYTHON`)
- `JSON_TABLE` column and path names
- Procedure argument names
- `WITH` query names
- Set-operation column aliases used in `CORRESPONDING`
- AST-level equality (e.g., `CanonicalizationAware`)
These use Identifier.getCanonicalValue() (SQL-standard case-normal form) and do not consult any catalog's Canonicalizer.
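A minimal sketch of that AST-side rule, assuming only the two-field `Identifier` shape described above (this class is a simplification for illustration, not the real engine class):

```java
import java.util.Locale;

// Simplified sketch of an AST Identifier carrying only (value, delimited),
// with the SQL-standard getCanonicalValue(): case-normal (upper) form for
// non-delimited identifiers, the body verbatim for delimited ones. No
// catalog Canonicalizer is consulted here.
final class Identifier
{
    private final String value;
    private final boolean delimited;

    Identifier(String value, boolean delimited)
    {
        this.value = value;
        this.delimited = delimited;
    }

    String getCanonicalValue()
    {
        return delimited ? value : value.toUpperCase(Locale.ENGLISH);
    }
}
```

Two `WITH` query names `q` and `Q` therefore resolve to the same canonical value `Q`, while `"q"` stays distinct, regardless of which catalogs the query touches.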
Field matching against RelationType follows the originating catalog's canonicalization rules. The engine canonicalizes both the user-typed identifier and the stored field name and compares the results by binary equality:
```java
canonicalize(userInput, kind, userDelimited)
        .equals(canonicalize(storedName, kind, true))
```

`RelationType` carries the originating catalog identity so the field-resolution layer can pick the right canonicalizer.
For SQL-standard and case-folding catalogs, `canonicalize(stored, kind, true)` is identity: the database already enforced canonical form at storage time, so this side is a no-op in practice, and the rule reduces to "canonicalize the user's identifier and compare against the stored name verbatim."
For MariaDB / MySQL columns, the stored side does real work: the stored `ColumnA` canonicalizes to `columna`, which is what the user-side canonicalization also produces regardless of casing or delimiting.
RelationType continues to hold verbatim names; the canonical form is computed at lookup time, optionally memoized.
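The matching rule can be sketched as a small helper; the `Canonicalizer` shape and `matches` helper below are illustrative, not engine code:

```java
// Sketch of the field-matching rule described above: canonicalize both the
// user-typed identifier and the stored field name with the originating
// catalog's rules, then compare by binary string equality.
class FieldMatching
{
    enum IdentifierKind { SCHEMA, TABLE, COLUMN }

    interface Canonicalizer
    {
        String canonicalize(String value, IdentifierKind kind, boolean delimited);
    }

    static boolean matches(
            Canonicalizer canonicalizer,
            IdentifierKind kind,
            String userInput,
            boolean userDelimited,
            String storedName)
    {
        // Stored names pass delimited = true by convention: they are
        // concrete strings, not parsed SQL text.
        return canonicalizer.canonicalize(userInput, kind, userDelimited)
                .equals(canonicalizer.canonicalize(storedName, kind, true));
    }
}
```

With a SQL-standard canonicalizer this reproduces the Oracle rows in the table below; with a lowercase-always column rule it reproduces the MySQL rows.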
| Catalog | Stored (verbatim) | User input | User canonical | Stored canonical | Match |
|---|---|---|---|---|---|
| Oracle | `FOO` (from non-delimited `Foo`) | `Foo` | `FOO` | `FOO` | ✓ |
| Oracle | `FOO` | `"Foo"` | `Foo` | `FOO` | ✗ |
| Oracle | `Foo` (from delimited `"Foo"`) | `Foo` | `FOO` | `Foo` | ✗ |
| Oracle | `Foo` | `"Foo"` | `Foo` | `Foo` | ✓ |
| MySQL (COLUMN) | `ColumnA` | `ColumnA` | `columna` | `columna` | ✓ |
| MySQL (COLUMN) | `ColumnA` | `columna` | `columna` | `columna` | ✓ |
| MySQL (COLUMN) | `ColumnA` | `"ColumnA"` | `columna` | `columna` | ✓ |
| MySQL (COLUMN) | `ColumnA` | `"columna"` | `columna` | `columna` | ✓ |
Catalog names are an engine-level concept. They are pinned to legacy lowercase canonicalization regardless of any connector. Extending support for non-lowercase catalog names is out of scope for this design and may be revisited later.
Role and user names follow the system security manager's rules (currently: lowercase). Session property names follow the SQL standard (upper case for non-delimited, preserve delimited).
These do not go through ConnectorMetadata.canonicalize; they are not connector-bound.
When the engine emits an identifier in formatted SQL output (SHOW CREATE TABLE, SHOW CREATE VIEW, SHOW CREATE MATERIALIZED VIEW, formatted plan output, etc.), it quotes a name iff:
```
!Identifier.isValidIdentifier(value)
    || canonicalize(value, kind, false) != canonicalize(value, kind, true)
```
The first clause handles names containing characters that require delimiting (spaces, punctuation, non-ASCII). The second captures whether non-delimited and delimited canonicalizations diverge for this name; that is, whether re-emitting it unquoted would fail to parse-canonicalize back to the same key.
For SQL-compliant catalogs, the second clause reduces to "casing differs from canonical." For case-preserving / case-insensitive catalogs (MariaDB / MySQL columns), `canonicalize(value, kind, false) == canonicalize(value, kind, true)` always holds, so case alone never forces quoting; names round-trip as the user wrote them.
This is a derived predicate over Canonicalizer.canonicalize and Identifier.isValidIdentifier. It is invoked from every SQL-formatting path that emits identifiers; the formatter knows which kind each emitted name has and passes the kind in.
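The predicate can be sketched as follows, with a deliberately simplified `isValidIdentifier` (ASCII letters, digits, underscore; the engine's real check is broader) and an illustrative `Canonicalizer` shape:

```java
// Sketch of the output-quoting predicate: quote when the name is not a
// valid bare identifier, or when its non-delimited and delimited
// canonicalizations diverge (so emitting it unquoted would not round-trip).
class IdentifierQuoting
{
    interface Canonicalizer
    {
        String canonicalize(String value, boolean delimited);
    }

    // Simplified stand-in for Identifier.isValidIdentifier.
    static boolean isValidIdentifier(String value)
    {
        return value.matches("[a-zA-Z_][a-zA-Z0-9_]*");
    }

    static boolean needsQuoting(Canonicalizer canonicalizer, String value)
    {
        return !isValidIdentifier(value)
                || !canonicalizer.canonicalize(value, false)
                        .equals(canonicalizer.canonicalize(value, true));
    }

    static String format(Canonicalizer canonicalizer, String value)
    {
        return needsQuoting(canonicalizer, value)
                ? "\"" + value.replace("\"", "\"\"") + "\""  // escape embedded quotes
                : value;
    }
}
```

With a SQL-standard canonicalizer, `ORDERS` emits bare while `Orders` emits as `"Orders"`; with a lowercase-always column rule, `ColumnA` emits bare, preserving the user's casing.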
Worked examples for a catalog with SQL-standard canonicalization (any kind):
| Stored name | `canonicalize(v, k, false)` | `canonicalize(v, k, true)` | Output |
|---|---|---|---|
| `ORDERS` | `ORDERS` | `ORDERS` | `ORDERS` |
| `Orders` | `ORDERS` | `Orders` | `"Orders"` |
| `orders` | `ORDERS` | `orders` | `"orders"` |
| `my table` | (not a valid identifier) | n/a | `"my table"` |
| `1foo` | (not a valid identifier) | n/a | `"1foo"` |
For a Hive-style catalog with lowercase canonicalization (any kind):
| Stored name | `canonicalize(v, k, false)` | `canonicalize(v, k, true)` | Output |
|---|---|---|---|
| `orders` | `orders` | `orders` | `orders` |
| `Orders` | `orders` | `Orders` | `"Orders"` |
| `ORDERS` | `orders` | `ORDERS` | `"ORDERS"` |
For a MariaDB-style catalog (identity for SCHEMA / TABLE, lowercase for COLUMN):
| Stored name | Kind | `canonicalize(v, k, false)` | `canonicalize(v, k, true)` | Output |
|---|---|---|---|---|
| `Orders` | TABLE | `Orders` | `Orders` | `Orders` |
| `Orders` | COLUMN | `orders` | `orders` | `Orders` |
| `orders` | COLUMN | `orders` | `orders` | `orders` |
| `ORDERS` | COLUMN | `orders` | `orders` | `ORDERS` |
The MariaDB COLUMN rows show the case-preserving property: the original case round-trips faithfully, even when it differs from the catalog's canonical lowercase, because non-delimited and delimited canonicalizations agree.
For UNION, INTERSECT, and EXCEPT with CORRESPONDING, column names from each arm of the set operation are matched by binary case-sensitive string equality on the names already stored in RelationType. No additional engine-side normalization is applied at match time.
This follows from the SQL standard's model. Identifiers are normalized at declaration time; by the time names reach RelationType, they are stored canonically and equivalence reduces to string equality.
The user-supplied `<corresponding column list>`, when present, consists of identifiers in the SQL text. Those go through the engine-internal path (SQL-standard canonicalization, since column names in a `<corresponding column list>` are not catalog-bound) and are then compared by string equality against stored names.
When the two arms come from catalogs with different canonicalization rules, their stored names may genuinely differ (e.g., `FOO` from a SQL-compliant catalog vs. `foo` from a Hive-style catalog). The user resolves this with explicit aliases.
When the matched columns from the two arms have non-identical names, the result column takes the left arm's name. (The SQL standard leaves this implementation-defined.)
A connector that supports mixed-case identifiers:
- Returns names from its metadata APIs verbatim from the underlying database. The engine stores those names verbatim in SPI value classes.
- Implements `ConnectorMetadata.canonicalize(value, kind, delimited)` according to its rules. The default implementation (SQL-standard upper case for non-delimited, identity for delimited) covers SQL-compliant connectors; others override.
- Pushes to its underlying storage engine the right delimited / non-delimited form. JDBC connectors use `IdentifierMapping` for the round-trip; the engine no longer pre-normalizes identifiers passed to those mappings.
A connector that does not support mixed-case identifiers (e.g., Hive) overrides canonicalize to its single normalization (typically lowercase) and produces stored names in that form. The engine's behavior for such a connector is unchanged from today.
The SPI value classes (`SchemaTableName`, `CatalogSchemaName`, `ColumnMetadata`, `ColumnSchema`) preserve their input verbatim, with no `toLowerCase` in their constructors. Trino's general policy is to maintain SPI compatibility as much as possible to avoid breaking third-party connectors, but in this case the break is unavoidable: the goal of the change is precisely to stop the engine from re-canonicalizing names that the connector has already produced. In-tree connectors are updated as part of the same change set. Out-of-tree connectors that previously relied on the SPI implicitly lowercasing their inputs need to lowercase on their end (or otherwise produce names in a form consistent with the connector's `canonicalize` rules) before upgrading.
Existing session-property names that depend on the engine's case-folding behavior may need a transition window during which both the old case-folded form and the new SQL-standard form are accepted. This is addressed in a follow-up; the initial change preserves current session-property behavior where feasible.
The following are deliberately not addressed here and may be revisited as separate work:
- Non-lowercase catalog names. Catalogs remain pinned to lowercase canonicalization.
- Information Schema query semantics under mixed-case identifiers. The Information Schema reflects whatever the connector returns; changes to its query semantics are independent of this design.
- Engine-side case-insensitive string comparison support (issue #27347). This design does not address comparison semantics for `VARCHAR` data; it only addresses identifiers.
- Per-connector overrides for connectors that have not been audited (MongoDB, BigQuery, Snowflake, Redshift, MySQL beyond the MariaDB-shaped override). These can be added as their canonicalization rules are confirmed.
- Issue #17: tracking issue for case-sensitive identifier support.
- Issue #27347: case-insensitive comparison support for `VARCHAR` (related but orthogonal).
- PR #25260: `CORRESPONDING` set operations.
- PR #25701, PR #27609: catalog name lowercase canonicalization.
- "SQL:1999: Understanding Relational Language Components" by Jim Melton (Morgan Kaufmann, 2002), §3.2 "Identifiers", on the rationale for case-normal form storage.