ApacheCalcite - eellpp/pubScratchpad GitHub Wiki
ApacheCalcite contains many of the pieces that comprise a typical database management system, except :
- no storage of data,
- no algorithms to process data,
- no repository for storing metadata.
Tutorial : https://calcite.apache.org/docs/tutorial.html
Calcite can handle various data formats.
- in memory
- JDBC
- custom adapters can be added for other data formats. An adapter creates tables from data sources
Calcite provides optimisers for better query plan.
You can write optimizer rules allow you to register new operators (such as a better join algorithm), and allow Calcite to optimize how queries are translated to operators. Calcite will combine your rules and operators with built-in rules and operators, apply cost-based optimization, and generate an efficient plan.
Properties info = new Properties();
info.setProperty("lex", "JAVA");
This line sets a property in the Properties object. The key is "lex", and the value is "JAVA".
In the context of Apache Calcite, the lex property is used to specify the lexical policy, which determines how identifiers (like table names, column names, etc.) are quoted and interpreted. Setting lex to "JAVA" means that Calcite will use Java-like rules for parsing identifiers (e.g., case-sensitive identifiers).
In Apache Calcite, the Properties
object is used to configure various aspects of the SQL parser and other components. The lex
property is one of the most commonly used properties, but there are several other properties that can be set to customize the behavior of Calcite. Here are some of the common properties:
-
caseSensitive
:- Determines whether identifiers are case-sensitive.
- Values:
true
orfalse
. - Example:
info.setProperty("caseSensitive", "true");
-
quoting
:- Specifies the quoting style for identifiers.
- Values:
DOUBLE_QUOTE
,BACK_QUOTE
,BRACKET
, etc. - Example:
info.setProperty("quoting", "DOUBLE_QUOTE");
-
unquotedCasing
:- Determines how unquoted identifiers are treated in terms of casing.
- Values:
TO_UPPER
,TO_LOWER
,UNCHANGED
. - Example:
info.setProperty("unquotedCasing", "TO_UPPER");
-
quotedCasing
:- Determines how quoted identifiers are treated in terms of casing.
- Values:
TO_UPPER
,TO_LOWER
,UNCHANGED
. - Example:
info.setProperty("quotedCasing", "UNCHANGED");
-
conformance
:- Specifies the SQL conformance level (e.g., SQL standard compliance).
- Values:
DEFAULT
,STRICT_92
,STRICT_99
,PRAGMATIC_99
,ORACLE_10
,SQL_SERVER_2008
, etc. - Example:
info.setProperty("conformance", "STRICT_99");
-
parserFactory
:- Specifies the factory class for creating the SQL parser.
- Values: Fully qualified class name of the parser factory.
- Example:
info.setProperty("parserFactory", "org.apache.calcite.sql.parser.impl.SqlParserImplFactory");
-
fun
:- Specifies the initial catalog to use.
- Values: Catalog name.
- Example:
info.setProperty("fun", "my_catalog");
-
model
:- Specifies the path to the model file (JSON or YAML) that defines the schema.
- Values: Path to the model file.
- Example:
info.setProperty("model", "path/to/model.json");
-
typeSystem
:- Specifies the type system to use.
- Values: Fully qualified class name of the type system.
- Example:
info.setProperty("typeSystem", "org.apache.calcite.rel.type.DelegatingTypeSystem");
-
timeZone
:- Specifies the time zone to use for date/time operations.
- Values: Time zone ID (e.g., "UTC", "America/New_York").
- Example:
info.setProperty("timeZone", "UTC");
-
locale
:- Specifies the locale to use for string comparisons and other locale-sensitive operations.
- Values: Locale string (e.g., "en_US", "fr_FR").
- Example:
info.setProperty("locale", "en_US");
-
defaultNullCollation
:- Specifies how NULL values are sorted in ORDER BY clauses.
- Values:
FIRST
,LAST
,HIGH
,LOW
. - Example:
info.setProperty("defaultNullCollation", "LAST");
-
forceDecorrelate
:- Forces decorrelation of subqueries.
- Values:
true
orfalse
. - Example:
info.setProperty("forceDecorrelate", "true");
-
materializationsEnabled
:- Enables or disables the use of materialized views.
- Values:
true
orfalse
. - Example:
info.setProperty("materializationsEnabled", "true");
-
createMaterializations
:- Allows the creation of materialized views.
- Values:
true
orfalse
. - Example:
info.setProperty("createMaterializations", "true");
These properties can be set in the Properties
object passed to Calcite's configuration methods, such as when creating a SqlParser
or FrameworkConfig
. The exact set of properties you need to set will depend on your specific use case and requirements.
Apache Calcite provides ready-made adapters to integrate with third-party data sources, including Cassandra, Elasticsearch, MongoDB, to name a few
The Apache Calcite adapters provide classes, such as ElasticsearchSchemaFactory, MongoSchemaFactory, FileSchemaFactory, implementing the interface SchemaFactory. The SchemaFactory helps connect the underlying data sources in a unified manner by creating a virtual Schema defined in a JSON/YAML model file.
Calcite allows you to create custom table adapters to read data from various sources. You can implement a custom adapter that reads CSV files from S3 using the AWS SDK for Java.
Steps: Use the AWS SDK for Java (aws-java-sdk-s3) to interact with S3 and read the CSV file. Implement a TableFactory or Table interface in Calcite to expose the CSV data as a table. Parse the CSV file and convert it into Calcite's relational data model (e.g., Enumerable or ScannableTable).
Apache Calcite's relational data model is at the core of its functionality, enabling it to process SQL queries and perform optimizations. The relational data model in Calcite is based on the principles of relational algebra, which provides a mathematical foundation for representing and manipulating data in tables (relations). Here's a detailed explanation of Calcite's relational data model:
Calcite's relational data model revolves around the following key concepts:
- Relational algebra is a formal system for manipulating relations (tables). It consists of operations like:
-
Selection (
σ
): Filter rows based on a condition. -
Projection (
π
): Select specific columns. -
Join (
⨝
): Combine rows from two tables based on a condition. -
Union (
∪
), Intersection (∩
), Difference (-
): Set operations. -
Aggregation: Compute summaries (e.g.,
SUM
,COUNT
).
-
Selection (
- Calcite translates SQL queries into relational algebra expressions, which are then optimized and executed.
- Calcite represents relational algebra operations as relational operators (e.g.,
Filter
,Project
,Join
,Aggregate
). - These operators form a tree-like structure called a relational expression tree, which represents the logical plan of a query.
- Calcite supports standard SQL data types, such as:
- Scalar types:
INTEGER
,VARCHAR
,BOOLEAN
,DATE
,TIMESTAMP
, etc. - Composite types:
ARRAY
,MAP
,ROW
.
- Scalar types:
- These types are defined using the
RelDataType
interface.
- A table is a collection of rows and columns, where each row represents a record and each column represents an attribute.
- In Calcite, a table is represented by the
Table
interface. - Tables can be:
- Physical tables (e.g., from a database or file).
- Virtual tables (e.g., views or derived tables).
- A schema is a container for tables, views, and other database objects.
- In Calcite, a schema is represented by the
Schema
interface. - Schemas can be nested, allowing for hierarchical organization of tables.
- A relational expression (
RelNode
) represents an operation in relational algebra. - Examples:
-
TableScan
: Reads data from a table. -
Filter
: Applies a condition to filter rows. -
Project
: Selects specific columns. -
Join
: Combines rows from two tables. -
Aggregate
: Computes summaries (e.g.,SUM
,COUNT
).
-
- These expressions form a tree structure, representing the logical plan of a query.
- Defines the structure of a table or expression, including column names and types.
- Example: setting the datatype for a row in table
RelDataTypeFactory typeFactory = new SqlTypeFactoryImpl(RelDataTypeSystem.DEFAULT); RelDataType rowType = typeFactory.builder() .add("id", SqlTypeName.INTEGER) .add("name", SqlTypeName.VARCHAR) .add("age", SqlTypeName.INTEGER) .build();
- Traits represent physical properties of relational expressions, such as:
-
Convention: The calling convention (e.g.,
Enumerable
,Jdbc
). - Collation: The sort order of rows.
- Distribution: How data is distributed across nodes (e.g., in a distributed system).
-
Convention: The calling convention (e.g.,
- Traits are used during query optimization to ensure the plan meets certain physical requirements.
- When a SQL query is parsed, Calcite converts it into a logical plan (a tree of
RelNode
expressions). - Example:
Logical plan:
SELECT name, age FROM users WHERE age > 30;
Project(name, age) Filter(age > 30) TableScan(users)
- Calcite applies a series of optimization rules to the logical plan to improve performance.
- Examples of optimizations:
- Predicate pushdown: Push filters closer to the data source.
- Join reordering: Reorder joins to minimize intermediate results.
- Projection pruning: Remove unnecessary columns.
- After optimization, the logical plan is converted into a physical plan by applying traits (e.g.,
Enumerable
,Jdbc
). - The physical plan is executable and can be translated into code (e.g., Java, SQL).
- The physical plan is executed to produce the query results.
- Calcite supports multiple execution engines, including:
- Enumerable: Executes plans in memory using Java iterators.
- Jdbc: Executes plans using JDBC connections to databases.
- Custom adapters: Execute plans using custom data sources (e.g., CSV, S3).
SchemaPlus rootSchema = Frameworks.createRootSchema(true);
rootSchema.add("users", new AbstractTable() {
@Override
public RelDataType getRowType(RelDataTypeFactory typeFactory) {
return typeFactory.builder()
.add("id", SqlTypeName.INTEGER)
.add("name", SqlTypeName.VARCHAR)
.add("age", SqlTypeName.INTEGER)
.build();
}
});
FrameworkConfig config = Frameworks.newConfigBuilder()
.defaultSchema(rootSchema)
.build();
Planner planner = Frameworks.getPlanner(config);
SqlNode sqlNode = planner.parse("SELECT name, age FROM users WHERE age > 30");
SqlNode validatedNode = planner.validate(sqlNode);
RelNode logicalPlan = planner.rel(validatedNode).project();
RelOptPlanner optPlanner = logicalPlan.getCluster().getPlanner();
RelNode optimizedPlan = optPlanner.findBestExp();
Enumerable<Object[]> result = EnumerableInterpretable.toEnumerable(optimizedPlan);
for (Object[] row : result) {
System.out.println(Arrays.toString(row));
}
The line:
Enumerable<Object[]> result = EnumerableInterpretable.toEnumerable(optimizedPlan);
is performing the final step of query execution in Apache Calcite, where a relational algebra plan (optimizedPlan
) is being converted into an executable enumerable result set.
-
optimizedPlan
is aRelNode
(Relational Expression)- Before this step, a SQL query is parsed, validated, and converted into a relational expression (
RelNode
). - The
RelNode
(optimizedPlan
) has already been optimized by Calcite’s query optimizer.
- Before this step, a SQL query is parsed, validated, and converted into a relational expression (
-
EnumerableInterpretable.toEnumerable(...)
Interprets the Relational Plan-
EnumerableInterpretable
is a Calcite utility that converts a logical relational expression (RelNode
) into an executableEnumerable<Object[]>
. - This means Calcite is interpreting the query plan rather than compiling it into a more optimized execution plan.
-
-
Enumerable<Object[]>
is the Final Executable Result Set-
Enumerable
(from Calcite’s LINQ-like API) represents an iterable sequence of results. - Each row in the result set is represented as an
Object[]
(an array of objects), where:- Each element in the
Object[]
array corresponds to a column value in that row.
- Each element in the
-
If we compare this to a traditional database:
- SQL Query → Logical Plan (
RelNode
) → Optimized Plan (RelNode
) → Execution Plan (Enumerable<Object[]>
) → Results.
Let's say we have a table employees with columns (id
, name
, salary
). If we execute:
SELECT id, name FROM employees;
The following happens in Calcite:
-
SQL is parsed → Converted into a
SqlNode
. - SQL is validated → Checked against schema definitions.
-
SQL is converted into a relational plan (
RelNode
). - Optimization is applied (logical to physical plan transformation).
-
Execution happens via
EnumerableInterpretable.toEnumerable(optimizedPlan)
, returning:
Enumerable<Object[]> result = EnumerableInterpretable.toEnumerable(optimizedPlan);
The resulting Enumerable<Object[]>
could produce rows like:
[ [1, "Alice"], [2, "Bob"], [3, "Charlie"] ]
- When you want to execute an optimized relational algebra plan without compiling it into Java bytecode.
- Useful in scenarios like interactive query execution where an immediate result is needed without JIT (Just-In-Time) compilation overhead.
For better performance, Calcite can compile queries into bytecode instead of interpreting them, using EnumerableCompiler
. But for many use cases, EnumerableInterpretable
is sufficient.
- Flexibility: Can represent data from diverse sources (e.g., databases, files, streams).
- Extensibility: Supports custom operators, data types, and optimizations.
- Optimization: Provides a rich set of optimization rules for efficient query execution.
- Interoperability: Integrates with multiple execution engines and data sources.
Calcite accesses physical data only at execution time, specifically when iterating over the Enumerable<Object[]>
. This means that data access happens lazily, not during query compilation or optimization.
-
SQL Parsing, Validation, and Conversion to Relational Plan (
RelNode
)- No data access occurs at this stage.
- The SQL is transformed into an abstract relational algebra representation.
-
Optimization (Logical & Physical Plan Generation)
- Still no actual data access.
- The optimizer applies rule-based and cost-based transformations.
-
Execution: Generating
Enumerable<Object[]>
- If using
EnumerableInterpretable.toEnumerable()
, execution is still lazy. - The query is converted into an iterable execution plan but doesn’t fetch data yet.
- If using
-
Accessing Data During Enumeration (
for-each
loop)-
Physical data is accessed when iteration starts:
for (Object[] row : compiledQuery) { System.out.println(Arrays.toString(row)); }
- This triggers the underlying data access layer (e.g., JDBC, file system, in-memory storage).
- Data is read row-by-row, processed, and returned.
-
Physical data is accessed when iteration starts:
✅ Only when next()
is called on the Enumerable<Object[]>
iterator.
- This means data is retrieved on demand, not all at once.
If you have a large dataset, Calcite does not load all records into memory. Instead, it processes them as needed:
Enumerable<Object[]> result = EnumerableInterpretable.toEnumerable(optimizedPlan);
Iterator<Object[]> iterator = result.iterator();
while (iterator.hasNext()) { // Data is accessed lazily here
Object[] row = iterator.next();
System.out.println(Arrays.toString(row));
}
🔹 Data is not accessed at query compilation time.
🔹 Data access begins when you iterate over Enumerable<Object[]>
.
🔹 Execution is lazy, fetching only the required rows on demand.
🔹 This minimizes memory usage for large datasets.