ClickHouse - w4111/w4111.github.io GitHub Wiki

FOLLOW THIS TEMPLATE BY COPYING IT INTO YOUR OWN PAGE

Up to five students can be part of a team that contributes to a page. List the UNI and names of your team members and what each person contributed in sufficient detail that the staff can identify your contributions.

hs3455
- ALL

ClickHouse

The Problem and Solution

The Problem:

Modern analytical workloads require querying large datasets in near real-time. Traditional relational row-oriented database, such as MySQL and PostSQL, often struggle with slow query performance for analytics due to the lack of columnar storage and optimizations tailored for analytical queries. In a row-oriented database, even if a query only requires a few specific columns, the system must still load data from all columns in the same rows from disk into memory. This happens because data on disk is stored in chunks called blocks. These blocks are the smallest units of data that can be read from disk to memory. When data is requested, the disk I/O subsystem reads the entire block containing the needed data, even if only a portion of that block is relevant to the query.

The Solution:

ClickHouse solves this issue through a column-oriented database design. In this system, data is stored sequentially by columns, meaning all the values for a single column are grouped together. While this structure makes reconstructing individual rows more challenging due to the scattered storage of row values, it significantly enhances the performance of column-based operations like filtering and aggregation, as only the relevant column data needs to be accessed.

Alternatives and Comparison:

Alternatives: MySQL, PostgreSQL, Snowflake, AWS Redshift, Google BigQuery.

Pros and Cons of ClickHouse:

Better scalability and superior performance for analytical queries compared to traditional row-oriented databases such as MySQL and PostgreSQL.
Faster queries, Better compression, and Reduction in costs compare to other column-oriented databases, including Snowflake, AWS Redshift, and Google BigQuery.
Open-source
ClickHouse is bad at update-heavy data compared to traditional database

How it Relates to Concepts from 4111

Model architectures:

While relational databases like PostgreSQL use row-based storage, ClickHouse uses columnar storage, enabling faster query execution for analytical queries.

Query Optimization:

ClickHouse’s query optimization relies heavily on its columnar architecture. It minimizes disk I/O by reading only the required columns and employs compression to reduce data size. ClickHouse offers mechanisms to trade off accuracy for improved performance.

SQL Features:

ClickHouse supports SQL language for data manipulation.

Query Execution:

Query execution is vectorized and parallelized to improve efficiency by utilizing all possible resources.

Tutorial

Example

The operation of ClickHouse is basically the same as SQL we learned in the w4111 lectures. You could try it on ClickHouse PlayGround(https://sql.clickhouse.com/) to test its performance. Here, I demonstrated the efficiency of ClickHouse on the aggregation function Count().

Tutorial

It can be observed that it optimizes the query automatically so it doesn't have to iterate through the whole table. And it only takes 0.029s to return the result on such a big dataset