MonetDB - w4111/w4111.github.io GitHub Wiki

FOLLOW THIS TEMPLATE BY COPYING IT INTO YOUR OWN PAGE

Up to five students can be part of a team that contributes to a page. List the UNI and names of your team members, and what each person contributed in sufficient detail that the staff can identify your contributions.

jk4864 Jeewon Kim
- wrote the template, solve, relationship with 4111, and tutorial.
sk3394 Syklar jung
- Wrote the problem, alternatives, and tutorial feedback.

MonetDB - Column-store database management system

The Problem and Solution

1. The Problem MonetDB Addresses

Traditional row-based storage systems (e.g., MySQL, SQLite) read all columns of a row even if only a subset of columns is required for a query. For instance, if we only need two columns out of ten, a row-store will still read all ten, resulting in up to five times more data being processed than necessary. This inefficiency becomes more pronounced as the volume of data grows. By analogy, consider a scenario involving nested loop joins (NLJ)[from Lecture]: a row-based execution might require reading m + mn units of data, while a column-oriented approach might reduce it to roughly m + m(n/5) if only one-fifth of the columns are needed. Although other column-store systems exist—DuckDB, for example—DuckDB currently shines mainly at small to medium scales. It still faces challenges in parallel processing and may struggle to maintain top performance at large data warehouse scales.

2. How MonetDB Solves the Problem

As a column-oriented database, MonetDB overcomes the core drawbacks of row-based systems by reading only the needed columns. This approach significantly reduces I/O overhead. Beyond simply being a column-store, MonetDB introduces several advanced features—vertical fragmentation, CPU-tuned execution, tactical optimizers, adaptive indexing—that enable it to remain efficient even when dealing with very large datasets. These optimizations make MonetDB adept at handling large-scale data warehouses and complex analytical workloads. While DuckDB also benefits from a columnar format, MonetDB’s depth of optimizations often leads to better performance in massive, parallelized computing environments.

For more details, refer to MonetDB’s documentation: https://www.monetdb.org/documentation-Aug2024/user-guide/introduction-to-monetdb/key-concepts/

3. Alternatives and Comparative Pros & Cons

Alternatives:

Traditional row-based databases (MySQL, SQLite) are simpler but inefficient for analytical queries due to excessive I/O.
DuckDB, also column-based, is excellent for small to medium datasets and easy in-memory analytics, but it may not always scale as well to massive parallel workloads.

Pros of MonetDB:

Because it’s column-oriented, MonetDB retrieves only the necessary columns, significantly improving efficiency for analytical queries.
It excels in parallel execution environments, leveraging CPU-tuned execution and advanced optimizers to deliver consistently high performance, especially on powerful multi-core CPUs. •
Its design is well-suited to large-scale data processing, making it ideal for big data analytics and data warehouse environments.

Cons of MonetDB

Insert operations may be slower compared to row-based systems, as updating columnar storage can be more complex.
For smaller or medium-sized datasets, alternatives like DuckDB may occasionally outperform MonetDB. This is reminiscent of the idea that certain optimizers (e.g., Selinger-optimizer [from Lecture]) may not always choose the absolute best plan for smaller workloads. Thus, for less complex or smaller datasets, DuckDB might sometimes be the better choice. [the example of this will be given in tutorial]

Tutorial

basic tutorial How to use (Installation instructions are in the GitHub link at the very bottom)

Start new farm shell> monetdbd create /path/to/mydbfarm

shell> monetdbd start /path/to/mydbfarm

Create and release a Database (default in maintenance mode)

shell> monetdb create yourdbname shell> monetdb release yourdbname

Connect to the database with the following credentials:

shell> mclient -u monetdb -d yourdbname

default user: monetdb

default password: monetdb

Examples

Example motivated from Crime-insight (Jeewon project) [The content and type of data is a little different]

Results are pure query traversal time, excluding insert time. More details and reasons are mentioned above.

Example1: sqlite vs monetDB (Row-based vs column-based)

[Source code: compare_sqlite.py]

Example1:

DATA for Example 1

[num_rows = 100000]

example1_data

Example 1 query:"SELECT State, AVG(Severity) AS Avg_Severity FROM crime GROUP BY State"

Example 1 Result:

example1_result

Group by aggregation with large data

MONET DB is 14 times faster than Sqlite.

Example2:duckDB vs monetDB (Column-based vs Column-based)

[Sourcecode: compare.py]

DATA for Example 2

[num_rows = 100000]

example2_data

Example 2 query: "SELECT State, AVG(Severity) AS Avg_Severity FROM crime GROUP BY State"

Example 2 Result:

example2_result

Group by aggregation with large data

MONET DB is 2.41 times faster than Duck DB.

Github Link for tutorial

Monetdb Tutorial with example source code: https://github.com/gitgutgit/monetdbGuide