DuckDB - ghdrako/doc_snipets GitHub Wiki

Query json

https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/

Project with duckdb

https://github.com/davidgasquez/awesome-duckdb

% curl 'https://api.github.com/orgs/golang/repos' > repos.json
% duckdb -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc"
┌──────────────┬───────┐
│   license    │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │    23 │
│ apache-2.0   │     5 │
│              │     2 │
└──────────────┴───────┘

DuckDB is an open source in-process SQL OLAP database management system. DuckDB can be thought of as “SQLite for analytics” - you can embed it in virtually any codebase and run it in virtually any environment with minimal complexity.

As an in-process database, DuckDB is a storage and compute engine that enables developers, data scientists and data analysts to power their code with extremely fast analyses using plain SQL. Additionally, DuckDB can analyze data wherever it lives, be it on your laptop or in the cloud.

DuckDB functions as a library linked directly to the application process, boasting a small footprint and no dependencies and allowing developers to easily integrate a SQL engine for analytics.

command-line client

DuckDB comes with a command-line interface for rapid prototyping, and you can try DuckDB right now using the hosted DuckDB shell.

SELECT AVG(trip_distance) FROM 's3://yellow_tripdata_20[12]*.parquet'
SELECT * FROM '~/local/files/file.parquet'
SELECT * FROM dataframe
SELECT * FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'

Arrow tables, PostgreSQL databases or SQLite databases - DuckDB can directly query those too; no import required!

SELECT * FROM userdata.parquet;

If a table with a specific name is not found, we search for other entities with that name, such as a Parquet file, directly executing queries on it.

Python

pip install duckdb

DuckDB can even treat pandas dataframes as DuckDB tables and query them directly.

import pandas as pd
import duckdb
mydf = pd.DataFrame({'a' : [1, 2, 3]})
print(duckdb.query("SELECT sum(a) FROM mydf;").fetchall())

import duckdb
duckdb.sql('LOAD httpfs')
duckdb.sql("SELECT * FROM 'https://github.com/duckdb/duckdb/blob/master/data/parquet-testing/userdata1.parquet'").df()

DuckDB is imported as a Python package, an extension is loaded to enable communication with HTTPS resources, and a Parquet file is read from a URL and converted back to a Panda DataFrame (DF).

The LOAD httpfs call illustrates how DuckDB can be expanded with plugins.

Integration

DuckDB enables users to connect to powerful BI tools like Tableau, Looker, or Superset with standard ODBC or JDBC drivers. Additionally, DuckDB is available in Python, R, Javan, node.JS, Julia, C/C++, and WASM.

DuckDB is optimized for read operations and can also perform updates in a transactional ACID-compliant fashion. It stores data in a compressed columnar format, which provides the best performance for large-scale aggregations.

DuckDB has a vectorized query engine, enabling small batches of data to be analyzed simultaneously via processors supporting SIMD (Simultaneous Instruction on Multiple Data). These small batches are optimized for locality to the CPU, utilizing the L1/L2/L3 caches which have the lowest latency, as opposed to only using main memory.

The SQL engine is extremely thoroughly tested and aims to support PostgreSQL-style SQL, along with some special analytical functions and custom syntax that’s helpful for analysts. You get window functions, statistical sampling, a good math library, and even support for working with nested data.