analytics database DuckDB
- https://duckdb.org
- https://www.infoq.com/articles/analytical-data-management-duckdb/
- https://www.youtube.com/watch?v=egN4TwVyJss
Query JSON
Example: query the GitHub API's repository listing with duckdb and count repos by license:
% curl 'https://api.github.com/orgs/golang/repos' > repos.json
% duckdb -c \
"select license->>'key' as license, count(*) as count \
from 'repos.json' \
group by 1 \
order by count desc"
┌──────────────┬───────┐
│   license    │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │    23 │
│ apache-2.0   │     5 │
│              │     2 │
└──────────────┴───────┘
DuckDB is an open source in-process SQL OLAP database management system. DuckDB can be thought of as "SQLite for analytics" - you can embed it in virtually any codebase and run it in virtually any environment with minimal complexity.
As an in-process database, DuckDB is a storage and compute engine that enables developers, data scientists and data analysts to power their code with extremely fast analyses using plain SQL. Additionally, DuckDB can analyze data wherever it lives, be it on your laptop or in the cloud.
DuckDB functions as a library linked directly into the application process, with a small footprint and no dependencies, allowing developers to easily integrate a SQL engine for analytics.
command-line client
DuckDB comes with a command-line interface for rapid prototyping, and you can try DuckDB right now using the hosted DuckDB shell.
SELECT AVG(trip_distance) FROM 's3://yellow_tripdata_20[12]*.parquet'
SELECT * FROM '~/local/files/file.parquet'
SELECT * FROM dataframe
SELECT * FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
Arrow tables, PostgreSQL databases or SQLite databases - DuckDB can directly query those too; no import required!
SELECT * FROM userdata.parquet;
If no table with the given name is found, DuckDB searches for other entities with that name, such as a Parquet file, and executes the query directly against it.
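The external databases mentioned above can be attached and queried in place. A minimal sketch using the sqlite extension (file and table names here are hypothetical):
INSTALL sqlite;
LOAD sqlite;
ATTACH 'my_database.db' AS mysqlite (TYPE sqlite);  -- hypothetical SQLite file
SELECT count(*) FROM mysqlite.some_table;           -- query it like any DuckDB table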
Python
pip install duckdb
DuckDB can even treat pandas dataframes as DuckDB tables and query them directly.
import pandas as pd
import duckdb
mydf = pd.DataFrame({'a' : [1, 2, 3]})
print(duckdb.query("SELECT sum(a) FROM mydf;").fetchall())
import duckdb
duckdb.sql("INSTALL httpfs")  # one-time install of the extension
duckdb.sql("LOAD httpfs")
# a raw.githubusercontent.com URL is needed - the github.com/.../blob/... page serves HTML, not Parquet (adjust the path if the file has moved)
duckdb.sql("SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb/master/data/parquet-testing/userdata1.parquet'").df()
DuckDB is imported as a Python package, an extension is loaded to enable communication with HTTPS resources, and a Parquet file is read from a URL and converted to a pandas DataFrame.
The LOAD httpfs call illustrates how DuckDB can be extended with plugins.
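As a quick sketch, extensions are installed once and then loaded per session; the built-in duckdb_extensions() table function shows what is available:
INSTALL httpfs;   -- one-time download of the extension
LOAD httpfs;      -- load it into the current session
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
ORDER BY extension_name;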
Integration
DuckDB enables users to connect to powerful BI tools like Tableau, Looker, or Superset with standard ODBC or JDBC drivers. Additionally, DuckDB is available in Python, R, Java, Node.js, Julia, C/C++, and WASM.
DuckDB is optimized for read operations and can also perform updates in a transactional ACID-compliant fashion. It stores data in a compressed columnar format, which provides the best performance for large-scale aggregations.
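A minimal sketch of a transactional update, assuming a hypothetical trips table in a persistent database file:
-- table and column names are hypothetical, for illustration only
BEGIN TRANSACTION;
UPDATE trips SET total_amount = total_amount * 1.1 WHERE passenger_count = 0;
DELETE FROM trips WHERE trip_distance < 0;
COMMIT;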
DuckDB has a vectorized query engine, enabling small batches of data to be analyzed at once on processors supporting SIMD (Single Instruction, Multiple Data). These small batches are optimized for locality to the CPU, utilizing the L1/L2/L3 caches, which have much lower latency than main memory.
The SQL engine is extremely thoroughly tested and aims to support PostgreSQL-style SQL, along with some special analytical functions and custom syntax that's helpful for analysts. You get window functions, statistical sampling, a good math library, and even support for working with nested data.
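A short sketch of that syntax, combining a window function with DuckDB's sampling clause (the trips table is hypothetical):
SELECT
    passenger_count,
    trip_distance,
    avg(trip_distance) OVER (PARTITION BY passenger_count) AS avg_distance_per_group
FROM trips
USING SAMPLE 10 PERCENT (bernoulli);  -- analyze a random ~10% of the rows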
Using duckdb to transfer data to PostgreSQL
Importing a Salesforce Bulk Export CSV (for example) into a Postgres table takes only a few lines of SQL:
INSERT INTO pgdb.accounts (id, created_at, updated_at)
SELECT "Id", "CreatedDate"::TIMESTAMP, "UpdatedDate"::TIMESTAMP
FROM read_csv('/path/to/file.csv')
WHERE "Id" NOT IN (SELECT id FROM pgdb.accounts)
The path can point at a cloud storage provider as well, which is really convenient. You can also run updates that join against the Postgres database, other CSV files, or even a MySQL database. The data transforms (casting to timestamp, uuid, etc.) have been very handy, along with all the other SQL niceties you get.
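The pgdb catalog used above is a Postgres database attached through DuckDB's postgres extension; a rough sketch (the connection string is a placeholder):
INSTALL postgres;
LOAD postgres;
-- placeholder connection string; adjust to the target database
ATTACH 'dbname=crm host=localhost user=me' AS pgdb (TYPE postgres);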