使用duckdb 1.2.0 版excel插件读写xlsx文件 - l1t1/note GitHub Wiki

1.手工下载excel插件压缩包。

wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/excel.duckdb_extension.gz

因为duckdb升级1.2.0的bug,截至2月6日,直接在duckdb中install插件会报错。

The file was built specifically for DuckDB version '1b8c9023d0' and can only be loaded with that version of DuckDB. (this version of DuckDB is 'v1.2.0')

所以手工下载插件,将来修复后,只需在duckdb中install excel;就能自动下载并安装。

2.安装并装载

D install 'excel.duckdb_extension.gz';
D load excel;

3.写入xlsx文件并读取

D copy (select 42 a) to 'ex.xlsx'(format 'xlsx', header 'true');
D select * from 'ex.xlsx';
┌────────┐
│   a    │
│ double │
├────────┤
│  42.0  │
└────────┘

4.性能测试 4.1 下载spatial和tpch插件压缩包。

wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/spatial.duckdb_extension.gz
wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/tpch.duckdb_extension.gz

4.2安装并装载

D install 'spatial.duckdb_extension.gz';
D load spatial;
D install 'tpch.duckdb_extension.gz';
D load tpch;

4.3生成tpch 1GB数据集

D call dbgen(sf=1);

4.4生成1048576行的测试表

D create table li as from lineitem limit 1048576;

4.5生成1048576行数据的xlsx文件

D .timer on
D copy (from li) to 'exli.xlsx'(format 'xlsx', header 'true'); --表头和数据超过了1048576行
 Run Time (s): real 102.054 user 101.796000 sys 0.112000
Invalid Input Error:
XLSX: Sheet row limit of '1048576' rows exceeded!
 * XLSX files and compatible applications generally have a limit of '1048576' rows
 * You can export larger sheets at your own risk by setting the 'sheet_row_limit' parameter to a higher value
D select count(*) from 'exli.xlsx'; --xlsx文件已损坏
Run Time (s): real 1.297 user 1.016000 sys 0.132000
Binder Error:
No [Content_Types].xml found in xlsx file


D copy (from li limit 1048575) to 'exli2.xlsx'(format 'xlsx', header 'true'); --表头和数据正好1048576行
 
Run Time (s): real 102.721 user 205.616000 sys 0.668000

D COPY (select l_orderkey,l_partkey,l_suppkey,l_linenumber,
  l_quantity::double l_quantity, --因spatial不支持DECIMAL(15,2)数据类型,故强制转换为double
  l_extendedprice::double l_extendedprice,
  l_discount::double l_discount,
  l_tax::double l_tax,
  l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from li limit 1048575) to 'exli3.xlsx'(format 'GDAL', DRIVER 'xlsx');
Run Time (s): real 63.094 user 125.972000 sys 1.140000

写xlsx还是spatial插件更快,约快一倍。 4.6读取1048576行数据的xlsx文件

D select count(*) from 'exli2.xlsx';
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1048575     │
│ (1.05 million) │
└────────────────┘
Run Time (s): real 23.386 user 45.072000 sys 0.196000
D select count(*),l_shipmode from 'exli2.xlsx' group by l_shipmode; --excel插件

Run Time (s): real 23.275 user 23.652000 sys 0.176000
D select count(*),l_shipmode from li group by l_shipmode; --访问duckdb表

Run Time (s): real 0.012 user 0.040000 sys 0.000000
D select count(*),l_shipmode from st_read('exli2.xlsx') group by l_shipmode; --spatial插件

Run Time (s): real 56.204 user 55.652000 sys 0.756000

D select count(*),l_shipmode from 'exli3.xlsx' group by l_shipmode;

Run Time (s): real 20.908 user 21.848000 sys 0.064000

用excel插件读取的速度大约是spatial插件的2倍。但是与duckdb格式没法比,如果要多次读取数据分析,还是先转成duckdb格式为好。奇怪的是,读取spatial生成的xlsx文件更快一点。