使用duckdb 1.2.0 版excel插件读写xlsx文件 - l1t1/note GitHub Wiki
1.手工下载excel插件压缩包。
wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/excel.duckdb_extension.gz
因为duckdb升级1.2.0的bug,截至2月6日,直接在duckdb中install插件会报错。
The file was built specifically for DuckDB version '1b8c9023d0' and can only be loaded with that version of DuckDB. (this version of DuckDB is 'v1.2.0')
所以手工下载插件,将来修复后,只需在duckdb中install excel;
就能自动下载并安装。
2.安装并装载
D install 'excel.duckdb_extension.gz';
D load excel;
3.写入xlsx文件并读取
D copy (select 42 a) to 'ex.xlsx'(format 'xlsx', header 'true');
D select * from 'ex.xlsx';
┌────────┐
│ a │
│ double │
├────────┤
│ 42.0 │
└────────┘
4.性能测试 4.1 下载spatial和tpch插件压缩包。
wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/spatial.duckdb_extension.gz
wget -c http://extensions.duckdb.org/5f5512b827/linux_arm64/tpch.duckdb_extension.gz
4.2安装并装载
D install 'spatial.duckdb_extension.gz';
D load spatial;
D install 'tpch.duckdb_extension.gz';
D load tpch;
4.3生成tpch 1GB数据集
D call dbgen(sf=1);
4.4生成1048576行的测试表
D create table li as from lineitem limit 1048576;
4.5生成1048576行数据的xlsx文件
D .timer on
D copy (from li) to 'exli.xlsx'(format 'xlsx', header 'true'); --表头和数据超过了1048576行
Run Time (s): real 102.054 user 101.796000 sys 0.112000
Invalid Input Error:
XLSX: Sheet row limit of '1048576' rows exceeded!
* XLSX files and compatible applications generally have a limit of '1048576' rows
* You can export larger sheets at your own risk by setting the 'sheet_row_limit' parameter to a higher value
D select count(*) from 'exli.xlsx'; --xlsx文件已损坏
Run Time (s): real 1.297 user 1.016000 sys 0.132000
Binder Error:
No [Content_Types].xml found in xlsx file
D copy (from li limit 1048575) to 'exli2.xlsx'(format 'xlsx', header 'true'); --表头和数据正好1048576行
Run Time (s): real 102.721 user 205.616000 sys 0.668000
D COPY (select l_orderkey,l_partkey,l_suppkey,l_linenumber,
l_quantity::double l_quantity, --因spatial不支持DECIMAL(15,2)数据类型,故强制转换为double
l_extendedprice::double l_extendedprice,
l_discount::double l_discount,
l_tax::double l_tax,
l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from li limit 1048575) to 'exli3.xlsx'(format 'GDAL', DRIVER 'xlsx');
Run Time (s): real 63.094 user 125.972000 sys 1.140000
写xlsx还是spatial插件更快,约快一倍。 4.6读取1048576行数据的xlsx文件
D select count(*) from 'exli2.xlsx';
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 1048575 │
│ (1.05 million) │
└────────────────┘
Run Time (s): real 23.386 user 45.072000 sys 0.196000
D select count(*),l_shipmode from 'exli2.xlsx' group by l_shipmode; --excel插件
Run Time (s): real 23.275 user 23.652000 sys 0.176000
D select count(*),l_shipmode from li group by l_shipmode; --访问duckdb表
Run Time (s): real 0.012 user 0.040000 sys 0.000000
D select count(*),l_shipmode from st_read('exli2.xlsx') group by l_shipmode; --spatial插件
Run Time (s): real 56.204 user 55.652000 sys 0.756000
D select count(*),l_shipmode from 'exli3.xlsx' group by l_shipmode;
Run Time (s): real 20.908 user 21.848000 sys 0.064000
用excel插件读取的速度大约是spatial插件的2倍。但是与duckdb格式没法比,如果要多次读取数据分析,还是先转成duckdb格式为好。奇怪的是,读取spatial生成的xlsx文件更快一点。