File formats - bbuchfink/diamond GitHub Wiki

Parquet

To use DIAMOND output in big data analytics, we recommend the Apache Parquet file format and/or the DuckDB database system.

The DuckDB command line interface can convert DIAMOND tabular output into Parquet or the DuckDB database format, either from an intermediate TSV file or by piping the output of DIAMOND directly into DuckDB. In both cases, the tabular output should include a header line (option --header simple) so that DuckDB can take the column names from it. When piping, do not specify an output file, so that DIAMOND writes its results to standard output.
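Because DuckDB takes the column names from the header line, it can help to confirm that a TSV file starts with a header and that every row has the same number of fields before converting it. The following is a minimal sketch using standard shell tools. The file name sample.tsv, its five columns, and its values are made up for illustration and only stand in for real DIAMOND output.

```shell
# Build a tiny stand-in for DIAMOND tabular output with a header line.
# File name, column selection, and values are hypothetical.
printf 'qseqid\tsseqid\tpident\tevalue\tbitscore\n'  > sample.tsv
printf 'q1\ts1\t98.5\t1e-50\t200\n'                 >> sample.tsv
printf 'q2\ts7\t45.1\t3e-08\t55.2\n'                >> sample.tsv

# Show the column names that DuckDB would pick up from the header.
HEADER=$(head -n 1 sample.tsv)
echo "$HEADER" | tr '\t' '\n'

# Verify that every data row has as many tab-separated fields as the header.
CHECK=$(awk -F'\t' 'NR == 1 { n = NF; next } NF != n { bad++ } END { print (bad ? "inconsistent" : "ok") }' sample.tsv)
echo "field count check: $CHECK"
```

A file that passes this check can then be used as input.tsv in the conversion commands.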

From TSV to Parquet:

duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"

From DIAMOND to Parquet:

diamond PARAMETERS | duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(select * from read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"

From TSV to DuckDB database:

duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; create table alignments as select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)"

From DIAMOND to DuckDB database:

diamond PARAMETERS | duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; create table alignments as select * from read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)"
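Once a conversion has finished, the result can be queried with SQL from the DuckDB command line. The sketch below keeps the highest-scoring hit per query sequence from output.parquet. It assumes that the columns qseqid, sseqid, and bitscore are present under those names in the header. Adjust the names if a different set of output fields was requested.

```shell
# Best hit per query sequence, ranked by bit score.
# Assumption: the header contains qseqid, sseqid and bitscore.
QUERY="
SELECT qseqid, sseqid, bitscore
FROM (
    SELECT *, row_number() OVER (PARTITION BY qseqid ORDER BY bitscore DESC) AS hit_rank
    FROM 'output.parquet'
) AS ranked
WHERE hit_rank = 1
ORDER BY qseqid;"

# Run the query only if DuckDB and the Parquet file are actually available.
if command -v duckdb >/dev/null 2>&1 && [ -f output.parquet ]; then
    duckdb -c "$QUERY"
else
    echo "duckdb or output.parquet not found; query not executed"
fi
```

To run the same query against a DuckDB database instead, replace 'output.parquet' with the table name alignments and pass DATABASE_NAME as the first argument to duckdb.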

The DuckDB memory limit (memory_limit) and thread count (threads) in the commands above should be adjusted to the RAM and CPU cores available on the system.
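As a rough illustration, the two settings can be derived from the machine instead of being hard-coded. This sketch assumes a Linux-like system that provides getconf and /proc/meminfo, and falls back to the values used in the commands above otherwise. Using half of the physical RAM is an arbitrary choice made here, not a recommendation from DIAMOND or DuckDB.

```shell
# Number of online CPU cores; fall back to 16 if it cannot be determined.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 16)
case "$THREADS" in
    ''|*[!0-9]*) THREADS=16 ;;
esac

# Half of the physical RAM in GB (at least 1), read from /proc/meminfo if present.
MEM_GB=""
if [ -r /proc/meminfo ]; then
    MEM_GB=$(awk '/^MemTotal:/ { gb = int($2 / 1024 / 1024 / 2); print (gb < 1 ? 1 : gb) }' /proc/meminfo)
fi
MEM_GB=${MEM_GB:-16}

# Settings prefix that can be placed in front of any of the SQL statements above.
SETTINGS="SET memory_limit='${MEM_GB}GB'; SET threads=${THREADS};"
echo "$SETTINGS"
```

The resulting string can replace the hard-coded settings, for example: duckdb -c "$SETTINGS COPY(select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"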

Benchmarks:

TSV file size	TSV to Parquet	TSV to DuckDB database
12 GB	0m33.157s	0m30.596s
24 GB	1m4.645s	0m51.964s
48 GB	2m4.457s	1m37.319s
96 GB	3m59.649s	3m2.770s
192 GB	9m26.08s	7m20.66s

The benchmarks were run with at most 20 cores in parallel. For comparison, converting a 12 GB TSV file into Parquet took 6 minutes on a MacBook.