File formats - bbuchfink/diamond GitHub Wiki

Parquet

To use DIAMOND output in big data analytics, we recommend the Apache Parquet file format and/or the DuckDB database system.

The DuckDB command-line interface can convert DIAMOND tabular output into Parquet or the DuckDB database format, either from an intermediate TSV file or by piping the output of DIAMOND directly into DuckDB. In both cases, the DIAMOND tabular output format should be written with a header (option --header simple). When using a pipe, do not specify an output file, so that DIAMOND writes to standard output.
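Because DuckDB takes the column names from the header line, it can help to confirm that an intermediate TSV file really has a header whose field count matches every data row before converting it. The following is a minimal pre-flight sketch, not part of DIAMOND or DuckDB: the function name check_tsv and the sample column names and values are assumptions made for illustration.

```shell
# check_tsv FILE
# Succeeds only if every line of FILE has as many tab-separated fields
# as its first (header) line; mismatches are reported on stderr.
check_tsv() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: %d fields, header has %d\n", NR, NF, n; bad = 1 }
        END     { exit bad }
    ' "$1" >&2
}

# Hypothetical two-line sample standing in for real DIAMOND output.
sample=$(mktemp)
printf 'qseqid\tsseqid\tpident\tevalue\tbitscore\n' >  "$sample"
printf 'query1\tsubject1\t98.5\t1e-50\t200\n'       >> "$sample"

if check_tsv "$sample"; then
    echo "header and rows are consistent"
fi
rm -f "$sample"
```

On a real file, the call would simply be check_tsv input.tsv, run before any of the conversion commands.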

From TSV to Parquet:

duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"

From DIAMOND to Parquet:

diamond PARAMETERS | duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(select * from read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"
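The file-based and piped Parquet conversions differ only in the input path handed to read_csv_auto ('input.tsv' versus '/dev/stdin'). As a rough sketch, a small helper can generate the COPY statement for either case. The function name copy_to_parquet_sql is hypothetical, and the memory and thread settings are left out here for brevity.

```shell
# copy_to_parquet_sql INPUT OUTPUT
# Prints the COPY statement for converting INPUT (a TSV path or /dev/stdin)
# into the Parquet file OUTPUT.
copy_to_parquet_sql() {
    # '\t' is passed through %s so that it reaches DuckDB as the two
    # characters backslash and t, exactly as in the one-line commands.
    printf "COPY (select * from read_csv_auto('%s', delim='%s', header=true, parallel=true)) TO '%s' WITH (FORMAT 'PARQUET')\n" \
        "$1" '\t' "$2"
}

# Show the two statements; only the input path changes.
copy_to_parquet_sql input.tsv output.parquet
copy_to_parquet_sql /dev/stdin output.parquet

# Possible usage (requires duckdb and diamond to be installed):
#   duckdb -c "$(copy_to_parquet_sql input.tsv output.parquet)"
#   diamond PARAMETERS | duckdb -c "$(copy_to_parquet_sql /dev/stdin output.parquet)"
```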

From TSV to DuckDB database:

duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; create table alignments as select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)"

From DIAMOND to DuckDB database:

diamond PARAMETERS | duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; create table alignments as select * from read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)"

The DuckDB memory limit (memory_limit) and thread count (threads) in these commands should be adjusted to match the memory and number of cores available on your system.
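One way to avoid hard-coding these values is to build the SET prefix from the machine's core count. This is only a sketch under assumptions: the helper names duckdb_settings and detect_threads are hypothetical, the core count comes from getconf, and the memory limit is still chosen by hand.

```shell
# duckdb_settings MEMORY_LIMIT THREADS
# Prints the "SET ...;" prefix used at the start of the duckdb -c commands.
duckdb_settings() {
    printf "SET memory_limit='%s'; SET threads=%s;" "$1" "$2"
}

# detect_threads
# Prints the number of online CPU cores, or 1 if getconf cannot report it.
detect_threads() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Example: keep the 16GB limit but use every available core.
SETTINGS=$(duckdb_settings 16GB "$(detect_threads)")
echo "$SETTINGS"

# The prefix can then be placed in front of a conversion statement, e.g.:
#   duckdb -c "$SETTINGS COPY (select * from read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"
```

On a shared machine it is usually safer to pass a smaller thread count explicitly than to claim every core.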

Benchmarks:

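| TSV size | TSV to Parquet | TSV to DuckDB database |
|----------|----------------|------------------------|
| 12 GB    | 0m33.157s      | 0m30.596s              |
| 24 GB    | 1m4.645s       | 0m51.964s              |
| 48 GB    | 2m4.457s       | 1m37.319s              |
| 96 GB    | 3m59.649s      | 3m2.770s               |
| 192 GB   | 9m26.08s       | 7m20.66s               |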

The benchmarks were run with at most 20 cores in parallel. For comparison, converting a 12 GB TSV file to Parquet took 6 minutes on a MacBook.
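To compare the rows of the benchmark table more easily, the timings can be turned into a throughput in GB of input TSV per second. The sketch below only does arithmetic on the published numbers. The helper names to_seconds and throughput are made up for this example.

```shell
# to_seconds TIME
# Converts a time such as "1m4.645s" into seconds.
to_seconds() {
    echo "$1" | awk -F 'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# throughput SIZE_GB TIME
# Prints SIZE_GB divided by TIME, in GB per second, with two decimals.
throughput() {
    awk -v gb="$1" -v s="$(to_seconds "$2")" 'BEGIN { printf "%.2f\n", gb / s }'
}

# Rows copied from the benchmark table: size in GB, Parquet time, DuckDB time.
while read -r size parquet ddb; do
    printf '%s GB: Parquet %s GB/s, DuckDB database %s GB/s\n' \
        "$size" "$(throughput "$size" "$parquet")" "$(throughput "$size" "$ddb")"
done <<'EOF'
12 0m33.157s 0m30.596s
24 1m4.645s 0m51.964s
48 2m4.457s 1m37.319s
96 3m59.649s 3m2.770s
192 9m26.08s 7m20.66s
EOF
```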