Clustering - bbuchfink/diamond GitHub Wiki

DIAMOND clusters protein sequences in a manner analogous to CD-HIT or UCLUST, based on a user-defined clustering criterion. It finds a set of representative sequences (also called centroids) and assigns each input sequence to the cluster of one representative such that the clustering criterion against the representative is fulfilled. The clustering criterion is defined by the sequence coverage of the local alignment as well as its sequence identity (see below). Because the cascaded clustering algorithm is heuristic, these cutoff values guide the computation, but their fulfillment is not always guaranteed unless the recluster workflow is used (see below).

Basic command line example:

# fast clustering with linear scaling
diamond linclust -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
# sensitive clustering using all-vs-all alignment
diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G

The linclust workflow is faster and more efficient, but less sensitive. Both workflows can be used to cluster at any sequence identity threshold. For clustering very large datasets at higher identities (>50%), the linclust workflow is particularly recommended, as clustering based on all-vs-all alignment can become very expensive in this regime.

When using the clustering feature, please cite:

Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373; doi: https://doi.org/10.1101/2023.01.24.525373

`cluster`/`linclust` workflow

Cluster an input database of protein sequences.

--database/-d

The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format.
--out/-o

Output file. This is a 2-column tabular file with the representative accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.
--header

Enable a header line in the output file.
--memory-limit/-M #

Set a memory limit for the diamond process (for example: -M 64G). This is not a hard upper limit and may still be exceeded in certain cases. Decrease this number if the tool fails due to running out of memory. Higher values substantially improve performance, so it is strongly recommended to always set this option, but they also increase the use of temporary disk space. This option affects the algorithm and therefore the results, since clustering is a heuristic procedure with no unique solution.
--approx-id #

The identity cutoff for the clustering (in %). For performance reasons, this setting refers to the approximate sequence identity derived by linear regression from the bitscore, not the actual number of identities in the alignment. The default value is 90% when running diamond linclust, 50% when running diamond cluster and 0% when running diamond deepclust.
--member-cover #

The minimum coverage of the cluster member sequence by the representative (in %). This is a unidirectional coverage, i.e. a minimum coverage of the representative is not required. The default is 80%.
--mutual-cover #

The minimum coverage of both the cluster member sequence and the cluster representative sequence (in %). This enables bi-directional coverage clustering and overrides the --member-cover option.
--no-block-size-limit

Do not limit the block size to recommended maximums.
--cluster-steps

Set the sequence of clustering rounds for cascaded clustering as a space-separated list. Permitted keywords are the sensitivity switches of the alignment workflow (e.g. sensitive). The suffix _lin can be appended to a keyword to trigger linearization of the search. An example list is: faster_lin fast default sensitive very-sensitive. When this parameter is omitted, the steps are chosen automatically based on the --approx-id parameter.
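
The 2-column output described under --out/-o can be loaded for downstream analysis. The following is a minimal sketch, not part of DIAMOND: the function name `read_clusters` and the example accessions are made up for illustration, and the example assumes that a representative may also be listed as a member of its own cluster.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, has_header=False):
    """Map each representative accession to the list of its member accessions."""
    clusters = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the line written when --header is set
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up content standing in for OUTPUT_FILE (representative, member).
example = io.StringIO("repA\trepA\nrepA\tseq2\nrepB\trepB\nrepA\tseq3\n")
clusters = read_clusters(example)

# Cluster sizes, largest first.
sizes = sorted((len(members) for members in clusters.values()), reverse=True)
print(len(clusters), "clusters with sizes", sizes)
```

For a real run, pass an open file handle for the output file instead of the in-memory example.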

`realign` workflow

Given a clustering computed by the cluster workflow as input, this workflow computes alignments of all sequences in the original database against their assigned representative sequences.

--clusters The clustering as 2-column tabular format.
--outfmt/-f Set the output format. Only tabular format is supported for this workflow. The default corresponds to the format -f 6 qseqid sseqid approx_pident qstart qend sstart send evalue bitscore of the alignment workflow, where the query and subject correspond to the representative and the cluster member sequence respectively.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M.
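
The default tabular output of this workflow can be parsed to inspect how well each member matches its representative. The sketch below assumes the default nine columns listed above and that the fields are tab-separated; `RealignRow`, `parse_realign`, `members_below_identity` and the example rows are hypothetical names and data, not part of DIAMOND.

```python
import io
from typing import NamedTuple


class RealignRow(NamedTuple):
    representative: str   # qseqid (the query is the representative)
    member: str           # sseqid (the subject is the cluster member)
    approx_pident: float
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_realign(handle, skip_header=False):
    """Parse realign output in the default 9-column tabular layout."""
    if skip_header:
        next(handle, None)  # assumption: --header adds exactly one leading line
    rows = []
    for line in handle:
        f = line.rstrip("\n").split("\t")
        if len(f) != 9:
            continue  # not the default layout; skip rather than guess
        rows.append(RealignRow(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                               int(f[5]), int(f[6]), float(f[7]), float(f[8])))
    return rows


def members_below_identity(rows, cutoff):
    """Return members whose approximate identity to the representative is below cutoff (%)."""
    return [r.member for r in rows if r.approx_pident < cutoff]


# Made-up example rows.
example = io.StringIO(
    "repA\tseq2\t85.3\t1\t120\t3\t122\t1e-60\t230.0\n"
    "repA\tseq3\t28.9\t5\t90\t1\t86\t4.2e-12\t61.5\n"
)
rows = parse_realign(example)
low = members_below_identity(rows, 30.0)
print(low)
```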

`recluster` workflow

Fixes errors in a given clustering where a cluster member sequence does not satisfy the clustering criterion against its representative. Such errors may arise from the heuristic nature of the cascaded clustering algorithm, which merges clusters based on alignments of their representative sequences.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
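
To make the clustering criterion concrete, the check applied to a single member/representative alignment might look roughly like the sketch below. It is an illustration under stated assumptions, not DIAMOND's implementation: coverage is assumed to be the aligned range divided by the sequence length, coordinates are assumed to be 1-based and inclusive, and the `Alignment` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    member_start: int   # 1-based, inclusive (assumption)
    member_end: int
    member_len: int
    rep_start: int
    rep_end: int
    rep_len: int
    approx_id: float    # approximate identity in %


def coverage(start, end, length):
    """Percentage of a sequence spanned by the aligned range."""
    return 100.0 * (end - start + 1) / length


def satisfies_criterion(aln, approx_id=0.0, member_cover=80.0, mutual_cover=None):
    """Check identity and coverage cutoffs for one member/representative pair."""
    if aln.approx_id < approx_id:
        return False
    member_cov = coverage(aln.member_start, aln.member_end, aln.member_len)
    if mutual_cover is not None:
        # Bi-directional mode: overrides member_cover, both sequences must be covered.
        rep_cov = coverage(aln.rep_start, aln.rep_end, aln.rep_len)
        return member_cov >= mutual_cover and rep_cov >= mutual_cover
    # Unidirectional mode: only the member needs to be covered.
    return member_cov >= member_cover


# A short member (90% covered) aligned to a long representative (30% covered).
aln = Alignment(member_start=1, member_end=90, member_len=100,
                rep_start=11, rep_end=100, rep_len=300, approx_id=55.0)
print(satisfies_criterion(aln, approx_id=50.0, member_cover=80.0))
print(satisfies_criterion(aln, approx_id=50.0, mutual_cover=80.0))
```

The example pair passes under unidirectional coverage but fails under mutual coverage, which is the practical difference between --member-cover and --mutual-cover.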

`reassign` workflow

For a given clustering, attempts to reassign each non-representative sequence to the closest representative sequence that satisfies the clustering criterion, where closeness is measured by the e-value of the local alignment.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
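
The reassignment rule can be illustrated with a small sketch. This is a conceptual model only, assuming the alignments of members against representatives are already available as tuples. The function name `reassign`, the tuple layout and the example data are hypothetical, and members without a qualifying hit are assumed to keep their current representative.

```python
def reassign(current, hits):
    """Move each non-representative to the qualifying representative with the lowest e-value.

    current: dict mapping every sequence to its representative
             (representatives map to themselves).
    hits:    iterable of (member, representative, evalue, satisfies_criterion).
    """
    representatives = set(current.values())
    best = {}
    for member, rep, evalue, ok in hits:
        if not ok or rep not in representatives:
            continue  # hit fails the criterion or target is not a representative
        if member not in current or member in representatives:
            continue  # only non-representative sequences are reassigned
        if member not in best or evalue < best[member][1]:
            best[member] = (rep, evalue)
    # Sequences without a qualifying hit keep their current representative.
    return {m: (best[m][0] if m in best else r) for m, r in current.items()}


# Made-up example data.
current = {"repA": "repA", "repB": "repB", "s1": "repA", "s2": "repA"}
hits = [
    ("s1", "repA", 1e-20, True),
    ("s1", "repB", 1e-40, True),   # closer representative for s1
    ("s2", "repB", 1e-50, False),  # closer, but fails the criterion
    ("s2", "repA", 1e-10, True),
    ("repA", "repB", 1e-5, True),  # representatives are left alone
]
updated = reassign(current, hits)
print(updated)
```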

`greedy-vertex-cover` workflow

Computes a greedy vertex cover clustering based on alignment input.

--edges Input file containing alignments/graph edges for clustering. By default, a TSV file with 5 columns is expected: query target query-cover target-cover edge-weight.
--database/-d A TSV file whose first column needs to be a list of all accessions that occur in the edges file as either query or target. This must not be a sequence database file.
--edge-format (triplet) Enable triplet edge format: query target edge-weight. Each edge is interpreted as a unidirectional representation of the query by the target.
--centroid-out Output file for representative list.
--out/-o The output clustering as a 2-column TSV format. This file does not group clusters together.

These parameters of the cluster workflow apply accordingly: --header, --member-cover.
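
For intuition, a textbook greedy vertex cover over the default 5-column edges can be sketched as follows. This is not DIAMOND's implementation: it assumes an edge lets the target represent the query when the query coverage reaches the member-cover cutoff, it ignores the edge weight, and it uses a simple quadratic loop that is only suitable for small inputs. The function name and example data are hypothetical.

```python
def greedy_vertex_cover(nodes, edges, member_cover=80.0):
    """Return a dict mapping every node to its chosen representative.

    nodes: all accessions occurring in the edges (as in the --database file).
    edges: iterable of (query, target, query_cover, target_cover, edge_weight).
    """
    # covers[t] = set of queries that t is allowed to represent
    covers = {n: set() for n in nodes}
    for query, target, query_cover, _target_cover, _weight in edges:
        if query != target and query_cover >= member_cover:
            covers[target].add(query)

    unassigned = set(nodes)
    assignment = {}
    while unassigned:
        # Pick the node that can represent the most still-unassigned nodes;
        # ties are broken by accession to keep the result deterministic.
        best = min(unassigned, key=lambda n: (-len(covers[n] & unassigned), n))
        assignment[best] = best
        unassigned.discard(best)
        for query in covers[best] & unassigned:
            assignment[query] = best
        unassigned -= covers[best]
    return assignment


# Made-up example data.
nodes = ["a", "b", "c", "d"]
edges = [
    ("b", "a", 90.0, 90.0, 100.0),
    ("c", "a", 85.0, 40.0, 80.0),
    ("a", "c", 40.0, 85.0, 80.0),  # a is only 40% covered, so c cannot represent a
    ("d", "c", 95.0, 95.0, 50.0),
]
assignment = greedy_vertex_cover(nodes, edges)
centroids = sorted(set(assignment.values()))

# Print in the 2-column layout: representative, then member.
for member in sorted(assignment):
    print(assignment[member], member, sep="\t")
```

In this example `d` ends up as its own centroid because its only possible representative, `c`, was already assigned to `a` as a member.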

Alignment options

Many (but not all) options of the alignment workflow can also be used for the clustering workflows, e.g. --threads/-p, --evalue/-e.
