Clustering - bbuchfink/diamond GitHub Wiki

Diamond clusters protein sequences in a manner analogous to CD-HIT or UCLUST, based on a user-defined clustering criterion. It finds a set of centroid (representative) sequences and assigns each input sequence to the cluster of one representative such that the clustering criterion against that representative is fulfilled. The clustering criterion is defined by the sequence coverage of the local alignment as well as its sequence identity (see below). Note that, due to the heuristic nature of the cascaded clustering algorithm, these cutoff values serve to guide the computation, but their fulfillment is not always guaranteed unless the recluster workflow is used (see below).

Basic command line example:

diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
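
For scripted pipelines, the same invocation can be assembled as an argument list. The following is a minimal Python sketch, not part of DIAMOND itself: the helper name `build_cluster_command` is hypothetical, the file names are the placeholders from the command above, and only the options shown in that command are used. Actually running the command assumes a `diamond` executable on the PATH.

```python
import shlex


def build_cluster_command(database, output, approx_id=30, memory_limit="64G"):
    """Return the argument list for the basic `diamond cluster` call above.

    Hypothetical helper: it only assembles arguments and does not run DIAMOND.
    """
    return [
        "diamond", "cluster",
        "-d", str(database),
        "-o", str(output),
        "--approx-id", str(approx_id),
        "-M", str(memory_limit),
    ]


cmd = build_cluster_command("INPUT_FILE", "OUTPUT_FILE")
print(shlex.join(cmd))
# To execute (requires DIAMOND to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.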

When using the clustering feature, please cite:

  • Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373; doi: https://doi.org/10.1101/2023.01.24.525373

Cluster workflow

Cluster an input database of protein sequences.

  • --database/-d

    The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format.

  • --out/-o

    Output file. This is a 2-column tabular file with the representative accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.

  • --header

    Enable a header line in the output file.

  • --memory-limit/-M #

    Set a memory limit for the diamond process (for example: -M 64G). This is not a hard upper limit and may still be exceeded in certain cases; decrease the value if the tool fails due to running out of memory. Higher values substantially improve performance, so it is strongly recommended to always set this option, but they also increase the use of temporary disk space. Note that this option affects the algorithm and therefore the results, since clustering is a heuristic procedure with no unique solution.

  • --approx-id #

    The identity cutoff for the clustering (in %). Note that, for performance reasons, this setting refers to the approximate sequence identity derived from the bitscore by linear regression, not to the actual number of identities in the alignment. The default value is 50% when running diamond cluster and 0% when running diamond deepclust.

  • --member-cover #

    The minimum coverage of the cluster member sequence by the representative (in %). This is a unidirectional coverage, i.e. a minimum coverage of the representative is not required. The default is 80%.

  • --no-block-size-limit

    Do not limit the block size to recommended maximums.

  • --cluster-steps

    Set the sequence of clustering rounds for cascaded clustering as a space-separated list. Permitted keywords are the sensitivity switches of the alignment workflow (e.g. sensitive). The suffix _lin can be appended to a keyword to trigger linearization of the search (e.g. faster_lin fast default sensitive very-sensitive). When this option is omitted, the sequence of rounds is chosen automatically based on the --approx-id parameter.
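
The 2-column output described under --out/-o lists one representative/member pair per line, so downstream scripts usually need to group the members by representative first. The following is a minimal Python sketch under the assumption that the columns are tab-separated; the function name `read_clusters` and the accessions in the example data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, has_header=False):
    """Group member accessions by representative accession.

    Expects the 2-column tabular output: representative, then member.
    Set has_header=True if the file was written with --header.
    """
    clusters = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up example data standing in for an OUTPUT_FILE.
example = io.StringIO("repA\trepA\nrepA\tseq2\nrepB\trepB\nrepA\tseq3\n")
clusters = read_clusters(example)
for rep, members in sorted(clusters.items()):
    print(rep, len(members), ",".join(members))
```

For a real file, replace the `io.StringIO` object with `open("OUTPUT_FILE", newline="")`.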

realign workflow

Given a clustering computed by the cluster workflow as input, this workflow computes alignments of all sequences in the original database against their assigned representative sequences.

  • --clusters The clustering as 2-column tabular format.

  • --outfmt/-f Set the output format. Only tabular format is supported for this workflow. The default corresponds to the format -f 6 qseqid sseqid approx_pident qstart qend sstart send evalue bitscore of the alignment workflow, where the query and subject correspond to the representative and the cluster member sequence respectively.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --member-cover.
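
As an illustration of how the default realign output might be consumed, the following Python sketch parses the nine default columns listed above and reports the lowest approximate identity observed in each cluster. It assumes tab-separated columns in the default order; the function names and the example rows are hypothetical.

```python
import csv
import io

# Default columns of the realign output, as listed above.
REALIGN_COLUMNS = [
    "qseqid", "sseqid", "approx_pident", "qstart", "qend",
    "sstart", "send", "evalue", "bitscore",
]


def read_realign(handle, has_header=False):
    """Yield one dict per alignment; query = representative, subject = member."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line written with --header
    for row in reader:
        if len(row) != len(REALIGN_COLUMNS):
            continue  # skip blank lines or rows in a non-default format
        rec = dict(zip(REALIGN_COLUMNS, row))
        for key in ("qstart", "qend", "sstart", "send"):
            rec[key] = int(rec[key])
        for key in ("approx_pident", "evalue", "bitscore"):
            rec[key] = float(rec[key])
        yield rec


def min_identity_per_cluster(records):
    """Map each representative to the lowest approx_pident among its members."""
    lowest = {}
    for rec in records:
        rep = rec["qseqid"]
        if rep not in lowest or rec["approx_pident"] < lowest[rep]:
            lowest[rep] = rec["approx_pident"]
    return lowest


# Made-up rows in the default column order.
example = io.StringIO(
    "repA\tseq2\t87.5\t1\t120\t3\t118\t1e-50\t210.0\n"
    "repA\tseq3\t62.0\t5\t110\t1\t104\t1e-30\t140.5\n"
    "repB\tseq4\t91.3\t1\t200\t1\t200\t1e-90\t380.0\n"
)
lowest = min_identity_per_cluster(read_realign(example))
print(lowest)
```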

recluster workflow

Fixes errors in a given clustering where a cluster member sequence does not satisfy the clustering criterion against its representative. Such errors may arise from the heuristic nature of the cascaded clustering algorithm, which merges clusters based on alignments of their representative sequences.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
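
To make the notion of a member "not satisfying the clustering criterion" concrete, the following Python sketch checks a single member against the two cutoffs. It is an illustration only, not DIAMOND's implementation: it assumes 1-based, inclusive alignment coordinates on the member sequence and computes coverage as the aligned span divided by the member length. The function names are hypothetical, and the member length must be supplied separately (for example from the FASTA input). The default cutoffs mirror the documented defaults of diamond cluster.

```python
def member_coverage(sstart, send, member_length):
    """Percentage of the member sequence covered by the local alignment.

    Assumes 1-based, inclusive alignment coordinates on the member.
    """
    return 100.0 * (send - sstart + 1) / member_length


def satisfies_criterion(approx_pident, sstart, send, member_length,
                        approx_id=50.0, member_cover=80.0):
    """Return True if both the identity and the member coverage cutoffs hold."""
    return (approx_pident >= approx_id
            and member_coverage(sstart, send, member_length) >= member_cover)


# Made-up members of length 200 aligned to their representative.
ok = satisfies_criterion(72.4, sstart=11, send=190, member_length=200)
too_short = satisfies_criterion(72.4, sstart=1, send=140, member_length=200)
print(ok, too_short)
```

In the first call the alignment spans 180 of 200 residues (90% coverage), so the member passes; in the second it spans 140 residues (70%), which is below the 80% default.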

reassign workflow

For a given clustering, attempts to reassign every non-representative sequence to the closest representative sequence that satisfies the clustering criterion, where closeness is measured by the e-value of the local alignment.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
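
The selection rule described above can be sketched in a few lines of Python. This is a conceptual illustration, not DIAMOND's implementation: it assumes the candidate alignments between members and representatives have already been computed, and the function name `reassign`, the tuple layout, and the example values are all hypothetical.

```python
def reassign(candidates):
    """Pick, for each member, the qualifying representative with the lowest e-value.

    candidates: iterable of (member, representative, evalue, meets_criterion).
    Candidates that do not meet the clustering criterion are ignored.
    """
    best = {}
    for member, representative, evalue, meets_criterion in candidates:
        if not meets_criterion:
            continue
        if member not in best or evalue < best[member][1]:
            best[member] = (representative, evalue)
    return {member: rep for member, (rep, _) in best.items()}


# Made-up candidate alignments.
candidates = [
    ("seq2", "repA", 1e-20, True),
    ("seq2", "repB", 1e-35, True),   # lower e-value, so seq2 moves to repB
    ("seq3", "repA", 1e-10, True),
    ("seq3", "repB", 1e-40, False),  # better e-value but fails the criterion
]
assignment = reassign(candidates)
print(assignment)
```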

greedy-vertex-cover workflow

Computes a greedy vertex cover clustering based on alignment input.

  • --edges Input file containing alignments/graph edges for clustering. By default, a TSV file with 5 columns is expected: query target query-cover target-cover edge-weight.

  • --database/-d A TSV file whose first column must list all accessions that occur in the edges file as either query or target. This must not be a sequence database file.

  • --edge-format (triplet) Enable the triplet edge format: query target edge-weight. The semantics are a unidirectional representation of the query by the target.

  • --centroid-out Output file for representative list.

  • --out/-o The output clustering in 2-column TSV format. This file does not group clusters together.
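
Because the file passed to --database/-d must list every accession that occurs in the edges file, it can be derived from the edges file itself. The following Python sketch shows one way to do this, assuming a tab-separated edges file without a header line; the function names and example rows are hypothetical. Sequences that have no edges at all do not appear in the edges file and would have to be added to the list separately if they are needed.

```python
import csv
import io


def accessions_from_edges(edge_handle):
    """Collect every accession seen as query or target, in first-seen order."""
    seen = {}
    for row in csv.reader(edge_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        seen.setdefault(row[0], None)  # query
        seen.setdefault(row[1], None)  # target
    return list(seen)


def write_accession_list(accessions, out_handle):
    """Write one accession per line, i.e. a TSV with a single column."""
    for accession in accessions:
        out_handle.write(accession + "\n")


# Made-up edges in the default 5-column layout:
# query, target, query-cover, target-cover, edge-weight.
edges = io.StringIO(
    "seq1\tseq2\t95.0\t90.0\t300\n"
    "seq3\tseq2\t85.0\t80.0\t250\n"
    "seq1\tseq4\t82.0\t99.0\t120\n"
)
accessions = accessions_from_edges(edges)
out = io.StringIO()
write_accession_list(accessions, out)
print(out.getvalue(), end="")
```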

These parameters of the cluster workflow apply accordingly: --header, --member-cover.
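
The general idea of a greedy vertex cover clustering can be sketched as follows. This is a generic textbook-style heuristic written for illustration, not DIAMOND's implementation: it takes edges in the triplet form (query, target, edge-weight), reads each edge as "the target can represent the query", ignores the edge weight, and repeatedly picks the unassigned sequence that can represent the most other unassigned sequences. The function name, the tie-breaking by accession, and the example graph are assumptions made for this sketch, and the simple loop is quadratic in the number of sequences.

```python
def greedy_vertex_cover(vertices, edges):
    """Return a dict mapping every vertex to its chosen representative.

    edges: iterable of (query, target, weight); target can represent query.
    """
    represents = {v: set() for v in vertices}
    for query, target, _weight in edges:
        if query != target:
            represents[target].add(query)

    unassigned = set(vertices)
    assignment = {}
    while unassigned:
        # Choose the vertex covering the most unassigned vertices;
        # iterating in sorted order breaks ties by accession.
        best = max(sorted(unassigned),
                   key=lambda v: len(represents[v] & unassigned))
        members = (represents[best] & unassigned) | {best}
        for member in members:
            assignment[member] = best
        unassigned -= members
    return assignment


# Made-up graph: a can represent b and c, c can represent d, d can represent e.
vertices = ["a", "b", "c", "d", "e"]
edges = [("b", "a", 1.0), ("c", "a", 1.0), ("d", "c", 1.0), ("e", "d", 1.0)]
assignment = greedy_vertex_cover(vertices, edges)
for member in sorted(assignment):
    print(assignment[member], member, sep="\t")
```

In this example, a is chosen first because it covers two other sequences; d is then chosen to cover e, which yields the clusters {a, b, c} and {d, e}.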

Alignment options

Many (but not all) options of the alignment workflow can also be used for the clustering workflows, e.g. --threads/-p, --evalue/-e.