Clustering sequences by identity - labbces/sugarcane_RNAome GitHub Wiki

Clustering sequences by identity

To cluster the 8,392,174 putative lncRNAs identified previously by CPC2, PLncPRO and RNAplonc, we used MMseqs2. MMseqs2's clustering module is highly efficient at grouping similar sequences into clusters and has a faster execution time than other tools, such as CD-HIT.

Clustering with MMseqs2 takes three steps. First, mmseqs createdb converts the input sequences into an MMseqs2 database. Next, mmseqs cluster clusters that database. Finally, mmseqs createtsv converts the clustering result into a TSV file.
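The three steps can be sketched as a small shell wrapper. This is only an illustration: the names lncRNAs.fasta and tmp are placeholders, and the run and cluster_workflow helpers and the DRY_RUN variable are hypothetical conveniences that are not part of MMseqs2. The argument order for createtsv (query database, target database, result database, output file) follows our reading of the MMseqs2 user guide.

```shell
# Print a command, and execute it unless DRY_RUN=1 (hypothetical helper).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then "$@"; fi
}

# Usage: cluster_workflow <input.fasta> <db-name> <tmp-dir> [cluster options...]
cluster_workflow() {
    fasta=$1; db=$2; tmp=$3
    shift 3
    # 1. Build the sequence database from the FASTA file.
    run mmseqs createdb "$fasta" "$db" &&
    # 2. Cluster the database; extra options are passed through unchanged.
    run mmseqs cluster "$db" "${db}_clust" "$tmp" "$@" &&
    # 3. Export the result as a representative/member TSV.
    run mmseqs createtsv "$db" "$db" "${db}_clust" "${db}_clust.tsv"
}

# Dry run: only prints the three commands, so mmseqs need not be installed.
DRY_RUN=1 cluster_workflow lncRNAs.fasta DB tmp --min-seq-id 0.8 -c 0.8
```

Dropping DRY_RUN=1 executes the commands instead of only printing them.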

You can adjust the sequence identity threshold with --min-seq-id, the alignment coverage threshold with -c, and how that coverage is measured (over the query, the target, or both) with --cov-mode. The -s parameter sets the sensitivity of the prefiltering step, and --cluster-mode selects the clustering algorithm.
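To make the coverage threshold concrete, here is a simplified arithmetic sketch. It assumes that coverage is the fraction of the query spanned by the alignment, which is our understanding of --cov-mode 2. The positions and length are made up, and the two helper functions are hypothetical and not part of MMseqs2.

```shell
# Fraction of the query spanned by an alignment.
# Arguments: start, end (1-based, inclusive) and query length.
query_coverage() {
    awk -v s="$1" -v e="$2" -v len="$3" \
        'BEGIN { printf "%.2f\n", (e - s + 1) / len }'
}

# Succeeds (exit status 0) when the coverage reaches the minimum.
passes_threshold() {
    awk -v cov="$1" -v min="$2" 'BEGIN { exit !(cov >= min) }'
}

# Made-up alignment: positions 51..950 of a 1000 nt query, i.e. 900 nt.
cov=$(query_coverage 51 950 1000)
echo "coverage: $cov"
if passes_threshold "$cov" 0.8; then echo "kept"; else echo "rejected"; fi
```

With these numbers the coverage is 0.90, so the pair would satisfy a -c 0.8 threshold; whether it ends up in the same cluster also depends on --min-seq-id.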

We performed the clustering with the following parameters using this script.

mmseqs cluster --threads 128 -s 5.7 --cov-mode 2 --cluster-mode 2 -c 0.8 --min-seq-id 0.8 DB DB_clust /work/fvperes

Read more about clustering databases using mmseqs cluster.

The DB_clust.tsv file has the following format:

#cluster-representative 	cluster-member
Q0KJ32	Q0KJ32
Q0KJ32	C0W539
Q0KJ32	D6KVP9
Q0KJ32	D1Y890
E3HQM9	E3HQM9
E3HQM9	F0YHT8

Each cluster occupies a consecutive block of lines, one line per member. The first column always contains the cluster's representative sequence, and the second column contains a member of that cluster. The representative is also listed as a member of its own cluster, so the cluster represented by Q0KJ32 contains four members: Q0KJ32, C0W539, D6KVP9, and D1Y890. The IDs are parsed from the headers of the input database.
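As a quick check of this layout, the cluster sizes can be tabulated with awk. The sketch below recreates the example rows in a temporary file; the cluster_sizes function name is our own, and lines starting with # are treated as comments.

```shell
# Recreate the example rows in a temporary tab-separated file.
example_tsv=$(mktemp)
printf '%s\t%s\n' \
    Q0KJ32 Q0KJ32 \
    Q0KJ32 C0W539 \
    Q0KJ32 D6KVP9 \
    Q0KJ32 D1Y890 \
    E3HQM9 E3HQM9 \
    E3HQM9 F0YHT8 > "$example_tsv"

# Count rows per representative (column 1), skipping '#' comment lines.
cluster_sizes() {
    awk -F'\t' '!/^#/ { n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" | sort
}

cluster_sizes "$example_tsv"
# E3HQM9	2
# Q0KJ32	4
```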

Clustering results (preliminary)

MMseqs2 grouped the 8,392,174 input sequences into 3,407,188 clusters. The number of clusters is the number of unique representatives in the first column of DB_clust.tsv:

cut -f1 DB_clust.tsv | sort -u | wc -l
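A single awk pass can report the number of clusters together with the total number of members and the number of singleton clusters. This is a sketch: the cluster_summary function and the demo file are hypothetical. It relies on every input sequence appearing exactly once in the second column, so the member count should equal the number of input sequences.

```shell
# Summarise a representative/member TSV: clusters, members, singletons.
cluster_summary() {
    awk -F'\t' '
        !/^#/ { size[$1]++; total++ }
        END {
            for (r in size) { clusters++; if (size[r] == 1) singletons++ }
            printf "clusters=%d members=%d singletons=%d\n", clusters, total, singletons
        }' "$1"
}

# Made-up input: one cluster of two sequences and one singleton.
demo=$(mktemp)
printf '%s\t%s\n' repA repA repA seqB repC repC > "$demo"
cluster_summary "$demo"
# clusters=2 members=3 singletons=1
```

Run on the real file as cluster_summary DB_clust.tsv.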