Clustering sequences by identity - labbces/sugarcane_RNAome GitHub Wiki
Clustering sequences by identity
To cluster the 8,392,174 putative lncRNAs previously identified by CPC2, PLncPRO, and RNAplonc, we used MMseqs2. Its clustering module groups similar sequences efficiently and runs faster than other tools, such as CD-HIT.
Clustering with MMseqs2 takes three steps. First, create a sequence database with mmseqs createdb. Then cluster that database with mmseqs cluster. Finally, run mmseqs createtsv to export the clustering result as a TSV file.
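The three steps above can be sketched as a small shell function. This is only an illustration, not the script used in this project: the function name `cluster_by_identity` and its argument order are assumptions, and any extra arguments are passed unchanged to `mmseqs cluster`.

```shell
# Hypothetical helper (not part of MMseqs2): run the three steps for one FASTA file.
# Usage: cluster_by_identity <input.fasta> <db_prefix> <tmp_dir> [mmseqs cluster options...]
cluster_by_identity() {
    fasta=$1
    db=$2
    tmp=$3
    shift 3

    # 1. Convert the FASTA file into an MMseqs2 sequence database.
    mmseqs createdb "$fasta" "$db" || return 1

    # 2. Cluster the database; the remaining arguments go to mmseqs cluster.
    mmseqs cluster "$db" "${db}_clust" "$tmp" "$@" || return 1

    # 3. Export the clustering as a two-column TSV (representative, member).
    mmseqs createtsv "$db" "$db" "${db}_clust" "${db}_clust.tsv"
}
```

For example, `cluster_by_identity lncRNAs.fasta DB /work/fvperes --min-seq-id 0.8 -c 0.8` would write `DB_clust.tsv`; here `lncRNAs.fasta` is a placeholder file name.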
You can adjust the sequence identity threshold with --min-seq-id, the minimum alignment coverage with -c, the coverage mode (which sequence the coverage threshold applies to) with --cov-mode, and the prefilter sensitivity with -s.
We performed the clustering with the following parameters using this script.
mmseqs cluster --threads 128 -s 5.7 --cov-mode 2 --cluster-mode 2 -c 0.8 --min-seq-id 0.8 DB DB_clust /work/fvperes
Read more about clustering databases using mmseqs cluster.
The DB_clust.tsv file has the following format:
#cluster-representative cluster-member
Q0KJ32 Q0KJ32
Q0KJ32 C0W539
Q0KJ32 D6KVP9
Q0KJ32 D1Y890
E3HQM9 E3HQM9
E3HQM9 F0YHT8
Each cluster in the file is a consecutive block of lines, one line per member. The first column always contains the representative sequence, and the second column contains a cluster member. The representative is itself listed as a member, so the cluster represented by Q0KJ32 contains four members: Q0KJ32, C0W539, D6KVP9, and D1Y890. The IDs are parsed from the headers of the input database.
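To make the format concrete, the sketch below counts the members of each cluster in a file with this layout. It assumes tab-separated columns and skips any line starting with `#`; the helper name `cluster_sizes` and the file `example_clust.tsv` are made up for the illustration, and the rows are the ones from the example above.

```shell
# Example rows copied from the format shown above (tab-separated).
printf 'Q0KJ32\tQ0KJ32\nQ0KJ32\tC0W539\nQ0KJ32\tD6KVP9\nQ0KJ32\tD1Y890\nE3HQM9\tE3HQM9\nE3HQM9\tF0YHT8\n' > example_clust.tsv

# Hypothetical helper: print "representative<TAB>member count", largest cluster first.
cluster_sizes() {
    # Count how often each representative appears in column 1, ignoring comment lines.
    awk -F '\t' '!/^#/ { n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}

cluster_sizes example_clust.tsv
```

On the example rows this reports 4 members for Q0KJ32 and 2 for E3HQM9, since each representative is counted as a member of its own cluster.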
Clustering results (preliminary)
MMseqs2 grouped the 8,392,174 input sequences into 3,407,188 clusters. This is the number of distinct representatives in the first column of DB_clust.tsv:
cut -f1 DB_clust.tsv | sort -u | wc -l
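The totals above correspond to an average of about 2.46 sequences per cluster (8,392,174 / 3,407,188). A slightly richer summary can be computed from the same TSV in one pass. The following is a sketch under the same assumptions as before (two tab-separated columns, lines starting with `#` ignored); the helper name `cluster_summary` and the toy IDs are hypothetical.

```shell
# Toy input with hypothetical IDs: cluster A has two members, cluster C is a singleton.
printf 'A\tA\nA\tB\nC\tC\n' > toy_clust.tsv

# Hypothetical helper: report cluster count, member count, singletons and mean size.
cluster_summary() {
    awk -F '\t' '
        !/^#/ { size[$1]++; members++ }
        END {
            for (r in size) {
                clusters++
                if (size[r] == 1) singletons++
            }
            printf "clusters\t%d\n", clusters
            printf "members\t%d\n", members
            printf "singletons\t%d\n", singletons
            if (clusters > 0) printf "mean_size\t%.2f\n", members / clusters
        }' "$1"
}

cluster_summary toy_clust.tsv
```

Unlike the `sort -u` count, this holds one counter per representative in memory, which may matter for a file with millions of clusters.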