Clustering sequences by identity - labbces/sugarcane_RNAome GitHub Wiki
Clustering sequences by identity
To cluster the 8,392,174 putative lncRNAs identified previously by CPC2, PLncPRO and RNAplonc, we used MMseqs2. Its clustering module groups similar sequences efficiently and runs faster than comparable tools such as CD-HIT.
The workflow has three steps. First, a sequence database is created from the input sequences with mmseqs createdb. Next, the database is clustered with mmseqs cluster. Finally, mmseqs createtsv converts the clustering result into a TSV-formatted output file.
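The three steps above can be sketched as a small shell function. This is only an illustration: the input file name, the temporary directory, the function name cluster_pipeline and the MMSEQS override variable are assumptions, and the clustering step is shown here without the tuning options used on this page. Only the DB and DB_clust names come from this page.

```shell
# Sketch of the createdb -> cluster -> createtsv workflow.
# MMSEQS is a hypothetical override: set MMSEQS=echo for a dry run.
MMSEQS="${MMSEQS:-mmseqs}"

cluster_pipeline() {
    fasta="$1"    # input FASTA file (assumed name, e.g. lncRNAs.fasta)
    tmpdir="$2"   # scratch directory for MMseqs2 temporary files

    # 1. Convert the FASTA file into an MMseqs2 sequence database
    "$MMSEQS" createdb "$fasta" DB || return 1
    # 2. Cluster the database (no tuning options in this sketch)
    "$MMSEQS" cluster DB DB_clust "$tmpdir" || return 1
    # 3. Export the clusters as a representative/member TSV file
    "$MMSEQS" createtsv DB DB DB_clust DB_clust.tsv || return 1
}
```

It could be called as `cluster_pipeline lncRNAs.fasta tmp`, where both arguments are placeholders for your own paths.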
[!NOTE] You can adjust the sequence identity threshold with --min-seq-id and the alignment coverage with -c and --cov-mode (which controls how the sequence length overlap, or "coverage", is computed), as well as the sensitivity parameter -s used for prefiltering.
We performed the clustering with the following parameters using this script.
mmseqs cluster --threads 128 -s 5.7 --cov-mode 2 --cluster-mode 2 -c 0.8 --min-seq-id 0.8 DB DB_clust /work/fvperes
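For readability, the same command can be written with one option per line so that each can be annotated. The comments reflect our reading of the MMseqs2 help text and should be checked against `mmseqs cluster -h` for the installed version. The function name run_cluster and the MMSEQS override variable are illustrative, not part of the original script.

```shell
# MMSEQS is a hypothetical override: set MMSEQS=echo to print the command instead of running it.
MMSEQS="${MMSEQS:-mmseqs}"

run_cluster() {
    set -- --threads 128            # number of CPU threads
    set -- "$@" -s 5.7              # prefilter sensitivity
    set -- "$@" --cov-mode 2        # coverage is measured on the query sequence
    set -- "$@" --cluster-mode 2    # greedy incremental clustering (CD-HIT-like)
    set -- "$@" -c 0.8              # minimum alignment coverage (80%)
    set -- "$@" --min-seq-id 0.8    # minimum sequence identity (80%)
    # DB = input database, DB_clust = result database, last argument = temporary directory
    "$MMSEQS" cluster "$@" DB DB_clust /work/fvperes
}
```

Building the argument list with `set --` keeps the sketch POSIX-compatible while still allowing a comment next to each option.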
[!NOTE] Read more about clustering databases using mmseqs cluster.
The DB_clust.tsv file has the following format:
#cluster-representative 	cluster-member
Q0KJ32	Q0KJ32
Q0KJ32	C0W539
Q0KJ32	D6KVP9
Q0KJ32	D1Y890
E3HQM9	E3HQM9
E3HQM9	F0YHT8
Each cluster in the file is represented by a consecutive block, with all its members listed line by line. The first column always contains the ID of the representative sequence, while the second column contains the ID of a cluster member. The representative is also listed as a member of its own cluster. For example, the cluster represented by the sequence Q0KJ32 contains four members: Q0KJ32, C0W539, D6KVP9, and D1Y890. The IDs are parsed from the headers of the input database.
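Because each row pairs a representative with one member, cluster sizes can be obtained by counting rows per representative. The following is a minimal sketch, assuming a tab-separated file in which any header line starts with `#`; the function name cluster_sizes is illustrative.

```shell
cluster_sizes() {
    # Print "<representative><TAB><member count>" for each cluster,
    # in the order in which representatives first appear in the file.
    awk -F '\t' '
        /^#/ { next }                     # skip header/comment lines
        !($1 in n) { order[++k] = $1 }    # remember first-seen order
        { n[$1]++ }                       # one row = one cluster member
        END { for (i = 1; i <= k; i++) printf "%s\t%d\n", order[i], n[order[i]] }
    ' "$1"
}
```

Applied to the example above, `cluster_sizes DB_clust.tsv` reports 4 members for Q0KJ32 and 2 members for E3HQM9.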
Clustering results
MMseqs2 grouped the 8,392,174 input sequences into 3,407,188 clusters. This count is the number of unique representatives in the first column of DB_clust.tsv:
cut -f1 DB_clust.tsv | sort -u | wc -l

[!NOTE] All 8,392,174 transcripts classified as ncRNA were grouped by sequence identity into 3,407,188 groups, referred to as genes. Of these transcripts, 2,572,290 (30.65%) showed evidence of being translatable into protein according to TransDecoder, which was used in the previous study of the protein-coding gene pan-transcriptome. Consequently, the grouping produced three types of genes: (A) genes composed exclusively of transcripts identified as lncRNAs, totaling 827,940; (B) genes composed of ncRNA or lncRNA transcripts, totaling 1,341,162; and (C) genes composed of ncRNA, lncRNA, or protein-coding transcripts, totaling 1,238,086.
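The figures in this note can be cross-checked with a few lines of shell arithmetic. This sketch only restates the numbers reported on this page; the variable names are illustrative.

```shell
# Counts reported on this page
total_transcripts=8392174
total_genes=3407188
genes_a=827940     # (A) lncRNA transcripts only
genes_b=1341162    # (B) ncRNA or lncRNA transcripts
genes_c=1238086    # (C) ncRNA, lncRNA, or protein-coding transcripts
coding=2572290     # transcripts flagged as translatable by TransDecoder

# The three gene types should add up to the total number of clusters
sum=$((genes_a + genes_b + genes_c))
# Share of transcripts flagged as translatable, as a percentage with two decimals
pct="$(awk -v c="$coding" -v t="$total_transcripts" 'BEGIN { printf "%.2f", 100 * c / t }')"

echo "A + B + C = $sum (reported total: $total_genes)"
echo "translatable fraction = $pct%"
```

The sum of the three gene types equals the 3,407,188 clusters, and the computed fraction agrees with the reported 30.65%.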