Aggregating the consensus of putative ncRNAs predicted by the 3 tools - labbces/sugarcane_RNAome GitHub Wiki

Combining the consensus of putative ncRNAs predicted by the 3 tools

I developed this Python script to determine the consensus among the ncRNAs predicted by CPC2, PLncPRO, and RNAplonc. From a pool of 16,268,762 initial sequences, these tools identified 11,178,089, 8,952,956, and 9,894,831 putative ncRNAs, respectively. The script is executed by submit_findConsensus.sh.

Because of the large number of sequences, list-based operations would be too slow for finding the consensus: each membership test on a list scans the list, whereas a lookup in a set takes constant time on average. I therefore used Python sets. The script converts the lists of ncRNAs identified by CPC2, PLncPRO, and RNAplonc into sets and then takes the intersection of the three sets. Since sets are unordered, I kept the consensus identifiers in the order in which they appeared in the list and sequences of putative ncRNAs identified by CPC2. Preserving this order was essential for speeding up the extraction of sequences in the next step.
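
As an illustration of the idea, and not the actual script, a minimal sketch of an order-preserving set intersection could look like the following. The function name find_consensus and the identifiers in the usage comment are hypothetical, and the sketch assumes each tool's predictions have already been read into a Python list of sequence identifiers.

```python
def find_consensus(cpc2_ids, plncpro_ids, rnaplonc_ids):
    """Return the identifiers predicted by all three tools, in CPC2 order.

    Hypothetical helper: each argument is assumed to be a list of
    sequence identifiers already parsed from one tool's output.
    """
    # Intersect the three sets; the result itself is unordered.
    consensus = set(cpc2_ids) & set(plncpro_ids) & set(rnaplonc_ids)
    # Walk the CPC2 list once so the output keeps the CPC2 order.
    # Each "in consensus" test is a constant-time set lookup on average.
    return [seq_id for seq_id in cpc2_ids if seq_id in consensus]


# Example with made-up identifiers:
# find_consensus(["t1", "t2", "t3", "t4"], ["t4", "t2", "t9"], ["t2", "t4", "t1"])
```

Walking the CPC2 list once and testing each identifier against the intersected set keeps the work roughly linear in the number of identifiers, whereas testing against plain lists would rescan them for every identifier.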

Extracting the consensus sequences of putative ncRNAs

With the list of consensus putative ncRNAs in hand, I then developed this script to extract the consensus sequences from the 11,178,089 sequences identified by CPC2. Again, the script uses a Python set to speed up checking whether each identifier in the FASTA file of 11,178,089 sequences is among the consensus identifiers. This script is executed by submit_extractConsensus.sh.
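
The extraction step can be sketched along the same lines. This is only an illustrative version under stated assumptions, not the actual script: the function name extract_consensus and the input file name in the usage comment are hypothetical, and the identifier is assumed to be the first whitespace-delimited token after ">" in each FASTA header.

```python
def extract_consensus(fasta_lines, consensus_ids):
    """Yield the FASTA lines of records whose identifier is in consensus_ids.

    Hypothetical helper: fasta_lines is any iterable of lines (for example
    an open file handle), so the FASTA file is streamed, not loaded whole.
    """
    wanted = set(consensus_ids)  # set lookups are constant time on average
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Possible usage (the input file name here is an assumption):
# with open("cpc2_ncRNAs.fa") as fin, open("putative_ncRNA_consensus.fa", "w") as fout:
#     fout.writelines(extract_consensus(fin, consensus_ids))
```

Because the file is read line by line and only the identifier set is held in memory, the memory footprint stays small even for millions of records.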

The final output of this pipeline is the putative_ncRNA_consensus.fa file, containing the 8,392,174 putative ncRNAs identified by all three tools (CPC2, PLncPRO, and RNAplonc).