Predicting putative lncRNAs with RNAplonc - labbces/sugarcane_RNAome GitHub Wiki

Predicting non-coding sequences with RNAplonc

RNAplonc is a classifier designed to detect lncRNAs in plants from mRNA-based data. It was developed and trained on lncRNA and mRNA data from five plant species (Arabidopsis thaliana, Cucumis sativus, Glycine max, Populus trichocarpa and Oryza sativa). The model uses 16 features selected from a pool of more than 5,000 candidates and is built on the REPTree algorithm.

[!NOTE] Running RNAplonc takes at least six steps, some of which are optional. I developed a Snakemake pipeline that automates all of them. To speed up execution, the pipeline first splits each dataset into 10 parts before running the RNAplonc steps, and automatically concatenates all the parts at the end. The Snakemake pipeline was executed with this bash script.
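The split-and-concatenate idea from the note can be sketched in plain Python. This is only an illustration, not the code used by the Snakemake pipeline: the function names (`read_fasta`, `split_records`, `concatenate`) and the round-robin assignment of records to parts are assumptions made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts=10):
    """Distribute records round-robin into n_parts lists (hypothetical strategy)."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts


def concatenate(parts):
    """Merge the per-part results back into a single list."""
    return [record for part in parts for record in part]


# Toy input; a real run would read the transcript FASTA file instead.
fasta = [">t1", "ACGT", ">t2", "GG", "CC", ">t3", "TTTT"]
records = list(read_fasta(fasta))
parts = split_records(records, n_parts=2)
merged = concatenate(parts)
```

With a round-robin split the merged output contains the same records as the input, although not necessarily in the original order.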

Please refer to the RNAplonc User manual for detailed information about each step.

The following directed acyclic graph (DAG) shows the complete Snakemake pipeline, which runs RNAplonc and extracts the putative non-coding RNAs.

[Figure: RNAplonc_SnakefileDAG]

Of the 11,178,089 transcripts classified as non-coding by CPC2, RNAplonc classified 9,894,831 (88.52%) as long non-coding.
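As a quick sanity check, the percentage can be recomputed from the two counts reported above. The variable names are chosen for this example only.

```python
# Counts taken from the text above.
cpc2_noncoding = 11_178_089   # transcripts classified as non-coding by CPC2
rnaplonc_lncrna = 9_894_831   # of those, classified as lncRNA by RNAplonc

fraction = rnaplonc_lncrna / cpc2_noncoding
not_lncrna = cpc2_noncoding - rnaplonc_lncrna

print(f"{fraction:.2%}")   # prints 88.52%
print(not_lncrna)          # prints 1283258
```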

Extracting RNAplonc non-coding sequences

RNAplonc classifies each sequence as coding or lncRNA and returns the identifier and label for each sequence. To obtain the sequences classified as lncRNA, I developed this simple Python script, which is the last step executed by the automated pipeline described above (rule extract_ncrnas).
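The extraction step can be sketched as follows. This is not the script linked above, only a minimal illustration under stated assumptions: the classification result is assumed to be a tab-separated file with one `identifier<TAB>label` pair per line, the label string `lncRNA` is hypothetical, and the sequence identifier is assumed to be the first word of each FASTA header. Check the actual RNAplonc output format before reusing it.

```python
def load_lncrna_ids(label_lines, lncrna_label="lncRNA"):
    """Collect identifiers whose label equals lncrna_label.

    Assumed (hypothetical) input format: identifier<TAB>label per line.
    """
    ids = set()
    for line in label_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip() == lncrna_label:
            ids.add(fields[0].strip())
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield the FASTA lines of records whose identifier is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The identifier is taken as the first word after '>'.
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            yield line.rstrip("\n")


# Toy data; a real run would read both inputs from files.
labels = ["t1\tlncRNA", "t2\tcoding", "t3\tlncRNA"]
fasta = [">t1 desc", "ACGT", ">t2", "GGCC", ">t3", "TT", "AA"]
selected = list(filter_fasta(fasta, load_lncrna_ids(labels)))
```

Keeping the identifiers in a set makes each header lookup constant time, which matters when filtering millions of transcripts.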