Predicting putative lncRNAs with PLncPRO - labbces/sugarcane_RNAome GitHub Wiki
Predicting non-coding sequences with PLncPRO
PLncPRO is a machine-learning tool that uses the random forest algorithm: it builds a training model from 71 features (10 of them derived from BLASTx output) and uses it to classify transcripts as coding or long non-coding. See the PLncPRO User manual for details.
PLncPRO requires some features inferred by BLASTx: qseqid, sseqid, pident, evalue, nident, qcovhsp, score, bitscore, qframe and qstrand. These features were inferred with DIAMONDx (way faster than BLAST) using the following script: submitDiamondx.sh
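To make the ten BLASTx-derived fields concrete, here is a minimal Python sketch that reads a tabular alignment file into one dictionary per hit. It assumes the DIAMONDx output is tab-separated with the columns in the order listed above; the helper name `read_hits` and the two example rows are hypothetical and are not taken from this project's data or from submitDiamondx.sh.

```python
import csv
import io

# Column order taken from the field list above.
# Assumption: the alignment file is tab-separated with exactly these columns.
DIAMOND_FIELDS = [
    "qseqid", "sseqid", "pident", "evalue", "nident",
    "qcovhsp", "score", "bitscore", "qframe", "qstrand",
]


def read_hits(handle):
    """Yield one dict per alignment line, keyed by DIAMOND_FIELDS."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(DIAMOND_FIELDS):
            raise ValueError(
                f"expected {len(DIAMOND_FIELDS)} columns, got {len(row)}"
            )
        yield dict(zip(DIAMOND_FIELDS, row))


# Hypothetical example rows, for illustration only.
example = io.StringIO(
    "tx1\tsp|P12345|X\t87.5\t1e-30\t70\t65\t350\t140.2\t2\t+\n"
    "tx2\tsp|Q67890|Y\t45.0\t2e-05\t18\t20\t90\t39.7\t-1\t-\n"
)
hits = list(read_hits(example))
print(hits[0]["qseqid"], hits[0]["bitscore"])
```

A check like the column-count test is a cheap way to catch an alignment run that was produced with a different field list before handing the file to PLncPRO.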
[!NOTE] We also have to choose a suitable model to predict lncRNAs. PLncPRO reads an input file containing sequences, classifies each sequence as coding or non-coding, and outputs a file containing the class label and class probabilities for each sequence. The monocot model for lncRNA prediction was built with the build.py script from PLncPRO, using files provided by the authors that contain protein-coding transcripts (monocot_pct_train.fa) and lncRNA sequences (monocot_lnct_train.fa) from Oryza sativa and Zea mays. The monocot_model used in this project was generated with the following command:
plncpro build -p plncpro_data/plant_new_fasta/monocot/train/monocot_pct_train.fa -n plncpro_data/plant_new_fasta/monocot/train/monocot_lnct_train.fa -o monocot_model -m monocot.model -d /Storage/data1/felipe.peres/swissprot/uniprot/uniprotdb -t 1
Having both the monocot_model and the DIAMONDx results at my disposal, along with the transcripts identified as non-coding by CPC2 in the preceding step, I ran this script to predict lncRNAs with PLncPRO (it runs the plncpro predict function).
Of the 11,178,089 transcripts classified as non-coding by CPC2, PLncPRO classified 8,952,956 (80.09%) as long non-coding.
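The percentage can be checked directly from the two counts reported above. This short Python snippet is only an arithmetic check; the variable names are chosen for illustration.

```python
cpc2_noncoding = 11_178_089  # transcripts classified as non-coding by CPC2
plncpro_lnc = 8_952_956      # of those, classified as long non-coding by PLncPRO

fraction = plncpro_lnc / cpc2_noncoding
remaining = cpc2_noncoding - plncpro_lnc  # not classified as long non-coding

print(f"{fraction:.2%}")  # 80.09%
print(remaining)          # 2225133
```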
Extracting PLncPRO non-coding sequences
PLncPRO has a script that reads a prediction output file and extracts the sequences of a given class (mRNA or lncRNA). The user can specify the class and a probability cut-off to extract the desired transcript sequences. I wrote this script, which runs plncpro predtoseq and extracts the lncRNA sequences from the PLncPRO output file.
Note: to extract predicted lncRNAs, set the label parameter to 0 (-l 0).
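The extraction step described above (select sequences by class label and probability cut-off, then pull them from the FASTA file) can be sketched in plain Python. This is not how plncpro predtoseq is implemented. It assumes a hypothetical tab-separated prediction layout of sequence ID, class label, coding probability and non-coding probability, and it assumes label 0 marks lncRNAs, following the note above. The function names and example records are invented for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def select_ids(pred_handle, label="0", min_prob=0.5):
    """Return IDs with the given label and a class probability >= min_prob.

    Assumed (hypothetical) layout per line:
    id <TAB> label <TAB> prob_coding <TAB> prob_noncoding
    """
    keep = set()
    for line in pred_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        seq_id, seq_label, p_coding, p_noncoding = fields[:4]
        try:
            prob = float(p_noncoding) if label == "0" else float(p_coding)
        except ValueError:
            continue  # skip a header line, if present
        if seq_label == label and prob >= min_prob:
            keep.add(seq_id)
    return keep


# Hypothetical example records, for illustration only.
predictions = io.StringIO(
    "tx1\t1\t0.91\t0.09\n"
    "tx2\t0\t0.12\t0.88\n"
    "tx3\t0\t0.45\t0.55\n"
)
fasta = io.StringIO(">tx1 desc\nATGGCC\n>tx2\nACGT\nACGT\n>tx3\nGGGTTT\n")

wanted = select_ids(predictions, label="0", min_prob=0.8)
lncrnas = {sid: seq for sid, seq in parse_fasta(fasta) if sid in wanted}
print(lncrnas)
```

With the cut-off at 0.8, only tx2 is kept in this toy example: tx3 carries the lncRNA label but falls below the threshold, which shows how the probability cut-off narrows the set beyond the class label alone.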