Annotation of ncRNAs families - labbces/sugarcane_RNAome GitHub Wiki

Annotation of ncRNAs families

We employed Infernal in combination with the Rfam database to predict and annotate ncRNA families.

Rfam is a comprehensive database that encompasses multiple sequence alignments, consensus secondary structures, and covariance models (CMs) representing various classes of ncRNAs, including microRNAs, small nuclear RNAs (snRNAs), ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and other functional RNA molecules. These families are curated and classified based on experimental evidence and computational predictions.

Infernal utilizes Rfam's CM database to search for potential ncRNA family members within a given transcriptomic sequence. By scanning the input sequence using the CMs, Infernal detects regions that exhibit sequence and structural similarities to known ncRNAs. This approach enables us to identify and characterize potential ncRNA candidates present in the 8,392,174 putative lncRNAs consensus predicted by 3 tools.

I have developed a Snakemake pipeline to automate the execution of the annotation of ncRNAs with Infernal/Rfam approach. This pipeline requires the cmscan tool from Infernal and both Rfam.cm and Rfam.clanin models from the Rfam database.

[!NOTE] The Z-score for the putative ncRNA consensus is 12.423724.

# determine the total database size for the genome you are annotating.

../../../Infernal/bin/esl-seqstat putative_ncRNA_consensus.fa

Format:              FASTA
Alphabet type:       DNA
Number of sequences: 8392174
Total # residues:    6211862334
Smallest:            301
Largest:             16294
Average length:      740.2

# cmscan Z score:

(6.211.862.334*2)/1.000.000 = 12.423724

[!NOTE] To speed up the execution of Infernal, I splitted the putative ncRNA consensus input file into 1000 parts before the execution of the Snakemake pipeline. This split was performed using this fasta-splitter by the following code:

./fasta-splitter.pl --n-parts 1000 --out-dir ./../data/ --line-length --nopad putative_ncRNA_consensus.fa

[!NOTE] The Snakefile pipeline also requires the following files: genotypes.csv and parts.csv. The genotypes.csv file contains the names of the genotypes separated by commas, in this case, genotype=putative_ncRNA_consensus, while parts.csv contains the quantity of parts into which the genotypes are divided to accelerate the pipeline execution (1-1000). The parts.csv can be generated as below:

echo $(seq -s',' 1 1000) > parts.csv

This script was used to run the automated pipeline using Snakemake for the annotation of ncRNAs families with the Infernal/Rfam approach. At the end of this pipeline, the files cmscan.tblout and cmscan.out were generated for each part of the putative ncRNA consensus file.

Plotting results

I have developed this Python script to plot the heatmap of ncRNAs annotated by Infernal/Rfam. To run this script, you need to generate the following input based on the tblout files generated by Infernal:

# concatenate all parts identified by Infernal/Rfam
for i in `ls -1 *.tblout`; do cat $i >> putative_ncRNA_consensus.complete.cmscan.tblout; done

# removing lower-scoring overlaps from the tblout file, as described here: https://docs.rfam.org/en/latest/genome-annotation.html#removing-lower-scoring-overlaps-from-a-tblout-file
grep -v "#" putative_ncRNA_consensus.complete.cmscan.tblout | grep -v " = " > putative_ncRNA_consensus.complete.deoverlapped.cmscan.tblout

# extracting results for each genotype
# OBS: "samples_batch.txt" is a file with the names of all 48 genotypes, one per line
for i in $(cat samples_batch.txt)`;do grep $i putative_ncRNA_consensus.complete.deoverlapped.cmscan.tblout >> ${i}.complete.deoverlapped.cmscan.tblout; done

[!NOTE] Then generate the final file to plot the heatmap of ncRNAs running this bash script.

After these steps, we obtain a file in the following format:

# extractRfamTypes.py will generate a file containing Rfam families and types for each genotype, for example:
cat B1.complete.deoverlapped.cmscan.RfamTypes.tblout
Family type,Count
 miRNA,180
 CD-box,766
 HACA-box,156
enod40,14
 tRNA,28
 sRNA,2
Intron,9
 splicing,4
 riboswitch,1

For each genotype we have a complete.deoverlapped.cmscan.RfamTypes.tblout. These files was used as input to plot the following heatmap displaying the distribution of ncRNA families identified by Infernal/Rfam: