Inference of the pan‐ncRNAome - labbces/sugarcane_RNAome GitHub Wiki
Inference of the pan‐ncRNAome
We have developed a pipeline to infer the pan-ncRNAome. The pipeline follows the steps outlined below:
Convert the MMseqs2 similarity list into a matrix similar to OrthoFinder
This step converts the cluster file generated by MMseqs2 into the same layout as OrthoFinder's clustering output, in which each cluster occupies a row, each genotype occupies a column, and each cell contains the sequence ID of the genotype within that cluster. See this example.
This format conversion is performed with this script.
[!NOTE] Before running this script, it is necessary to carry out an intermediate step to keep only the genotype names in the second column of the clusters file generated by MMseqs2, using the following code:
awk -F '\t' 'BEGIN {OFS = FS} {split($2, arr, "_"); $2 = arr[1]; print}' DB_clust.tsv >> DB_clust_genotypeName.tsv
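For readers less familiar with awk, the same rule can be sketched in Python. This is only an illustration of what the one-liner does, not part of the pipeline; the function name `keep_genotype_name` is hypothetical. It assumes, as the awk command does, that the genotype name is everything before the first underscore of the sequence ID.

```python
import csv
import io

def keep_genotype_name(rows):
    """Yield (representative, genotype) pairs from (representative, member) pairs."""
    for representative, member in rows:
        # Same rule as the awk one-liner: keep the text before the first underscore.
        yield representative, member.split("_", 1)[0]

# Two lines in the MMseqs2 cluster format shown in this page (tab-separated).
example = (
    "Co06022_k25_TRINITY_DN6264_c0_g2_i8\tCo06022_k25_TRINITY_DN6264_c0_g2_i10\n"
    "Co06022_k25_TRINITY_DN6264_c0_g2_i8\tCo06022_k31_TRINITY_DN12248_c0_g2_i5\n"
)
reader = csv.reader(io.StringIO(example), delimiter="\t")
converted = list(keep_genotype_name(reader))
for representative, genotype in converted:
    print(representative, genotype, sep="\t")
```

One consequence of this rule is that a genotype name containing an underscore would be truncated at that underscore; names with hyphens, such as CP74-2005, are unaffected.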
1. Before format conversion (original MMseqs2 clustering output format):
Co06022_k31_TRINITY_DN17309_c0_g2_i1 | Co06022_k31_TRINITY_DN17309_c0_g2_i1 |
---|---|
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i8 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i10 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i4 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k31_TRINITY_DN12248_c0_g2_i5 |
2. Intermediate step (MMseqs2 clustering output format with only genotype names):
Co06022_k31_TRINITY_DN17309_c0_g2_i1 | Co06022 |
---|---|
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
3. After format conversion (the output is an OrthoFinder-like matrix):
Orthogroup | B1 | B2 | CP74-2005 | Co06022 | group_name |
---|---|---|---|---|---|
OG1 | 0 | 0 | 0 | 1 | OG1 |
OG2 | 0 | 0 | 0 | 4 | OG2 |
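The conversion can be sketched as follows. This is a minimal illustration and not the repository's script: the function name `clusters_to_matrix` is hypothetical, cells are assumed to hold member counts (as in the example matrix above), clusters are numbered in order of first appearance, and the trailing `group_name` column is omitted.

```python
from collections import Counter, defaultdict

def clusters_to_matrix(pairs):
    """Build (genotypes, rows) from (representative, genotype) pairs.

    Each row is [orthogroup_label, count_for_genotype_1, count_for_genotype_2, ...].
    """
    counts = defaultdict(Counter)  # representative -> genotype -> member count
    for representative, genotype in pairs:
        counts[representative][genotype] += 1
    genotypes = sorted({g for per_genotype in counts.values() for g in per_genotype})
    rows = []
    # dicts preserve insertion order, so clusters are numbered as first seen.
    for index, per_genotype in enumerate(counts.values(), start=1):
        rows.append([f"OG{index}"] + [per_genotype.get(g, 0) for g in genotypes])
    return genotypes, rows

# The intermediate-step rows shown above.
pairs = [
    ("Co06022_k31_TRINITY_DN17309_c0_g2_i1", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
]
genotypes, rows = clusters_to_matrix(pairs)
print("\t".join(["Orthogroup"] + genotypes))
for row in rows:
    print("\t".join(map(str, row)))
```

With only these five input rows the sketch produces a single `Co06022` column; the other genotype columns in the example matrix (B1, B2, CP74-2005) appear because the real input contains sequences from every genotype.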
Manually remove the last column (group_name) from the file DB_clust_groups.tsv
This step is necessary to avoid including duplicate information regarding the cluster names. See this example.
cut -f1-49 DB_clust_groups.tsv >> DB_clust_groups_withoutLastColumn.tsv
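The `cut` command above depends on the matrix having exactly 50 columns. As a possible alternative, the column could be dropped by name so that the step does not depend on the number of genotypes. The sketch below is an assumption-laden illustration, not part of the pipeline; `drop_column` is a hypothetical helper and the input is an in-memory copy of the example matrix.

```python
import csv
import io

def drop_column(lines, name, delimiter="\t"):
    """Return all rows (lists of strings) without the column called `name`."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    position = header.index(name)  # raises ValueError if the column is absent
    kept = [header[:position] + header[position + 1:]]
    for row in reader:
        kept.append(row[:position] + row[position + 1:])
    return kept

# In-memory stand-in for DB_clust_groups.tsv, using the example matrix.
example = io.StringIO(
    "Orthogroup\tB1\tB2\tCP74-2005\tCo06022\tgroup_name\n"
    "OG1\t0\t0\t0\t1\tOG1\n"
    "OG2\t0\t0\t0\t4\tOG2\n"
)
trimmed = drop_column(example, "group_name")
for row in trimmed:
    print("\t".join(row))
```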
Calculate pan-ncRNAome classes (pan, exclusive, accessory, soft-core and hard-core genes)
We developed a Python script that assigns each gene to one of the following classes:
- Pan (the sum of all classes)
- Exclusive (genes present in only one genotype)
- Accessory (genes present in more than one genotype and in up to 80% of the genotypes)
- Soft-core (genes present in more than 80% of the genotypes, but not in all of them)
- Hard-core (genes present in 100% of the genotypes)
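The class definitions above can be sketched as a small function. This is a minimal illustration, not the repository's script: the name `classify` and its `soft_core_percent` parameter are hypothetical. It assumes that a group present in exactly 80% of the genotypes counts as accessory, following the summary on this page that places accessory genes in "2 to 38 (80%)" genotypes and soft-core genes in more than 38.

```python
from collections import Counter

def classify(counts, soft_core_percent=80):
    """Classify one matrix row from its per-genotype member counts."""
    total = len(counts)
    present = sum(1 for c in counts if c > 0)
    if present == 0:
        raise ValueError("group is absent from every genotype")
    if present == total:
        return "hard-core"
    if present == 1:
        return "exclusive"
    # Integer comparison avoids floating-point rounding at the 80% boundary.
    if present * 100 <= soft_core_percent * total:
        return "accessory"
    return "soft-core"

# Toy matrix with 10 genotypes; each row holds member counts for one group.
matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 10 genotypes
    [2, 1, 1, 1, 1, 1, 1, 3, 0, 0],  # 8 of 10 genotypes (exactly 80%)
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # 9 of 10 genotypes
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # 10 of 10 genotypes
]
classes = [classify(row) for row in matrix]
summary = Counter(classes)
summary["pan"] = len(matrix)  # pan is the sum of all classes
print(classes)
print(dict(summary))
```

In this sketch the hard-core test runs first, so a matrix with a single genotype would label every group hard-core rather than exclusive; that ordering is a choice made here for illustration.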
[!NOTE] This Bash script was used to run the classification script on the reformatted similarity matrix above, generating this file.
Run the script to plot the pan-ncRNAome
This script generates strip plots of the gene classes, with the number of genes represented on the Y-axis and the number of genotypes on the X-axis.
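The data behind such a plot can be sketched as a tally of how many genes are present in exactly k genotypes. This is an assumption about what the plot summarizes, not the plotting script itself; `genes_per_genotype_count` is a hypothetical name, and the resulting pairs could be passed to any plotting library.

```python
from collections import Counter

def genes_per_genotype_count(matrix_rows):
    """Map number of genotypes present -> number of groups with that presence."""
    tally = Counter()
    for counts in matrix_rows:
        present = sum(1 for c in counts if c > 0)
        tally[present] += 1
    return dict(sorted(tally.items()))

# Toy matrix with 4 genotypes; each row holds member counts for one group.
rows = [
    [0, 0, 0, 1],
    [0, 0, 0, 4],
    [1, 2, 0, 1],
    [3, 1, 1, 1],
]
tally = genes_per_genotype_count(rows)
for n_genotypes, n_genes in tally.items():
    print(n_genotypes, n_genes)  # X value, Y value
```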
At the end of the pipeline, we obtain a representation of the group classes as follows:
[!NOTE] The pan-RNAome comprises 3,407,188 genes, of which 1,377,852 (5,914,246 transcripts) are accessory, present in 2 to 38 (80%) of the genotypes. Additionally, there are 2,029,324 genes (2,476,367 transcripts) classified as exclusive, unique to a single genotype. Furthermore, 12 genes (1,561 transcripts) are classified as soft-core, present in more than 38 (>80%) genotypes. Notably, no genes were categorized as hard-core, present in all evaluated genotypes.
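The reported figures can be cross-checked with a few lines of arithmetic. The gene and transcript counts are taken from the note above. The genotype count of 48 is an assumption: it is inferred from the 49 columns kept by the `cut` command (one `Orthogroup` column plus the genotypes) and is not stated on this page.

```python
# Gene counts per class, as reported above.
accessory, exclusive, soft_core, hard_core = 1_377_852, 2_029_324, 12, 0
pan = accessory + exclusive + soft_core + hard_core
print(pan)  # 3407188, matching the reported pan total

# Transcript counts per class, as reported above (the total is derived here).
transcripts = 5_914_246 + 2_476_367 + 1_561
print(transcripts)  # 8392174

# Assumption: 48 genotypes. 80% of 48 is 38.4, so the cutoff is 38 genotypes.
n_genotypes = 48
accessory_max = (80 * n_genotypes) // 100
print(accessory_max)  # 38
```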