Gene Ontology enrichment analysis: exploring biological function of modules - labbces/sugarcane_RNAome GitHub Wiki

Just as heatmaps make module expression easier to visualize, the functional enrichment of the proteins in a co-expression module can also be visualized.

GO Universe

To analyze the functional enrichment of co-expressed modules, it is necessary to have the annotation of GO terms for the universe of proteins in the sugarcane pan-transcriptome.

The annotation of GO terms for the proteins in the pan-transcriptome was generated for each of the 48 genotypes with PANNZER2 using this script.

Filtering GO terms by Positive Predictive Value (PPV)

PPV estimates the reliability of a predicted GO class: the likelihood that the class is correct or approximately correct.

"Approximately correct" refers to the predicted GO class being in proximity to the correct GO class in the GO tree (either a child or parent node of the correct GO class). The relationship between PPV and the Argot score was calibrated using a training set of proteins with known correct annotations. Read more here.

$PPV = \frac{\text{Good Stuff}}{\text{Good Stuff} + \text{Bad Stuff}}$

"Good Stuff" represents the number of correct or approximately correct predictions with a similar Argot Score.

"Bad Stuff" represents the number of clearly incorrect predictions with a similar Argot Score.

Initially, I chose to use PPV > 0.6. With thresholds above 0.7, the GO universe becomes too small for the pan-transcriptome, and with thresholds below 0.6, the predicted GO classes become less reliable.
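The trade-off can be illustrated with a minimal Python sketch that counts how many proteins keep at least one GO term at different PPV thresholds. The records, the tuple layout and the function name `annotated_proteins` are made up for illustration and do not reflect the actual PANNZER2 output format.

```python
# Toy predictions: (protein_id, go_id, ppv). All values are made up.
predictions = [
    ("prot1", "GO:0006412", 0.85),
    ("prot1", "GO:0008152", 0.45),
    ("prot2", "GO:0006355", 0.65),
    ("prot3", "GO:0006468", 0.55),
    ("prot4", "GO:0015979", 0.72),
]


def annotated_proteins(records, threshold):
    """Return the proteins that keep at least one GO term with PPV > threshold."""
    return {protein for protein, _go_id, score in records if score > threshold}


# A stricter threshold leaves fewer annotated proteins in the GO universe.
for threshold in (0.5, 0.6, 0.7):
    kept = annotated_proteins(predictions, threshold)
    print(f"PPV > {threshold}: {len(kept)} annotated proteins")
```

The comparison is strict (`>`), matching the PPV > 0.6 rule described in the text.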

I filtered the results from PANNZER2 based on PPV > 0.6 using this script.

GO classes are predicted per protein, so this script also merges the annotations of all similar proteins into genes within the pan-transcriptome.

# Inputs and outputs for processGeneOntologyAnnotation.py
annotation=GO_universe_annotation_list
panrnaome=panTranscriptome_panRNAomeClassificationTable_hyphen_Class.tsv
out=GO_annotations
ontology=BP   # Biological Process
threshold=(0.3 0.4 0.5 0.6 0.7 0.8 0.9)   # PPV cutoffs to compare

# Run the filter once per PPV cutoff
for i in "${threshold[@]}"
do
        echo "threshold = $i"
        ./processGeneOntologyAnnotation.py --annotation "$annotation" -p "$panrnaome" -o "$out" -og "$ontology" -ppv "$i"
done
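The merging step can be pictured with the following Python sketch, which takes the union of the GO terms of all proteins assigned to the same gene. It is not the code of processGeneOntologyAnnotation.py: the mapping `protein_to_gene`, the identifiers and the function `merge_by_gene` are hypothetical, and how the real script derives genes from the classification table is an assumption here.

```python
from collections import defaultdict

# Hypothetical protein-to-gene mapping and PPV-filtered GO terms per protein.
protein_to_gene = {"prot1": "geneA", "prot2": "geneA", "prot3": "geneB"}
protein_go = {
    "prot1": {"GO:0006412"},
    "prot2": {"GO:0006412", "GO:0006355"},
    "prot3": {"GO:0006468"},
}


def merge_by_gene(protein_go, protein_to_gene):
    """Union the GO terms of all proteins assigned to the same gene."""
    gene_go = defaultdict(set)
    for protein, terms in protein_go.items():
        gene = protein_to_gene.get(protein)
        if gene is None:
            continue  # proteins without a gene assignment are skipped
        gene_go[gene] |= terms
    return dict(gene_go)


gene_go = merge_by_gene(protein_go, protein_to_gene)
for gene, terms in sorted(gene_go.items()):
    # One tab-separated line per gene, GO terms joined by commas.
    print(f"{gene}\t{', '.join(sorted(terms))}")
```

Using sets means a GO term predicted for several proteins of the same gene is stored only once.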

Functional enrichment analysis

I've developed a script to conduct a functional enrichment analysis of each co-expression module. It applies Fisher's exact test, through the topGO library, to each module containing co-expressed coding and non-coding genes.
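To show what the test measures for a single GO term in a single module, here is a minimal Python sketch of a one-sided Fisher's exact test for over-representation, computed as a hypergeometric tail probability. It is not topGO's implementation and not the author's script; the function name, argument names and example counts are assumptions chosen for illustration.

```python
from math import comb


def fisher_enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test p-value, P(X >= k), for over-representation.

    N: genes in the GO universe
    K: universe genes annotated with the GO term
    n: module genes present in the universe
    k: module genes annotated with the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    # Sum the hypergeometric probabilities of seeing k or more annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical counts: 3 of 4 module genes carry a term found in 5 of 20 universe genes.
p_value = fisher_enrichment_pvalue(k=3, n=4, K=5, N=20)
print(f"p = {p_value:.4f}")
```

In practice this is repeated for every GO term in every module, so the resulting p-values usually need to be interpreted with multiple testing in mind.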

Exploring the frequency of enriched GO terms in modules with long non-coding genes

The frequency of enriched GO terms in modules containing lncRNAs from the three datasets was calculated using this script, resulting in the following most frequent functions (top 70):

Hoang 2017

Correr 2020

Perlo 2022
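The counting step behind these rankings can be sketched in Python as follows. The input dictionary, the identifiers and the function `top_enriched_terms` are hypothetical; the sketch assumes the enriched GO terms of each lncRNA-containing module have already been collected, and it is not the author's script.

```python
from collections import Counter

# Hypothetical input: module ID -> GO terms enriched in that module.
enriched_by_module = {
    "module1": ["GO:0006412", "GO:0006355"],
    "module2": ["GO:0006412"],
    "module3": ["GO:0006412", "GO:0015979", "GO:0006355"],
}


def top_enriched_terms(enriched_by_module, top_n=70):
    """Count in how many modules each GO term is enriched and return the top_n."""
    counts = Counter()
    for terms in enriched_by_module.values():
        counts.update(set(terms))  # count each term at most once per module
    return counts.most_common(top_n)


top_terms = top_enriched_terms(enriched_by_module)
for go_id, n_modules in top_terms:
    print(go_id, n_modules)
```

Running the same function once per dataset would give one ranking for each of the three studies.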