Finding conserved modules of co‐expression (complete subgraphs ‐ cliques) - labbces/sugarcane_RNAome GitHub Wiki

The clustering of gene expression results in distinct co-expression gene networks across the three datasets (varying in levels of inflation and cluster size), as detailed here.

Note: The 'efficiency' of the MCL clustering does not single out one 'best' clustering, as all clusterings appear to be at least acceptable. The efficiency values do, however, help illustrate the relative advantages of each clustering.

As clustering for the 10 inflation values resulted in distinct co-expression gene modules across the three datasets, we decided to develop an approach to identify highly conserved modules across all inflation values – modules where the gene set remains similar regardless of the inflation value used.

To identify conserved modules, I opted to analyze module overlap by computing the Jaccard and Overlap coefficients.

The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:

$J(A, B) = \frac{{|A \cap B|}}{{|A \cup B|}}$

Some modules are quite large, and consequently the Jaccard coefficient gives low similarity values for pairs of very different sizes, even when the smaller module is entirely contained within the larger one.

Therefore, I also computed the overlap coefficient, which is not penalized by a difference in set sizes. It is calculated by dividing the size of the intersection of the sets by the size of the smaller of the two sets:

$O(A, B) = \frac{{|A \cap B|}}{{\min(|A|, |B|)}}$

Using the smaller set size as the denominator shows whether one set is a subset of the other: the coefficient equals 1 exactly when the smaller set is entirely contained in the larger one.
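To make the difference between the two coefficients concrete, here is a minimal Python sketch. It is not the script referenced in this page; the function names and the gene identifiers are hypothetical, and treating empty sets as having a coefficient of 0.0 is an assumption made only to avoid division by zero.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|); returns 0.0 when either set is empty (assumption)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical modules: a 5-gene module fully contained in a 100-gene module.
large = {f"gene{i}" for i in range(100)}
small = {f"gene{i}" for i in range(5)}

print(jaccard(small, large))  # 0.05
print(overlap(small, large))  # 1.0
```

The small module is a strict subset of the large one, yet its Jaccard coefficient is only 0.05, while the overlap coefficient reaches 1.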

I developed this script to calculate the Jaccard and Overlap coefficients between the clusters generated by MCL.

With the overlap values of the modules, it's possible to identify conserved modules through clique searching.

I developed this script, which uses the igraph weighted cliques function to find cliques among modules with an overlap coefficient > 0.7. The result of this script is a file containing the names of the conserved clusters (cliques).
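The clique search can be illustrated with a small, dependency-free sketch. Note that this does not call igraph: it swaps in a basic Bron-Kerbosch enumeration of maximal cliques on an unweighted graph, so it only approximates what the script does. The module names, gene identifiers, and the rule that a conserved clique needs at least two modules are all assumptions made for illustration.

```python
from itertools import combinations


def overlap(a, b):
    """Overlap coefficient; 0.0 if either set is empty (assumption)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def build_graph(modules, threshold=0.7):
    """Adjacency sets linking modules whose overlap coefficient exceeds threshold."""
    adj = {name: set() for name in modules}
    for u, v in combinations(modules, 2):
        if overlap(modules[u], modules[v]) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v is fully explored
            x = x | {v}  # exclude v from later cliques

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical modules from three inflation values.
modules = {
    "infl1.4_m1": {"g1", "g2", "g3", "g4"},
    "infl2.0_m1": {"g1", "g2", "g3"},
    "infl4.0_m1": {"g1", "g2", "g3", "g5"},
    "infl2.0_m2": {"g7", "g8"},
}

adj = build_graph(modules, threshold=0.7)
# Assumption: a conserved clique must contain at least two modules.
conserved = [c for c in maximal_cliques(adj) if len(c) >= 2]
print(conserved)
```

Here the three `m1` modules overlap pairwise above 0.7 and form one clique, while `infl2.0_m2` shares no genes with them and is left out.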

To extract genes from the conserved clusters, I developed this script. The result is one file for each conserved clique and its respective genes.

This result also includes information about each gene's relationship within the module:

intersection, if the gene is part of the intersection of the modules (the gene is present in all cliques with overlap > 0.7), or;

disjoint, if the gene is not present in the intersection of the modules.
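The labelling above can be sketched as follows. This is a hypothetical illustration rather than the script itself: it assumes that "intersection" means the gene occurs in every module of a given clique, and the function name, module names, and gene identifiers are invented for the example.

```python
def label_genes(clique_modules):
    """Map each gene of a clique to 'intersection' or 'disjoint'.

    clique_modules: dict of module name -> set of genes (must be non-empty).
    Assumption: 'intersection' means the gene is in every module of the clique.
    """
    gene_sets = list(clique_modules.values())
    core = set.intersection(*gene_sets)
    all_genes = set.union(*gene_sets)
    return {
        gene: ("intersection" if gene in core else "disjoint")
        for gene in sorted(all_genes)
    }


# Hypothetical conserved clique made of three modules.
clique = {
    "infl1.4_m1": {"g1", "g2", "g3", "g4"},
    "infl2.0_m1": {"g1", "g2", "g3"},
    "infl4.0_m1": {"g1", "g2", "g3", "g5"},
}

labels = label_genes(clique)
for gene, label in labels.items():
    print(gene, label)
```

In this example `g1`, `g2`, and `g3` are labelled intersection, while `g4` and `g5` appear in only one module each and are labelled disjoint.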

These conserved co-expressed gene modules are used in subsequent steps for module expression visualization and annotation of module functions.