WGCNA stage 2: Network construction - Persilian/WGCNA GitHub Wiki

Prepare the WGCNA_stage2.R script

If you have successfully installed the R-packages necessary to run the script WGCNA_stage1.R, you can leave this section in the WGCNA_stage2.R script commented.

Use the soft-threshold power determined in WGCNA_stage1.R and adapt the WGCNA_stage2.R script accordingly.

Use "pearson"-correlation, since it will compute the fit of expression profiles to a linear correlation, making the R^2-values between expression profiles more intuitive. Additionally this is more conservative than "spearman"-correlation, since hard to interpret non-linear relationships between expression profiles will not be part of the best correlation results.

Note the discussion on "signed" and "unsigned" networks. The default of the network construction command "blockwiseModules()" is "signed" networks, which means that negatively correlated gene expression profiles are not considered connected. Biologically it does however make sense to create "unsigned" networks, where negatively correlated genes are connected. Imagine transcription factors down-regulating target genes. If you have a very specific set of genes in mind, it can make sense to create "signed" networks, but for entire transcriptomes I recommend "unsigned" networks.

Depending on how much RAM you have available, adjust the "maxBlockSize" parameter, with large values demanding large amounts of RAM and increasing computational time dramatically. However, it is preferable to compute as many genes together as possible. For maxBlockSize-values around 10'000, at least 30Gb of RAM are recommended.

Choose a "minModuleSize" according to your analysis goals. The network will comprise of "clusters" (modules) of genes, whose expression profiles will correlate more among each other, than with expression profiles of genes in other modules. The larger the minimum module size, the more likely it is that genes do not actually belong together in a biological sense. While a too small minimum module size will separate genes that may belong together. A "minModuleSize = 30" is a good default value for RNAseq data, since small pathways may comprise of at least this number of genes.

Choose a "mergeCutHeight" that will output a network with a reasonable amount of modules. For example, "mergeCutHeight = 0.1" means that modules that are 90% or more similar to each other, will be merged together into a single module. Hence if you feel you have too many modules as an output, you can reduce their number at the price of resolving expression profiles individually. A "mergeCutHeight" of up to 0.25 can be reasonable.

In case you want to redo the network with different values for "minModuleSize" and, or "mergeCutHeight", you don't have to re-compute the resource-intensive topological overlap matrices (TOM files) again. Therefore, at first time network generation use "loadTOM = FALSE", but "saveTOMs = TRUE" and accordingly set the path for the TOM files with "saveTOMFileBase = /path/to/TOM/files"

Generate a weighted gene co-expression network

As a default you can use the following code (in WGCNA_stage2.R) to generate a network from RNAseq data.

net = blockwiseModules(data, power = 1, corType = "pearson", networkType = "unsigned",
                    nThreads = nCPU,
                    maxBlockSize = 10000,
                    TOMType = "unsigned", minModuleSize = 30,
                    reassignThreshold = 0, mergeCutHeight = 0.1,
                    numericLabels = TRUE, pamRespectsDendro = FALSE,
                    #loadTOM = FALSE,
                    saveTOMs = TRUE,
                    saveTOMFileBase = "Data/TOM.mergecutheight.0.1.WGCNA_course_Net1",
                    verbose = 3)

Output

In your working directory's "Data" folder you can now find the TOM files with the prefix you specified in "saveTOMFileBase". Load them from here, using "loadTOM = TRUE" and "saveTOM = FALSE" if you want to re-compute the network without the lengthy computations of the TOM files.

In the Data directory you will find the WGCNA_course_Net1.RData, containing the network data including your choices for "minModuleSize", "mergeCutHeight" and other parameters.

In the Data directory you will find a "Gene_count.txt" containing a list of the modules with the numbers of genes in each module. Module "0" is the "grey module" and represents a trashcan module with all genes that have not been assigned to any other module due to dissimilar expression profiles. For subsequent analyses, these genes can be ignored.

In Data/Gene_lists, you will find a list of all genes for each module. The modules are assigned color names by default.

In the "Plots" directory you will find visualizations of the output in the form of the Module_dendrogram and the Module-Eigengene_dendrogram. The former shows an overview of the individual genes to their module color. The latter shows a correlation between the modules' Eigengenes. An eigengene is an artificial gene expression profile representing the first principal component (PC) of each module, that is an expression profile explaining most of the variation of each module.