functional_enrichment - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki
GOseq to Compute Functional Enrichment of Gene Ontology Categories
The Trinotate report above contains Gene Ontology identifiers mapped to transcripts. These are derived from SWISSPROT records of matching proteins as well as assigned according to Pfam domain content. The Gene Ontology structured classifications will allow us to explore whether we have statistically significant functional enrichment based on gene sets - such as for those genes that were identified to be differentially expressed.
While still in the edgeR_trans/ directory (verify this by 'pwd'), do the following.
Generate a listing of all GO identifiers assigned to each transcript like so:
% /usr/local/src/Trinotate/util/extract_GO_assignments_from_Trinotate_xls.pl \
--Trinotate_xls ../Trinotate/Trinotate.xls -T -I > Trinotate.xls.gene_ontology
If you see warning messages 'cannot parse parent info', simply ignore them.
and take a look at the output file:
% less Trinotate.xls.gene_ontology
.
TRINITY_DN101_c0_g1_i1 GO:0003674,GO:0003824,GO:0005488,GO:0005506,GO:0005575,GO:0005759,GO:0005829,GO:0006810,GO:0006950,GO:0006979,GO:0008150,GO:0008941,GO:0009636,GO:0009987,GO:0015669,GO:0015671,GO:0016491,GO:0016661,GO:0016662,GO:0016705,GO:0016708,GO:0016966,GO:0019825,GO:0020037,GO:0031974,GO:0033554,GO:0034599,GO:0042221,GO:0043167,GO:0043169,GO:0043233,GO:0044422,GO:0044424,GO:0044429,GO:0044444,GO:0044446,GO:0044464,GO:0044699,GO:0044765,GO:0046872,GO:0046906,GO:0046914,GO:0050896,GO:0051179,GO:0051213,GO:0051234,GO:0051716,GO:0070013,GO:0070887,GO:0097159,GO:1901363,GO:1902578
TRINITY_DN102_c0_g1_i1 GO:0002181,GO:0003674,GO:0003735,GO:0005198,GO:0005575,GO:0005840,GO:0006412,GO:0008150,GO:0008152,GO:0009058,GO:0009059,GO:0009987,GO:0015934,GO:0019538,GO:0022625,GO:0030529,GO:0032991,GO:0034645,GO:0043170,GO:0043226,GO:0043228,GO:0043229,GO:0043232,GO:0044237,GO:0044238,GO:0044249,GO:0044260,GO:0044267,GO:0044391,GO:0044422,GO:0044424,GO:0044444,GO:0044445,GO:0044446,GO:0044464,GO:0071704,GO:1901576
TRINITY_DN103_c0_g1_i1 GO:0003674,GO:0003824,GO:0004130,GO:0004601,GO:0005488,GO:0005575,GO:0005759,GO:0005829,GO:0006950,GO:0006979,GO:0008150,GO:0016209,GO:0016491,GO:0016684,GO:0020037,GO:0031974,GO:0043167,GO:0043169,GO:0043233,GO:0044422,GO:0044424,GO:0044429,GO:0044444,GO:0044446,GO:0044464,GO:0046872,GO:0046906,GO:0050896,GO:0070013,GO:0097159,GO:1901363
...
Now that we have our list of transcript-to-GO assignment mappings, let's now explore functional enrichments of our DE transcripts.
We'll use the GOseq software to test for GO enrichment.
First, let's return to the edgeR_trans/ directory where our DE results exist:
% cd ~/edgeR_trans
One of the key merits of GOseq is that it takes into account biases in transcript feature lengths according to functional categories (some functional categories have many long or short proteins, which biases their representation as detected in DE data). Generate a file containing the lengths of each of our assembled transcripts like so:
% $TRINITY_HOME/util/misc/fasta_seq_length.pl \
../Trinity.fasta > Trinity.seqLengths
Take a glance at the file we just generated:
% less Trinity.seqLengths
.
fasta_entry length
TRINITY_DN144_c0_g1_i1 199
TRINITY_DN179_c0_g1_i1 169
TRINITY_DN159_c0_g1_i1 524
TRINITY_DN153_c0_g1_i1 182
TRINITY_DN130_c0_g1_i1 254
Now, run GOseq on those transcripts that were defined as differentially expressed according to our earlier thresholds
% $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
--matrix ../Trinity_trans.TMM.EXPR.matrix \
--samples ../samples.txt \
-P 1e-3 -C 2 \
--examine_GO_enrichment \
--GO_annots Trinotate.xls.gene_ontology --gene_lengths Trinity.seqLengths
See which files were just generated as a result:
% ls -ltr
You should see GOseq.enriched and GOseq.depleted files for each of the sets of significantly upregulated transcripts. If the set of upregulated transcripts is functionally enriched for certain Gene Ontology terms, they'll be reported in the .enriched file with significant enrichment FDR values. If that same set of upregulated transcripts happens to be significantly depleted for certain functions, these functions will be identified in the corresponding .depleted file.
Functional enrichment at the gene level
It's better to do the functional enrichment at the gene level to avoid biasing the statistics towards genes with many alternatively spliced isoforms.
To do so, enter your edgeR_genes/ directory.
% cd ~/workspace/edgeR_genes/
Extract the GO terms based on the gene identifiers:
% /usr/local/src/Trinotate/util/extract_GO_assignments_from_Trinotate_xls.pl \
--Trinotate_xls ../Trinotate/Trinotate.xls -G -I > Trinotate.xls.gene_ontology
Create a 'gene lengths' file as the expression-weighted average of transcript isoform lengths like so:
% ${TRINITY_HOME}/util/misc/TPM_weighted_gene_length.py \
--gene_trans_map ../Trinity.fasta.gene_trans_map \
--trans_lengths ../edgeR_trans/Trinity.seqLengths \
--TPM_matrix ../Trinity_trans.TMM.EXPR.matrix > Trinity.gene_lengths.txt
Now you can run the GOseq enrichment analysis:
% $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
--matrix ../Trinity_genes.TMM.EXPR.matrix \
--samples ../samples.txt \
-P 1e-3 -C 2 \
--examine_GO_enrichment \
--GO_annots Trinotate.xls.gene_ontology \
--gene_lengths Trinity.gene_lengths.txt
Do results concur with previously published study?
In the Linde et al. "Defining the transcriptomic landscape of Candida glabrata by RNA-Seq" paper they mention "Within the GO category ‘heme binding’, 11 of 20 genes are DEGs (adjusted P-value = 6.19 × 10−3) indicating that C. glabrata cells reorder their iron homoeostasis in response to pH changes."
Examining the various comparisons generating the "*UP.subset.GOseq.enriched" files, do you find 'heme' to be a category enriched? Is the 'heme' category up-regulated or down-regulated in the ph8 condition?
Also, in the paper they mention "During nitrosative stress ... One of the most significantly enriched molecular functions is ‘RNA polymerase III activity’ ... which indicates that fungal cells strongly and quickly adapt to the new environment."
Did you observe similar findings?
Epilogue
If you've gotten this far, hurray!!! Congratulations!!! You've now experienced the full tour of Trinity and TrinotateWeb. Visit our web documentation at http://trinityrnaseq.github.io, and join our Google group to become part of the ever-growing Trinity user community.