functional_enrichment - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

GOseq to Compute Functional Enrichment of Gene Ontology Categories

The Trinotate report above contains Gene Ontology identifiers mapped to transcripts. These are derived from SWISSPROT records of matching proteins as well as assigned according to Pfam domain content. The Gene Ontology structured classifications will allow us to explore whether we have statistically significant functional enrichment based on gene sets - such as for those genes that were identified to be differentially expressed.

While still in the edgeR_trans/ directory (verify this by 'pwd'), do the following.

Generate a listing of all GO identifiers assigned to each transcript like so:

% /usr/local/src/Trinotate/util/extract_GO_assignments_from_Trinotate_xls.pl \
   --Trinotate_xls ../Trinotate/Trinotate.xls -T -I > Trinotate.xls.gene_ontology

If you see warning messages 'cannot parse parent info', simply ignore them.

and take a look at the output file:

% less Trinotate.xls.gene_ontology

.

TRINITY_DN101_c0_g1_i1  GO:0003674,GO:0003824,GO:0005488,GO:0005506,GO:0005575,GO:0005759,GO:0005829,GO:0006810,GO:0006950,GO:0006979,GO:0008150,GO:0008941,GO:0009636,GO:0009987,GO:0015669,GO:0015671,GO:0016491,GO:0016661,GO:0016662,GO:0016705,GO:0016708,GO:0016966,GO:0019825,GO:0020037,GO:0031974,GO:0033554,GO:0034599,GO:0042221,GO:0043167,GO:0043169,GO:0043233,GO:0044422,GO:0044424,GO:0044429,GO:0044444,GO:0044446,GO:0044464,GO:0044699,GO:0044765,GO:0046872,GO:0046906,GO:0046914,GO:0050896,GO:0051179,GO:0051213,GO:0051234,GO:0051716,GO:0070013,GO:0070887,GO:0097159,GO:1901363,GO:1902578
TRINITY_DN102_c0_g1_i1  GO:0002181,GO:0003674,GO:0003735,GO:0005198,GO:0005575,GO:0005840,GO:0006412,GO:0008150,GO:0008152,GO:0009058,GO:0009059,GO:0009987,GO:0015934,GO:0019538,GO:0022625,GO:0030529,GO:0032991,GO:0034645,GO:0043170,GO:0043226,GO:0043228,GO:0043229,GO:0043232,GO:0044237,GO:0044238,GO:0044249,GO:0044260,GO:0044267,GO:0044391,GO:0044422,GO:0044424,GO:0044444,GO:0044445,GO:0044446,GO:0044464,GO:0071704,GO:1901576
TRINITY_DN103_c0_g1_i1  GO:0003674,GO:0003824,GO:0004130,GO:0004601,GO:0005488,GO:0005575,GO:0005759,GO:0005829,GO:0006950,GO:0006979,GO:0008150,GO:0016209,GO:0016491,GO:0016684,GO:0020037,GO:0031974,GO:0043167,GO:0043169,GO:0043233,GO:0044422,GO:0044424,GO:0044429,GO:0044444,GO:0044446,GO:0044464,GO:0046872,GO:0046906,GO:0050896,GO:0070013,GO:0097159,GO:1901363
...

Now that we have our list of transcript-to-GO assignment mappings, let's now explore functional enrichments of our DE transcripts.

We'll use the GOseq software to test for GO enrichment.

First, let's return to the edgeR_trans/ directory where our DE results exist:

% cd ~/edgeR_trans

One of the key merits of GOseq is that it takes into account biases in transcript feature lengths according to functional categories (some functional categories have many long or short proteins, which biases their representation as detected in DE data). Generate a file containing the lengths of each of our assembled transcripts like so:

% $TRINITY_HOME/util/misc/fasta_seq_length.pl \
     ../Trinity.fasta > Trinity.seqLengths

Take a glance at the file we just generated:

% less Trinity.seqLengths

.

fasta_entry             length
TRINITY_DN144_c0_g1_i1  199
TRINITY_DN179_c0_g1_i1  169
TRINITY_DN159_c0_g1_i1  524
TRINITY_DN153_c0_g1_i1  182
TRINITY_DN130_c0_g1_i1  254

Now, run GOseq on those transcripts that were defined as differentially expressed according to our earlier thresholds

%  $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
      --matrix ../Trinity_trans.TMM.EXPR.matrix \
      --samples ../samples.txt \
      -P 1e-3 -C 2 \
      --examine_GO_enrichment \
      --GO_annots Trinotate.xls.gene_ontology --gene_lengths Trinity.seqLengths

See which files were just generated as a result:

% ls -ltr

You should see GOseq.enriched and GOseq.depleted files for each of the sets of significantly upregulated transcripts. If the set of upregulated transcripts is functionally enriched for certain Gene Ontology terms, they'll be reported in the .enriched file with significant enrichment FDR values. If that same set of upregulated transcripts happens to be significantly depleted for certain functions, these functions will be identified in the corresponding .depleted file.

Functional enrichment at the gene level

It's better to do the functional enrichment at the gene level to avoid biasing the statistics towards genes with many alternatively spliced isoforms.

To do so, enter your edgeR_genes/ directory.

% cd ~/workspace/edgeR_genes/

Extract the GO terms based on the gene identifiers:

% /usr/local/src/Trinotate/util/extract_GO_assignments_from_Trinotate_xls.pl \
   --Trinotate_xls ../Trinotate/Trinotate.xls -G -I > Trinotate.xls.gene_ontology

Create a 'gene lengths' file as the expression-weighted average of transcript isoform lengths like so:

% ${TRINITY_HOME}/util/misc/TPM_weighted_gene_length.py \
     --gene_trans_map ../Trinity.fasta.gene_trans_map \
     --trans_lengths ../edgeR_trans/Trinity.seqLengths \
     --TPM_matrix ../Trinity_trans.TMM.EXPR.matrix  > Trinity.gene_lengths.txt

Now you can run the GOseq enrichment analysis:

%  $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
        --matrix ../Trinity_genes.TMM.EXPR.matrix  \
        --samples ../samples.txt  \
        -P 1e-3 -C 2   \
        --examine_GO_enrichment  \
        --GO_annots Trinotate.xls.gene_ontology \
        --gene_lengths Trinity.gene_lengths.txt

Do results concur with previously published study?

In the Linde et al. "Defining the transcriptomic landscape of Candida glabrata by RNA-Seq" paper they mention "Within the GO category ‘heme binding’, 11 of 20 genes are DEGs (adjusted P-value = 6.19 × 10−3) indicating that C. glabrata cells reorder their iron homoeostasis in response to pH changes."

Examining the various comparisons generating the "*UP.subset.GOseq.enriched" files, do you find 'heme' to be a category enriched? Is the 'heme' category up-regulated or down-regulated in the ph8 condition?

Also, in the paper they mention "During nitrosative stress ... One of the most significantly enriched molecular functions is ‘RNA polymerase III activity’ ... which indicates that fungal cells strongly and quickly adapt to the new environment."

Did you observe similar findings?

Epilogue

If you've gotten this far, hurray!!! Congratulations!!! You've now experienced the full tour of Trinity and TrinotateWeb. Visit our web documentation at http://trinityrnaseq.github.io, and join our Google group to become part of the ever-growing Trinity user community.