trinotate - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

Preparing and Generating a Trinotate Annotation Report

To generate a Trinotate annotation report, we first load all of our bioinformatics results into a Trinotate SQLite database. The Trinotate software provides a boilerplate SQLite database called 'Trinotate.sqlite' that comes pre-populated with generic data about SwissProt records and Pfam domains. Below, we'll populate this database with our computed results and our expression data.

Preparing Trinotate (loading the database)

As a sanity check, be sure you're currently located in your 'Trinotate/' working directory.

% pwd

/home/training/workspace/Trinotate

Trinotate stores its data in an SQLite relational database. The 'boilerplate' database comes preloaded with functional data from SwissProt, Pfam, and other sources.

Copy the boilerplate database over to your Trinotate/ area:

%  cp ~/shared_ro/Trinotate_v3.sqlite Trinotate.sqlite

Next, load your Trinotate.sqlite database with your Trinity transcripts and predicted protein sequences:

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite init \
   --gene_trans_map ../Trinity.fasta.gene_trans_map \
   --transcript_fasta ../Trinity.fasta \
   --transdecoder_pep Trinity.fasta.transdecoder.pep

Load in the various outputs generated earlier:

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite \
       LOAD_swissprot_blastx swissprot.blastx.outfmt6

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite \
       LOAD_swissprot_blastp swissprot.blastp.outfmt6

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_tmhmm tmhmm.out

%  /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_signalp signalp.out
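The load commands above assume that every input file is already in place. As a minimal sketch (the helper name `check_inputs` is hypothetical and not part of Trinotate), a small shell function can flag files that are missing or empty before you start loading. Note that an empty file is not always an error; treat the output as a prompt to look, not as a verdict.

```shell
# check_inputs: hypothetical helper, not part of Trinotate.
# Prints one "MISSING: <path>" line for each argument that does not exist
# or has zero size, and returns non-zero if any such file was found.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING: $f"
            rc=1
        fi
    done
    return $rc
}

# Example call from the Trinotate/ working directory, using the file names
# from the commands above:
#   check_inputs Trinotate.sqlite ../Trinity.fasta ../Trinity.fasta.gene_trans_map \
#       Trinity.fasta.transdecoder.pep swissprot.blastx.outfmt6 \
#       swissprot.blastp.outfmt6 TrinotatePFAM.out tmhmm.out signalp.out
```

If the function prints nothing, all listed files exist and are non-empty.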

Generate the Trinotate Annotation Report

% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite report > Trinotate.xls

View the report

% less Trinotate.xls

#gene_id        transcript_id   sprot_Top_BLASTX_hit    RNAMMER prot_id prot_coords     sprot_Top_BLASTP_hit    Pfam    SignalP TmHMM   eggnog  Kegg    gene_ontology_blast     gene_ontology_pfam      transcript      peptide
TRINITY_DN144_c0_g1     TRINITY_DN144_c0_g1_i1  MYO2_YEAST^MYO2_YEAST^Q:4762-107,H:1-1561^76.835%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces  .       TRINITY_DN144_c0_g1::TRINITY_DN144_c0_g1_i1::g.1::m.1   62-4831[-]      MYO2_YEAST^MYO2_YEAST^Q:24-1583,H:1-1569^77.016%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces   PF02736.16^Myosin_N^Myosin N-terminal SH3-like domain^29-69^E:0.00035`PF00063.18^Myosin_head^Myosin head (motor domain)^95-791^E:2.5e-283`PF01843.16^DIL^DIL domain^1399-1498^E:7e-26   .       .       .       KEGG:sce:YOR326W`KO:K10357      GO:0032432^cellular_component^actin filament bundle`GO:0005935^cellular_component^cellular bud neck`GO:0005934^cellular_component^cellular bud tip`GO:0031941^cellular_component^filamentous actin`GO:0000131^cellular_component^incipient cellular bud site`GO:0043332^cellular_component^mating projection tip`GO:0071563^cellular_component^Myo2p-Vac17p-Vac8p transport complex`GO:0016459^cellular_component^myosin complex`GO:0030133^cellular_component^transport vesicle`GO:0031982^cellular_component^vesicle`GO:0051015^molecular_function^actin filament binding`GO:0005524^molecular_function^ATP binding`GO:0005516^molecular_function^calmodulin binding`GO:0000146^molecular_function^microfilament motor activity`GO:0007118^biological_process^budding cell apical bud growth`GO:0000132^biological_process^establishment of mitotic spindle orientation`GO:0048313^biological_process^Golgi inheritance`GO:0007107^biological_process^membrane addition at site of cytokinesis`GO:0000001^biological_process^mitochondrion inheritance`GO:0045033^biological_process^peroxisome inheritance`GO:0015031^biological_process^protein transport`GO:0009826^biological_process^unidimensional cell growth`GO:0000011^biological_process^vacuole inheritance`GO:0030050^biological_process^vesicle transport along actin filament`GO:0016192^biological_process^vesicle-mediated transport  GO:0003774^molecular_function^motor activity`GO:0005524^molecular_function^ATP binding`GO:0016459^cellular_component^myosin complex     .       .
TRINITY_DN144_c0_g2     TRINITY_DN144_c0_g2_i1  MYO2_YEAST^MYO2_YEAST^Q:2218-107,H:850-1561^69.638%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces        .       .       .       .       .       .       .       .    KEGG:sce:YOR326W`KO:K10357      GO:0032432^cellular_component^actin filament bundle`GO:0005935^cellular_component^cellular bud neck`GO:0005934^cellular_component^cellular bud tip`GO:0031941^cellular_component^filamentous actin`GO:0000131^cellular_component^incipient cellular bud site`GO:0043332^cellular_component^mating projection tip`GO:0071563^cellular_component^Myo2p-Vac17p-Vac8p transport complex`GO:0016459^cellular_component^myosin complex`GO:0030133^cellular_component^transport vesicle`GO:0031982^cellular_component^vesicle`GO:0051015^molecular_function^actin filament binding`GO:0005524^molecular_function^ATP binding`GO:0005516^molecular_function^calmodulin binding`GO:0000146^molecular_function^microfilament motor activity`GO:0007118^biological_process^budding cell apical bud growth`GO:0000132^biological_process^establishment of mitotic spindle orientation`GO:0048313^biological_process^Golgi inheritance`GO:0007107^biological_process^membrane addition at site of cytokinesis`GO:0000001^biological_process^mitochondrion inheritance`GO:0045033^biological_process^peroxisome inheritance`GO:0015031^biological_process^protein transport`GO:0009826^biological_process^unidimensional cell growth`GO:0000011^biological_process^vacuole inheritance`GO:0030050^biological_process^vesicle transport along actin filament`GO:0016192^biological_process^vesicle-mediated transport  .       .       .
TRINITY_DN179_c0_g1     TRINITY_DN179_c0_g1_i1  SNF6_YEAST^SNF6_YEAST^Q:352-843,H:79-229^32.927%ID^E:4.87e-16^RecName: Full=Transcription regulatory protein SNF6;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces`SNF6_YEAST^SNF6_YEAST^Q:1252-1338,H:304-332^72.414%ID^E:2.29e-08^RecName: Full=Transcription regulatory protein SNF6;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces  .       .       .       .       .       .       .       .       KEGG:sce:YHL025W`KO:K11773      GO:0005829^cellular_component^cytosol`GO:0005634^cellular_component^nucleus`GO:0016514^cellular_component^SWI/SNF complex`GO:0006338^biological_process^chromatin remodeling`GO:0006289^biological_process^nucleotide-excision repair`GO:0045944^biological_process^positive regulation of transcription from RNA polymerase II promoter`GO:0005987^biological_process^sucrose catabolic process`GO:0006351^biological_process^transcription, DNA-templated     .       .       .
TRINITY_DN159_c0_g1     TRINITY_DN159_c0_g1_i1  YL419_YEAST^YL419_YEAST^Q:488-3,H:39-202^68.902%ID^E:5.15e-75^RecName: Full=Putative ATP-dependent RNA helicase YLR419W;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces .       .       .       .       .       .       .       .       KEGG:sce:YLR419W        GO:0005737^cellular_component^cytoplasm`GO:0005524^molecular_function^ATP binding`GO:0004004^molecular_function^ATP-dependent RNA helicase activity`GO:0003676^molecular_function^nucleic acid binding`GO:0006396^biological_process^RNA processing     .       .      .
TRINITY_DN159_c0_g2     TRINITY_DN159_c0_g2_i1  YL419_YEAST^YL419_YEAST^Q:76-2,H:39-63^72%ID^E:1.98e-06^RecName: Full=Putative ATP-dependent RNA helicase YLR419W;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces       .       .       .       .        .       .       .       .       KEGG:sce:YLR419W        GO:0005737^cellular_component^cytoplasm`GO:0005524^molecular_function^ATP binding`GO:0004004^molecular_function^ATP-dependent RNA helicase activity`GO:0003676^molecular_function^nucleic acid binding`GO:0006396^biological_process^RNA processing     .       .     .
TRINITY_DN153_c0_g1     TRINITY_DN153_c0_g1_i1  SDA1_YEAST^SDA1_YEAST^Q:2334-100,H:1-767^76.295%ID^E:0^RecName: Full=Protein SDA1;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces       .       .       .       .       .       .       .       .    KEGG:sce:YGR245C`KO:K14856      GO:0005730^cellular_component^nucleolus`GO:0005634^cellular_component^nucleus`GO:0030036^biological_process^actin cytoskeleton organization`GO:0007049^biological_process^cell cycle`GO:0015031^biological_process^protein transport`GO:0042273^biological_process^ribosomal large subunit biogenesis`GO:0000055^biological_process^ribosomal large subunit export from nucleus`GO:0007089^biological_process^traversing start control point of mitotic cell cycle      .       .       .
TRINITY_DN130_c0_g1     TRINITY_DN130_c0_g1_i1  HS150_YEAS6^HS150_YEAS6^Q:213-1,H:26-108^63.855%ID^E:3.71e-23^RecName: Full=Cell wall mannoprotein HSP150;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces       .       .       .       .       .     .       .       .       .       GO:0005618^cellular_component^cell wall`GO:0005576^cellular_component^extracellular region`GO:0005199^molecular_function^structural constituent of cell wall`GO:0071555^biological_process^cell wall organization       .       .       .

The above file can be very large. It's often useful to load it into a spreadsheet tool such as MS-Excel. If you have a transcript identifier of interest, you can always just 'grep' for it to pull that transcript's annotation out of the report. Below, we'll use TrinotateWeb to explore these data interactively in a web browser.
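The lookup described above can be sketched with two small shell functions. Both names (`annot_row`, `go_terms`) are hypothetical helpers, not Trinotate utilities. The sketch assumes the report is tab-delimited with the header line shown above, that the transcript identifier sits in column 2, and that multiple entries within a field are separated by backticks, as in the sample rows.

```shell
# annot_row REPORT TRANSCRIPT_ID
# Print the report row(s) whose transcript_id (column 2) equals the given ID.
# Matching the whole column avoids the partial matches a plain grep can give
# (for example, an ID ending in "_i1" also matching one ending in "_i10").
annot_row() {
    awk -F'\t' -v id="$2" '$2 == id' "$1"
}

# go_terms REPORT TRANSCRIPT_ID
# List the BLAST-derived GO terms for one transcript, one term per line.
# The column is located by its header name, "gene_ontology_blast".
go_terms() {
    awk -F'\t' -v id="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "gene_ontology_blast") col = i
            next
        }
        col && $2 == id && $col != "." {
            n = split($col, terms, "`")
            for (j = 1; j <= n; j++) print terms[j]
        }
    ' "$1"
}

# Example calls:
#   annot_row Trinotate.xls TRINITY_DN144_c0_g1_i1
#   go_terms  Trinotate.xls TRINITY_DN144_c0_g1_i1
```

A transcript with more than one report row would have its GO terms printed once per row; pipe the output through `sort -u` if that matters.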

Let's use the annotation attributes for the transcripts here as 'names' for the transcripts in the Trinotate database. This will be useful later when using the TrinotateWeb framework.

%  /usr/local/src/Trinotate/util/annotation_importer/import_transcript_names.pl \
      Trinotate.sqlite Trinotate.xls

Nothing exciting to see in running the above command, but know that it's helpful for later on.

Interactively Explore Expression and Annotations in TrinotateWeb

Earlier, we generated large sets of tab-delimited files containing lots of data: annotations for transcripts, matrices of expression values, lists of differentially expressed transcripts, etc. We also generated a number of plots in PDF format. These are all useful, but they're not interactive, and extracting information of interest during a study is often difficult and cumbersome. We're developing TrinotateWeb as a web-based interactive system to solve some of these challenges. TrinotateWeb provides heatmaps and various plots of expression data, and includes search functions to quickly access information of interest. Below, we will load the additional information that we need into our Trinotate database, and then run TrinotateWeb and start exploring our data in a web browser.

Populate the expression data into the Trinotate database

We should still be in our 'workspace/Trinotate' directory for the following.

Load in the transcript and gene-level expression data stored in the matrices we built earlier:

# load transcript expression data
%  /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
        --sqlite Trinotate.sqlite \
        --transcript_mode \
        --samples_file ../samples.txt \
        --count_matrix ../Trinity_trans.isoform.counts.matrix \
        --fpkm_matrix ../Trinity_trans.isoform.TMM.EXPR.matrix

# load gene expression data
%  /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
        --sqlite Trinotate.sqlite \
        --gene_mode \
        --samples_file ../samples.txt \
        --count_matrix ../Trinity_trans.gene.counts.matrix \
        --fpkm_matrix ../Trinity_trans.gene.TMM.EXPR.matrix
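The import commands above take both a samples file and expression matrices, so the two need to agree on replicate names. The following is a minimal sketch of a consistency check, not part of Trinotate. It assumes the usual Trinity layout: a tab-delimited samples file with the replicate name in column 2, and a tab-delimited matrix whose first line holds the replicate names as column headers. The function name `missing_replicates` is hypothetical.

```shell
# missing_replicates SAMPLES_FILE MATRIX
# Print every replicate name from column 2 of the samples file that is not
# among the column headers on the first line of the matrix.
missing_replicates() {
    header=$(head -n 1 "$2")
    awk -F'\t' -v header="$header" '
        BEGIN {
            n = split(header, h, "\t")
            for (i = 1; i <= n; i++) seen[h[i]] = 1
        }
        NF >= 2 && !($2 in seen) { print $2 }
    ' "$1"
}

# Example call, using the file names from the commands above:
#   missing_replicates ../samples.txt ../Trinity_trans.isoform.counts.matrix
```

No output means every replicate listed in the samples file has a matching matrix column.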

Import the DE results from our edgeR result directories:

# load transcript DE data:
%  /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
       --sqlite Trinotate.sqlite \
       --transcript_mode \
       --samples_file ../samples.txt \
       --DE_dir ../edgeR_trans

# load gene DE data:
%  /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
       --sqlite Trinotate.sqlite \
       --gene_mode \
       --samples_file ../samples.txt \
       --DE_dir ../edgeR_genes

At this point, the Trinotate database should be fully populated and ready to be used by TrinotateWeb.

Launch and Surf TrinotateWeb

TrinotateWeb is web-based software that runs locally, on the same hardware where we've been running all our computes (unlike the typical websites you visit regularly, such as Facebook).

Visit your TrinotateWeb portal through your existing Apache connection like so:

http://${YOUR_IP_ADDRESS}:${YOUR_PORT_NUMBER}/cgi-bin/index.cgi
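If you prefer to assemble the address on the command line, the URL pattern above can be filled in as follows. This is only a sketch; the function name `trinotateweb_url` is hypothetical, and the host and port are whatever values you were given for your own instance.

```shell
# trinotateweb_url HOST PORT
# Print the TrinotateWeb portal URL for the given host and port,
# following the pattern shown above.
trinotateweb_url() {
    printf 'http://%s:%s/cgi-bin/index.cgi\n' "$1" "$2"
}

# Example call:
#   trinotateweb_url "$YOUR_IP_ADDRESS" "$YOUR_PORT_NUMBER"
# Optional, assuming curl is installed: fetch only the response headers to
# see whether the web server answers at that address.
#   curl -sI "$(trinotateweb_url "$YOUR_IP_ADDRESS" "$YOUR_PORT_NUMBER")"
```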

You should see a web form like so:

In the text box, put the path to your Trinotate.sqlite database, as shown above ("/home/training/workspace/Trinotate/Trinotate.sqlite"). Click 'Submit'.

You should now have TrinotateWeb running and serving the content in your Trinotate database:

Take some time to click the various tabs and explore what's available.

e.g., under 'Annotation Keyword Search', search for 'transporter'.

e.g., under 'Differential Expression', examine your earlier-defined transcript clusters. Also, launch MA or Volcano plots to explore the DE data.

We will explore TrinotateWeb functionality together as a group.

Navigating expression clusters in TrinotateWeb

Import the clusters of transcripts that we defined earlier into the Trinotate database like so:

%  /usr/local/src/Trinotate/util/transcript_expression/import_transcript_clusters.pl \
      --sqlite Trinotate.sqlite \
      --group_name DE_all_vs_all \
      --analysis_name diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60 \
         ../edgeR_trans/diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60/*matrix
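The command above relies on the shell expanding `*matrix` to the per-cluster files. As a rough sketch (the function name `list_cluster_matrices` is hypothetical), you can list what that glob would match before running the import, so an empty or mistyped directory is caught early:

```shell
# list_cluster_matrices DIR
# Print each file in DIR whose name ends in "matrix", i.e. the files that
# the *matrix glob in the import command would pass along.
# Prints nothing if the directory has no such files.
list_cluster_matrices() {
    for f in "$1"/*matrix; do
        # When nothing matches, the unexpanded pattern is left in $f;
        # the existence test filters that case out.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# Example call, using the directory from the command above:
#   list_cluster_matrices ../edgeR_trans/diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60
```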

Now, if you revisit your TrinotateWeb portal, you should find an entry for this set of clusters, loaded and available for inspection.