trinotate - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki
Preparing and Generating a Trinotate Annotation Report
Generating a Trinotate annotation report involves first loading all of our bioinformatics results into a Trinotate SQLite database. The Trinotate software provides a boilerplate SQLite database called 'Trinotate.sqlite' that comes pre-populated with generic data about SwissProt records and Pfam domains. Below, we'll populate this database with all of our bioinformatics computes and our expression data.
Preparing Trinotate (loading the database)
As a sanity check, be sure you're currently located in your 'Trinotate/' working directory.
% pwd
.
/home/training/workspace/Trinotate
Trinotate stores its data in an SQLite relational database. A 'boilerplate' database is available that comes preloaded with functional data from SwissProt, Pfam, and other data sources.
Copy the boilerplate database over to your Trinotate/ area:
% cp ~/shared_ro/Trinotate_v3.sqlite Trinotate.sqlite
Next, load your Trinotate.sqlite database with your Trinity transcripts and predicted protein sequences:
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite init \
--gene_trans_map ../Trinity.fasta.gene_trans_map \
--transcript_fasta ../Trinity.fasta \
--transdecoder_pep Trinity.fasta.transdecoder.pep
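The init step relies on the gene-to-transcript map and the transcript FASTA file describing the same set of transcripts. The following is a minimal sketch of a consistency check, not part of Trinotate itself. The function name `missing_from_map` is hypothetical, and the sketch assumes the map has two whitespace-separated columns (gene ID, then transcript ID) and that each FASTA header begins with '>' followed directly by the transcript ID.

```shell
# Hypothetical helper (not a Trinotate command): print transcript IDs that
# appear in the FASTA file but are absent from the gene-to-transcript map.
# Assumptions: map lines look like "gene_id<TAB>transcript_id", and FASTA
# headers look like ">transcript_id optional description".
missing_from_map() {
    local fasta="$1" map="$2"
    awk '
        # First file (the map): remember every transcript ID in column 2.
        FILENAME == ARGV[1] { in_map[$2] = 1; next }
        # Second file (the FASTA): check the ID on each header line.
        /^>/ {
            id = substr($1, 2)            # drop the leading ">"
            if (!(id in in_map)) print id
        }
    ' "$map" "$fasta"
}
```

Once the function is defined in your shell, running it on the files used above should print nothing if every transcript has a map entry:

% missing_from_map ../Trinity.fasta ../Trinity.fasta.gene_trans_map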
Load in the various outputs generated earlier:
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite \
LOAD_swissprot_blastx swissprot.blastx.outfmt6
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite \
LOAD_swissprot_blastp swissprot.blastp.outfmt6
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_tmhmm tmhmm.out
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite LOAD_signalp signalp.out
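Before running the LOAD_* commands above, it can help to confirm that each result file is actually present in the working directory. The sketch below is one way to do that; `check_inputs` is a hypothetical helper name, not a Trinotate feature. Note that an empty file is flagged too, although an empty file may simply mean a tool found nothing to report.

```shell
# Hypothetical helper: report every listed file that is missing or empty.
# Returns 0 if all files exist and are non-empty, 1 otherwise.
check_inputs() {
    local f status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then            # -s: exists and has size > 0
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the files loaded in this section, the call would look like:

% check_inputs swissprot.blastx.outfmt6 swissprot.blastp.outfmt6 TrinotatePFAM.out tmhmm.out signalp.out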
Generate the Trinotate Annotation Report
% /usr/local/src/Trinotate/Trinotate Trinotate.sqlite report > Trinotate.xls
View the report
% less Trinotate.xls
.
#gene_id transcript_id sprot_Top_BLASTX_hit RNAMMER prot_id prot_coords sprot_Top_BLASTP_hit Pfam SignalP TmHMM eggnog Kegg gene_ontology_blast gene_ontology_pfam transcript peptide
TRINITY_DN144_c0_g1 TRINITY_DN144_c0_g1_i1 MYO2_YEAST^MYO2_YEAST^Q:4762-107,H:1-1561^76.835%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . TRINITY_DN144_c0_g1::TRINITY_DN144_c0_g1_i1::g.1::m.1 62-4831[-] MYO2_YEAST^MYO2_YEAST^Q:24-1583,H:1-1569^77.016%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces PF02736.16^Myosin_N^Myosin N-terminal SH3-like domain^29-69^E:0.00035`PF00063.18^Myosin_head^Myosin head (motor domain)^95-791^E:2.5e-283`PF01843.16^DIL^DIL domain^1399-1498^E:7e-26 . . . KEGG:sce:YOR326W`KO:K10357 GO:0032432^cellular_component^actin filament bundle`GO:0005935^cellular_component^cellular bud neck`GO:0005934^cellular_component^cellular bud tip`GO:0031941^cellular_component^filamentous actin`GO:0000131^cellular_component^incipient cellular bud site`GO:0043332^cellular_component^mating projection tip`GO:0071563^cellular_component^Myo2p-Vac17p-Vac8p transport complex`GO:0016459^cellular_component^myosin complex`GO:0030133^cellular_component^transport vesicle`GO:0031982^cellular_component^vesicle`GO:0051015^molecular_function^actin filament binding`GO:0005524^molecular_function^ATP binding`GO:0005516^molecular_function^calmodulin binding`GO:0000146^molecular_function^microfilament motor activity`GO:0007118^biological_process^budding cell apical bud growth`GO:0000132^biological_process^establishment of mitotic spindle orientation`GO:0048313^biological_process^Golgi inheritance`GO:0007107^biological_process^membrane addition at site of cytokinesis`GO:0000001^biological_process^mitochondrion inheritance`GO:0045033^biological_process^peroxisome inheritance`GO:0015031^biological_process^protein transport`GO:0009826^biological_process^unidimensional cell growth`GO:0000011^biological_process^vacuole inheritance`GO:0030050^biological_process^vesicle transport along actin filament`GO:0016192^biological_process^vesicle-mediated transport GO:0003774^molecular_function^motor activity`GO:0005524^molecular_function^ATP binding`GO:0016459^cellular_component^myosin complex . .
TRINITY_DN144_c0_g2 TRINITY_DN144_c0_g2_i1 MYO2_YEAST^MYO2_YEAST^Q:2218-107,H:850-1561^69.638%ID^E:0^RecName: Full=Myosin-2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . KEGG:sce:YOR326W`KO:K10357 GO:0032432^cellular_component^actin filament bundle`GO:0005935^cellular_component^cellular bud neck`GO:0005934^cellular_component^cellular bud tip`GO:0031941^cellular_component^filamentous actin`GO:0000131^cellular_component^incipient cellular bud site`GO:0043332^cellular_component^mating projection tip`GO:0071563^cellular_component^Myo2p-Vac17p-Vac8p transport complex`GO:0016459^cellular_component^myosin complex`GO:0030133^cellular_component^transport vesicle`GO:0031982^cellular_component^vesicle`GO:0051015^molecular_function^actin filament binding`GO:0005524^molecular_function^ATP binding`GO:0005516^molecular_function^calmodulin binding`GO:0000146^molecular_function^microfilament motor activity`GO:0007118^biological_process^budding cell apical bud growth`GO:0000132^biological_process^establishment of mitotic spindle orientation`GO:0048313^biological_process^Golgi inheritance`GO:0007107^biological_process^membrane addition at site of cytokinesis`GO:0000001^biological_process^mitochondrion inheritance`GO:0045033^biological_process^peroxisome inheritance`GO:0015031^biological_process^protein transport`GO:0009826^biological_process^unidimensional cell growth`GO:0000011^biological_process^vacuole inheritance`GO:0030050^biological_process^vesicle transport along actin filament`GO:0016192^biological_process^vesicle-mediated transport . . .
TRINITY_DN179_c0_g1 TRINITY_DN179_c0_g1_i1 SNF6_YEAST^SNF6_YEAST^Q:352-843,H:79-229^32.927%ID^E:4.87e-16^RecName: Full=Transcription regulatory protein SNF6;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces`SNF6_YEAST^SNF6_YEAST^Q:1252-1338,H:304-332^72.414%ID^E:2.29e-08^RecName: Full=Transcription regulatory protein SNF6;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . KEGG:sce:YHL025W`KO:K11773 GO:0005829^cellular_component^cytosol`GO:0005634^cellular_component^nucleus`GO:0016514^cellular_component^SWI/SNF complex`GO:0006338^biological_process^chromatin remodeling`GO:0006289^biological_process^nucleotide-excision repair`GO:0045944^biological_process^positive regulation of transcription from RNA polymerase II promoter`GO:0005987^biological_process^sucrose catabolic process`GO:0006351^biological_process^transcription, DNA-templated . . .
TRINITY_DN159_c0_g1 TRINITY_DN159_c0_g1_i1 YL419_YEAST^YL419_YEAST^Q:488-3,H:39-202^68.902%ID^E:5.15e-75^RecName: Full=Putative ATP-dependent RNA helicase YLR419W;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . KEGG:sce:YLR419W GO:0005737^cellular_component^cytoplasm`GO:0005524^molecular_function^ATP binding`GO:0004004^molecular_function^ATP-dependent RNA helicase activity`GO:0003676^molecular_function^nucleic acid binding`GO:0006396^biological_process^RNA processing . . .
TRINITY_DN159_c0_g2 TRINITY_DN159_c0_g2_i1 YL419_YEAST^YL419_YEAST^Q:76-2,H:39-63^72%ID^E:1.98e-06^RecName: Full=Putative ATP-dependent RNA helicase YLR419W;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . KEGG:sce:YLR419W GO:0005737^cellular_component^cytoplasm`GO:0005524^molecular_function^ATP binding`GO:0004004^molecular_function^ATP-dependent RNA helicase activity`GO:0003676^molecular_function^nucleic acid binding`GO:0006396^biological_process^RNA processing . . .
TRINITY_DN153_c0_g1 TRINITY_DN153_c0_g1_i1 SDA1_YEAST^SDA1_YEAST^Q:2334-100,H:1-767^76.295%ID^E:0^RecName: Full=Protein SDA1;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . KEGG:sce:YGR245C`KO:K14856 GO:0005730^cellular_component^nucleolus`GO:0005634^cellular_component^nucleus`GO:0030036^biological_process^actin cytoskeleton organization`GO:0007049^biological_process^cell cycle`GO:0015031^biological_process^protein transport`GO:0042273^biological_process^ribosomal large subunit biogenesis`GO:0000055^biological_process^ribosomal large subunit export from nucleus`GO:0007089^biological_process^traversing start control point of mitotic cell cycle . . .
TRINITY_DN130_c0_g1 TRINITY_DN130_c0_g1_i1 HS150_YEAS6^HS150_YEAS6^Q:213-1,H:26-108^63.855%ID^E:3.71e-23^RecName: Full=Cell wall mannoprotein HSP150;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces . . . . . . . . . GO:0005618^cellular_component^cell wall`GO:0005576^cellular_component^extracellular region`GO:0005199^molecular_function^structural constituent of cell wall`GO:0071555^biological_process^cell wall organization . . .
The above file can be very large. Despite its '.xls' extension, it is plain tab-delimited text, and it's often useful to load it into spreadsheet software such as MS Excel. If you have a transcript identifier of interest, you can always just 'grep' to pull out the annotation for that transcript from this report. We'll use TrinotateWeb to interactively explore these data in a web browser below.
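As a sketch of the command-line lookup just described, the helpers below pull out one transcript's annotation from the report. The function names are hypothetical, and the column positions (transcript_id in column 2, gene_ontology_blast in column 13) are assumptions taken from the header line shown above. They match on the transcript_id column exactly, because a plain 'grep' for 'TRINITY_DN144_c0_g1_i1' would also match an ID such as 'TRINITY_DN144_c0_g1_i10'.

```shell
# Hypothetical helper: print the header line plus the row(s) whose
# transcript_id (column 2) equals the requested ID exactly.
annot_for_transcript() {
    local report="$1" id="$2"
    awk -F'\t' -v id="$id" 'NR == 1 || $2 == id' "$report"
}

# Hypothetical helper: print the BLAST-derived GO terms (column 13) for one
# transcript, one term per line. Multiple values in a report field are
# separated by backticks; "." marks an empty field.
go_terms_for_transcript() {
    local report="$1" id="$2"
    awk -F'\t' -v id="$id" '
        $2 == id && $13 != "." {
            n = split($13, terms, "`")
            for (i = 1; i <= n; i++) print terms[i]
        }
    ' "$report"
}
```

For example, once the functions are defined in your shell:

% annot_for_transcript Trinotate.xls TRINITY_DN144_c0_g1_i1
% go_terms_for_transcript Trinotate.xls TRINITY_DN144_c0_g1_i1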
Let's use the annotation attributes for the transcripts here as 'names' for the transcripts in the Trinotate database. This will be useful later when using the TrinotateWeb framework.
% /usr/local/src/Trinotate/util/annotation_importer/import_transcript_names.pl \
Trinotate.sqlite Trinotate.xls
Nothing exciting to see in running the above command, but know that it's helpful for later on.
Interactively Explore Expression and Annotations in TrinotateWeb
Earlier, we generated large sets of tab-delimited files containing lots of data: annotations for transcripts, matrices of expression values, lists of differentially expressed transcripts, etc. We also generated a number of plots in PDF format. These are all useful, but they're not interactive, and it's often difficult and cumbersome to extract information of interest during a study. We're developing TrinotateWeb as a web-based interactive system to solve some of these challenges. TrinotateWeb provides heatmaps and various plots of expression data, and includes search functions to quickly access information of interest. Below, we will populate some of the additional information that we need into our Trinotate database, and then run TrinotateWeb and start exploring our data in a web browser.
Populate the expression data into the Trinotate database
We should still be in our 'workspace/Trinotate' directory for the following.
Load in the transcript and gene-level expression data stored in the matrices we built earlier:
# load transcript expression data
% /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
--sqlite Trinotate.sqlite \
--transcript_mode \
--samples_file ../samples.txt \
--count_matrix ../Trinity_trans.isoform.counts.matrix \
--fpkm_matrix ../Trinity_trans.isoform.TMM.EXPR.matrix
# load gene expression data
% /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
--sqlite Trinotate.sqlite \
--gene_mode \
--samples_file ../samples.txt \
--count_matrix ../Trinity_trans.gene.counts.matrix \
--fpkm_matrix ../Trinity_trans.gene.TMM.EXPR.matrix
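The import commands above combine a samples file with expression matrices, so the replicate names in the two need to agree. The sketch below checks this under stated assumptions: the samples file is tab-delimited with the replicate name in column 2, and the first line of each matrix is a tab-delimited header of replicate names. The function name `replicates_missing_from_matrix` is hypothetical and not part of Trinotate.

```shell
# Hypothetical helper: print replicate names listed in the samples file
# (column 2) that do not appear in the header line of the matrix.
replicates_missing_from_matrix() {
    local samples="$1" matrix="$2" header
    header=$(head -n 1 "$matrix")         # only the header line is needed
    awk -F'\t' -v header="$header" '
        BEGIN {
            n = split(header, cols, "\t")
            for (i = 1; i <= n; i++)
                if (cols[i] != "") in_matrix[cols[i]] = 1
        }
        NF >= 2 && !($2 in in_matrix) { print $2 }
    ' "$samples"
}
```

No output means every replicate in the samples file has a matching matrix column:

% replicates_missing_from_matrix ../samples.txt ../Trinity_trans.isoform.counts.matrix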
Import the DE results from our edgeR result directories:
# load transcript DE data:
% /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
--sqlite Trinotate.sqlite \
--transcript_mode \
--samples_file ../samples.txt \
--DE_dir ../edgeR_trans
# load gene DE data:
% /usr/local/src/Trinotate/util/transcript_expression/import_expression_and_DE_results.pl \
--sqlite Trinotate.sqlite \
--gene_mode \
--samples_file ../samples.txt \
--DE_dir ../edgeR_genes
At this point, the Trinotate database should be fully populated and ready to be used by TrinotateWeb.
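One way to confirm that the loads did something is to count the rows in each table of the database. The sketch below assumes the 'sqlite3' command-line client is installed. It makes no assumption about Trinotate's table names, because it reads them from SQLite's own 'sqlite_master' catalog. The function name `table_row_counts` is hypothetical.

```shell
# Hypothetical helper: print "table<TAB>row_count" for every table in an
# SQLite database, using the sqlite3 command-line client.
table_row_counts() {
    local db="$1" table
    sqlite3 "$db" "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;" |
    while IFS= read -r table; do
        # stdin is redirected so the inner call cannot consume the table list.
        printf '%s\t%s\n' "$table" \
            "$(sqlite3 "$db" "SELECT COUNT(*) FROM \"$table\";" </dev/null)"
    done
}
```

Tables that still show a count of zero after the steps above may point to a load that was skipped or failed:

% table_row_counts Trinotate.sqlite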
Launch and Surf TrinotateWeb
TrinotateWeb is web-based software that runs locally on the same hardware we've been using for all our computes (as opposed to the typical websites you visit regularly, such as Facebook).
Visit your TrinotateWeb portal through your existing Apache connection like so:
http://${YOUR_IP_ADDRESS}:${YOUR_PORT_NUMBER}/cgi-bin/index.cgi
You should see a web form like so:
In the text box, put the path to your Trinotate.sqlite database, as shown above ("/home/training/workspace/Trinotate/Trinotate.sqlite"). Click 'Submit'.
You should now have TrinotateWeb running and serving the content in your Trinotate database:
Take some time to click the various tabs and explore what's available.
e.g., under 'Annotation Keyword Search', search for 'transporter'.
e.g., under 'Differential Expression', examine your earlier-defined transcript clusters. Also, launch MA or Volcano plots to explore the DE data.
We will explore TrinotateWeb functionality together as a group.
Navigating expression clusters in TrinotateWeb
Import the transcript clusters defined earlier into the Trinotate database like so:
% /usr/local/src/Trinotate/util/transcript_expression/import_transcript_clusters.pl \
--sqlite Trinotate.sqlite \
--group_name DE_all_vs_all \
--analysis_name diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60 \
../edgeR_trans/diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60/*matrix
Now, if you revisit your TrinotateWeb portal, you should find this set of clusters loaded and available for inspection.
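The import command above picks up its input through the '*matrix' glob, so it can be worth checking what that glob actually matches. The sketch below lists each matching file with its number of data rows. It assumes each cluster matrix has exactly one header line followed by one row per transcript, and `cluster_sizes` is a hypothetical helper name.

```shell
# Hypothetical helper: for each file in a directory whose name ends in
# "matrix", print "file_name<TAB>data_rows" (line count minus one header).
cluster_sizes() {
    local dir="$1" f
    for f in "$dir"/*matrix; do
        [ -e "$f" ] || continue           # the glob matched nothing
        printf '%s\t%d\n' "$(basename "$f")" "$(( $(wc -l < "$f") - 1 ))"
    done
}
```

For the cluster directory used above:

% cluster_sizes ../edgeR_trans/diffExpr.P1e-3_C2.matrix.RData.clusters_fixed_P_60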