Software installation and data required - Trinotate/Trinotate GitHub Wiki

1. Software Required

If you can use Docker or Singularity, no software installation is required (except for TmHMM and SignalP, due to licensing restrictions). See the Trinotate Docker/Singularity guide for more information.

Trinotate

Download Trinotate here

Trinity

Trinity includes support for expression and DE analysis using RSEM and Bioconductor (download here)

Note: Trinity is not strictly required. Trinotate can be used with other sources of transcript data as long as suitable inputs are available; see Inputs to Trinotate.

TransDecoder for predicting coding regions in transcripts

Download TransDecoder here

SQLite (required for database integration):

http://www.sqlite.org/

NCBI BLAST+: BLAST database homology search (download the latest version here).

http://www.ncbi.nlm.nih.gov/books/NBK52640/

Diamond Blast: Ultra-fast alternative to NCBI BLAST

https://github.com/bbuchfink/diamond

Below are optional but recommended:

HMMER/PFAM Protein Domain Identification:

http://hmmer.org/

Infernal for Noncoding RNA Identification:

http://eddylab.org/infernal/

signalP v6 (free academic download)

https://services.healthtech.dtu.dk/service.php?SignalP

tmhmm v2 (free academic download)

https://services.healthtech.dtu.dk/service.php?TMHMM-2.0

Note: the latest version of TmHMM is DeepTMHMM (https://dtu.biolib.com/DeepTMHMM). Trinotate is compatible with it as well, but it is far more complex to install and slower to run than tmhmm v2, so we will continue to support tmhmm v2 for now.

Edit the header lines of the scripts `tmhmm` and `tmhmmformat.pl` to read exactly as:

    #!/usr/bin/env perl

Then, edit line 33 of `tmhmm`, changing it from:

    #$opt_basedir = "/usr/cbs/packages/tmhmm/2.0c/tmhmm-2.0c/";

to:

    $opt_basedir = "/path/to/your/directory/containing/basedir/tmhmm-2.0c/";

That is, remove the leading comment character '#' and set the path to the directory where you installed the software.
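The two edits described above can also be scripted. The following is only a sketch and is not part of the TMHMM or Trinotate distributions: `patch_tmhmm` is a hypothetical helper name, and it assumes that both scripts sit in the `bin/` subdirectory of the unpacked tmhmm-2.0c directory and that line 33 of `tmhmm` really is the commented-out `$opt_basedir` assignment. Check both assumptions against your copy before running it.

```shell
# Hypothetical helper: patch an unpacked TMHMM 2.0c directory in place.
# Usage: patch_tmhmm /path/to/tmhmm-2.0c
patch_tmhmm() {
    tmhmm_dir=$1

    # Assumption: both scripts live in "$tmhmm_dir/bin".
    for script in "$tmhmm_dir/bin/tmhmm" "$tmhmm_dir/bin/tmhmmformat.pl"; do
        # Rewrite line 1 so that perl is located through env.
        sed '1s|.*|#!/usr/bin/env perl|' "$script" > "$script.tmp" || return 1
        # Writing back with cat (not mv) keeps the file's executable bit.
        cat "$script.tmp" > "$script" || return 1
        rm "$script.tmp"
    done

    # Assumption: line 33 of tmhmm is the commented-out $opt_basedir line.
    script="$tmhmm_dir/bin/tmhmm"
    sed "33s|.*|\$opt_basedir = \"$tmhmm_dir/\";|" "$script" > "$script.tmp" || return 1
    cat "$script.tmp" > "$script" || return 1
    rm "$script.tmp"
}
```

After calling, for example, `patch_tmhmm /path/to/tmhmm-2.0c`, inspect line 1 of both scripts and line 33 of `tmhmm` to confirm the result. The sketch uses `|` as the sed delimiter, so it will not work for install paths that contain that character.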

2. Sequence Databases Required

Trinotate relies heavily on SwissProt and Pfam, and custom protein files for use specifically with Trinotate are generated as described below. You obtain these protein database files by running the Trinotate build process. This step downloads several data resources, including the most current versions of SwissProt, Pfam, and other companion resources; creates and populates a Trinotate boilerplate SQLite database; and yields the uniprot_sprot.pep file to be used with BLAST and the Pfam-A.hmm.gz file to be used for Pfam searches. Run the build process like so:

$TRINOTATE_HOME/Trinotate --create \
                          --db myTrinotate.sqlite \
                          --trinotate_data_dir /path/to/TRINOTATE_DATA_DIR \
                          --use_diamond

where /path/to/TRINOTATE_DATA_DIR is the directory where you want all the Trinotate data resources to be installed. It should not be your current working directory, but it can be a destination directory within your current working directory (i.e. $PWD != /path/to/TRINOTATE_DATA_DIR).

The --db myTrinotate.sqlite database (or whatever you name it based on your target transcriptome) is the one to use for the subsequent Trinotate commands below.

Include --use_diamond if you plan to use Diamond Blast with Trinotate; otherwise NCBI BLAST+ will be used for database preparation.

Once the command completes, it will have created the 'myTrinotate.sqlite' database in your current working directory, and you'll find resources added to the TRINOTATE_DATA_DIR including:

uniprot_sprot.pep
Pfam-A.hmm.gz
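A quick check after the build can catch the two most common problems described above: a data directory that is the same as the working directory, and a build that did not produce the expected files. This is a minimal sketch; `check_trinotate_data_dir` is a hypothetical helper (not a Trinotate command), and it only looks for the two files named above, not for every resource the build may install.

```shell
# Hypothetical helper: sanity-check a Trinotate data directory.
# Usage: check_trinotate_data_dir /path/to/TRINOTATE_DATA_DIR
check_trinotate_data_dir() {
    data_dir=$1

    # The data directory must not be the current working directory.
    if [ "$(cd "$data_dir" 2>/dev/null && pwd -P)" = "$(pwd -P)" ]; then
        echo "error: data dir must differ from the working directory" >&2
        return 1
    fi

    # Both files named in this guide should exist and be non-empty.
    status=0
    for f in uniprot_sprot.pep Pfam-A.hmm.gz; do
        if [ ! -s "$data_dir/$f" ]; then
            echo "missing or empty: $data_dir/$f" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns 0 when the directory passes both checks, so it can be chained as `check_trinotate_data_dir /path/to/TRINOTATE_DATA_DIR && echo ok`.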

If EggnogMapper is installed, those additional data resources will be incorporated as well.

Once the creation step completes, all required database resources will be installed in the TRINOTATE_DATA_DIR.

The Trinotate sqlite database for your current Trinotate analysis will be created according to what you specified for the --db parameter.

For convenience, if you set the TRINOTATE_DATA_DIR environment variable (e.g. export TRINOTATE_DATA_DIR=/path/to/your/TRINOTATE_DATA_DIR), you shouldn't need to specify --trinotate_data_dir in subsequent Trinotate commands that require it.

You will also find a boilerplate.sqlite database within the TRINOTATE_DATA_DIR. The TRINOTATE_DATA_DIR and boilerplate.sqlite can be reused for future Trinotate runs, in which case you can just copy and rename the boilerplate.sqlite database in a new working directory instead of having to rerun --create for future Trinotate runs that would leverage the same set of database resources.
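Reusing the boilerplate database amounts to a guarded copy. The sketch below assumes the file is named `boilerplate.sqlite` as stated above (verify the exact file name in your own data directory); `new_trinotate_db` is a hypothetical helper name, not a Trinotate command.

```shell
# Hypothetical helper: start a new analysis from the boilerplate database.
# Usage: new_trinotate_db /path/to/TRINOTATE_DATA_DIR myTrinotate.sqlite
new_trinotate_db() {
    data_dir=$1
    new_db=$2

    # Never overwrite a database that may already hold loaded results.
    if [ -e "$new_db" ]; then
        echo "error: $new_db already exists; not overwriting" >&2
        return 1
    fi

    # Assumption: the boilerplate file name matches the one given above.
    cp "$data_dir/boilerplate.sqlite" "$new_db"
}
```

The copied file then takes the place of the database that --create would otherwise have produced, and is passed to later commands through --db.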

3. Initialize your Trinotate sqlite database with your sequence data

The following inputs are required for Trinotate:

transcripts.fasta : your target transcriptome in fasta format
coding_seqs.pep : translated coding regions in fasta format (specific header formatting is required - see below; most users generate this file with TransDecoder)
gene_to_trans_map.tsv : pairwise mappings between gene and transcript isoform identifiers

If a Trinity reconstructed transcriptome is the target, then the transcripts.fasta and gene_to_trans_map.tsv are the final products of running Trinity. The coding_seqs.pep is derived from running TransDecoder to predict coding regions within the transcripts.
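Trinity normally writes the gene-to-transcript map itself, so the following is only meant to illustrate the layout of that file. It is a sketch under two assumptions that are not stated in this guide: that the map has two tab-separated columns (gene identifier, then transcript identifier), and that transcript identifiers follow the Trinity convention of a gene identifier plus an `_i<N>` isoform suffix. `make_gene_trans_map` is a hypothetical helper name.

```shell
# Hypothetical helper: derive a gene-to-transcript map from FASTA headers.
# Usage: make_gene_trans_map transcripts.fasta > gene_to_trans_map.tsv
make_gene_trans_map() {
    awk '/^>/ {
        id = substr($1, 2)            # first header token, without ">"
        gene = id
        sub(/_i[0-9]+$/, "", gene)    # assumed Trinity-style isoform suffix
        print gene "\t" id            # assumed order: gene, then transcript
    }' "$1"
}
```

For transcriptomes from other assemblers the suffix rule will not apply, and the map has to come from whatever gene and isoform relationships that tool reports.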

ALTERNATIVELY, if you have a reference genome and an annotation in GTF or GFF3 format (eg. derived from EVidenceModeler, PASA, or other genome annotation system), you can run the included 'Trinotate_GTF_or_GFF3_annot_prep.pl' like so to generate the above three inputs:

$TRINOTATE_HOME/util/Trinotate_GTF_or_GFF3_annot_prep.pl

###########################################################################
#
# Required:
#
#  --annot <string>        GTF or GFF3-formatted annotation file (must end in gtf or gff3)
#                             and should include CDS annotations in addition to transcripts and genes.
#  
#  --genome_fa <string>    genome fasta file
#
#  --out_prefix <string>   output prefix
#
###########################################################################

Given the above three input files, initialize your Trinotate sqlite database like so:

 Trinotate --db <sqlite.db> --init \
           --gene_trans_map <file> \
           --transcript_fasta <file> \
           --transdecoder_pep <file>
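Before running --init, it can help to confirm that the map and the transcript FASTA agree with each other. The sketch below assumes a two-column, whitespace-separated map with the transcript identifier in the second column, and takes the first token of each FASTA header as the transcript identifier; `check_gene_trans_map` is a hypothetical helper, not part of Trinotate.

```shell
# Hypothetical helper: list map entries whose transcript is not in the FASTA.
# Usage: check_gene_trans_map gene_to_trans_map.tsv transcripts.fasta
check_gene_trans_map() {
    map=$1
    fasta=$2
    awk '
        # First file (the FASTA): remember every transcript identifier.
        FILENAME == ARGV[1] {
            if (/^>/) in_fasta[substr($1, 2)] = 1
            next
        }
        # Second file (the map): column 2 is assumed to be the transcript.
        NF >= 2 && !($2 in in_fasta) {
            print "not in FASTA: " $2
            bad = 1
        }
        END { exit bad + 0 }
    ' "$fasta" "$map"
}
```

The function prints one line per unmatched transcript and returns a non-zero status if any were found, so a clean run produces no output.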

4. Running Sequence Analyses

To run the sequence analyses and database searches, simply run Trinotate like so:

    Trinotate --db <sqlite.db> --CPU <int> \
               --transcript_fasta <file> \
               --transdecoder_pep <file> \
               --trinotate_data_dir /path/to/TRINOTATE_DATA_DIR \
               --run "swissprot_blastp swissprot_blastx pfam signalp6 tmhmmv2 infernal EggnogMapper" \
               --use_diamond

where, under --run, the analyses to perform are given as a quoted, space-separated list. The tools required for each analysis must be installed as described above.

Setting '--run ALL' is shorthand for, and equivalent to, the --run list above.

The -E or --evalue parameter sets the E-value threshold for BLAST searches (default: 1e-5).

When the Trinotate --run is used to perform analyses, the results are automatically loaded into the Trinotate sqlite database.

If you need to run TmHMM or signalP separately (due to licensing issues), instructions are provided separately here.

5. Generating the Trinotate Report

The Trinotate annotation report is generated using the --report parameter like so:

      Trinotate --db <sqlite.db> --report [ -E (default: 1e-5) ] 
                    [--pfam_cutoff DNC|DGC|DTC|SNC|SGC|STC (default: DNC=domain noise cutoff)] 
                    [--incl_pep] 
                    [--incl_trans]

An example command might look like so:

     Trinotate --db myTrinotate.sqlite --report > myTrinotate.tsv

The report is a tab-delimited output with the following columns:

0       #gene_id
1       transcript_id
2       sprot_Top_BLASTX_hit
3       infernal
4       prot_id
5       prot_coords
6       sprot_Top_BLASTP_hit
7       Pfam
8       SignalP
9       TmHMM
10      eggnog
11      Kegg
12      gene_ontology_BLASTX
13      gene_ontology_BLASTP
14      gene_ontology_Pfam
15      transcript # optional, use --incl_trans
16      peptide # optional, use --incl_pep

The formatting of the data fields is somewhat intuitive and relatively easy to parse. Missing data or NULL results are indicated by '.' placeholders.

If EggnogMapper is included, its results are further integrated into this tabulated output as an expanded set of columns that are easily identified as derived from EggnogMapper.
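Because missing values are written as '.', simple summaries of the report can be produced with standard tools. The sketch below counts distinct transcripts that have a value in a given column. It looks columns up by the header names listed above rather than by position, so it should tolerate the expanded column set; `count_with_value` is a hypothetical helper name, and the sketch assumes the first line of the report is the header starting with `#gene_id`.

```shell
# Hypothetical helper: count distinct transcripts with a non-'.' value
# in the named report column.
# Usage: count_with_value myTrinotate.tsv sprot_Top_BLASTX_hit
count_with_value() {
    report=$1
    column=$2
    awk -F '\t' -v col="$column" '
        NR == 1 {
            sub(/^#/, "", $1)              # header line starts with "#gene_id"
            for (i = 1; i <= NF; i++) {
                if ($i == col) value_idx = i
                if ($i == "transcript_id") id_idx = i
            }
            next
        }
        value_idx && id_idx && $value_idx != "." { seen[$id_idx] = 1 }
        END {
            if (!value_idx || !id_idx) exit 2   # column name not found
            n = 0
            for (id in seen) n++
            print n
        }
    ' "$report"
}
```

An exit status of 2 means the requested column name was not found in the header. Note that awk numbers fields from 1, whereas the column listing above is numbered from 0.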
