VirusGenesAnnotation - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Workshop: Virus gene annotation

Overview

Viruses, despite their great abundance and significance in biological systems, remain largely mysterious. The vast majority of the perhaps hundreds of millions of viral species on the planet remain undiscovered, and many virus genomes deposited in central databases such as GenBank and RefSeq are littered with genes annotated as ‘hypothetical protein’ or the equivalent [1]. An efficient virus discovery and annotation tool is therefore needed to find both familiar and divergent viral sequences in user-supplied contigs. Cenote-Taker 3 uses a flexible set of modules to automatically annotate the sequence features of contigs, providing more gene information than comparable tools. Its outputs include readable, interactive genome maps, virome summary tables, and files that can be submitted directly to GenBank, which together facilitate virus discovery, annotation, and expansion of the known virome. Visit Cenote-Taker3 for more information.


Cenote-Taker 3 is a virus bioinformatics tool that scales from individual genome sequences to massive metagenome assemblies. It can:

Identify sequences containing genes specific to viruses (virus hallmark genes)

Annotate virus sequences including:

---a) adaptive ORF calling

---b) a large catalog of HMMs from virus gene families for functional annotation

---c) hierarchical taxonomy assignment based on hallmark genes

---d) mmseqs2-based CDD database search

---e) tabular (.tsv) and interactive genome map (.gbf) outputs

Objectives

  • Learn how to download and use Cenote-Taker3 to annotate genomes.

  • Learn how to visualize the gene annotation results.

Software and Databases

Cenote-Taker3 and its built-in databases

WinSCP (for Windows) / FileZilla (for Mac)

Geneious (download from: https://www.geneious.com/updates)

Steps

Install Cenote-Taker3

  • Using conda or mamba (recommended):

    Cenote-Taker3 has strict dependency requirements, so we create a dedicated environment for it.

  1. Use conda/mamba to install the bioconda package.
conda create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
  2. Activate the conda environment.
conda activate ct3_env
  3. Change to the directory where you'd like to install the databases and run the database script, specifying the database directory with -o.
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
  4. Set the database directory as a conda environment variable (use the absolute path of your own database directory), then reactivate the environment so that the variable takes effect.
conda env config vars set CENOTE_DBS=/home/zhaohailong/ct3_DBs
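The database setup above can be sanity-checked before the first run. The following is a minimal sketch, not part of Cenote-Taker 3: `check_ct3_dbs` is a hypothetical helper that only tests whether the directory exists and contains anything at all. It does not verify that every database downloaded completely.

```shell
# Sketch: confirm that a Cenote-Taker 3 database directory looks usable.
# check_ct3_dbs is a hypothetical helper, not a Cenote-Taker 3 command.
check_ct3_dbs() {
    db_dir="$1"
    if [ -z "$db_dir" ]; then
        echo "CENOTE_DBS is empty; set it and reactivate the environment" >&2
        return 1
    fi
    if [ ! -d "$db_dir" ]; then
        echo "database directory not found: $db_dir" >&2
        return 1
    fi
    # A directory with no entries suggests the download did not complete.
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "database directory is empty: $db_dir" >&2
        return 1
    fi
    echo "database directory looks populated: $db_dir"
}

# Usage, after `conda deactivate` and `conda activate ct3_env`:
#   check_ct3_dbs "$CENOTE_DBS"
```

If the function reports an empty variable, the environment probably was not reactivated after `conda env config vars set`.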

Note: The databases total 3.0 GB after decompression. The download can take several hours and is prone to interruption. To save time, you can use my environment directly, which is already fully configured.

conda activate /home/zhaohailong/miniconda3/envs/ct3_env

Running Cenote-Taker 3

Make sure the conda environment is activated.

Help Menu

You can see all parameters and their explanations with -h

cenotetaker3 -h

Run with your contigs data

Use your own data or the demo data (COVID-19 genomes):

cenotetaker3 -c /home/zhaohailong/Cenote-Taker3/covid-seq/all.fa -r my_meta_ct3 -p T

-c  Your input contigs file in FASTA format

-r  Name of this run; a directory with this name will be created

-p  Whether to prune prophages (prune_prophage); T or F
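Before starting a run, it can help to confirm that the input file given to -c is a non-empty FASTA file. The sketch below assumes only the FASTA convention that each record begins with a `>` header line. `check_fasta` is a hypothetical helper name, not a Cenote-Taker 3 command.

```shell
# Sketch: basic pre-flight check on the FASTA file passed to -c.
# check_fasta is a hypothetical helper, not part of Cenote-Taker 3.
check_fasta() {
    fasta="$1"
    if [ ! -s "$fasta" ]; then
        echo "missing or empty file: $fasta" >&2
        return 1
    fi
    # Each FASTA record starts with a header line beginning with '>'.
    n=$(grep -c '^>' "$fasta" || true)
    if [ "$n" -eq 0 ]; then
        echo "no FASTA header lines found in: $fasta" >&2
        return 1
    fi
    # Print the number of records so it can be compared with the run summary.
    echo "$n"
}

# Usage:
#   check_fasta all.fa && cenotetaker3 -c all.fa -r my_meta_ct3 -p T
```

The printed record count can later be compared with the number of contigs Cenote-Taker 3 reports as searched. Contigs shorter than the minimum length are not included in that number.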

If the software runs successfully, you’ll see output like the following in your terminal:

this script dir: /home/zhaohailong/miniconda3/envs/ct3_env/lib/python3.12/site-packages/cenote
FASTA checked.
my_meta_ct3
time update: configuring run directory  10-25-24---19:24:49
Your specified arguments:
Cenote-Taker version:              3.3.2
original contigs:                  Cenote-Taker3/covid-seq/all.fa
title of this run:                 my_meta_ct3
output directory:                  /home/zhaohailong/my_meta_ct3
Prune prophages?                   True
CPUs used for run:                 104
Annotation only?                   False
minimum circular contig length:    1000
minimum linear contig length:      1000
virus hallmark type(s) to count:   virion rdrp
min. viral hallmarks for linear:   1
min. viral hallmarks for circular: 1
Wrap contigs?                      True
HMM db version                     v3.1.1
ORF Caller:                        prodigal-gv
Cenote DBs directory:              /home/zhaohailong/ct3_DBs
Cenote scripts directory:          /home/zhaohailong/miniconda3/envs/ct3_env/lib/python3.12/site-packages/cenote
Template file:                     /home/zhaohailong/miniconda3/envs/ct3_env/lib/python3.12/site-packages/cenote/dummy_template.sbt
read file(s):                      none
HHsuite tool:                      none
Taxonomy DB:                       ct3_hallmark.taxDB
Sequencing Technology:             Illumina
Max seq length to assess DTRs:     1000000
 
time update: running pyrodigal on all contigs  10-25-24---19:24:49
pyrodigal part finished in 0.31 seconds
time update: running pyhmmer on all ORFs  10-25-24---19:24:52
...
...
...
time update: Making virus summary table 10-25-24---19:27:53
5 contigs over 1000 nt were searched and 5 viruses were detected and annotated.
In all, 90% of virus genes were annotated with functional information.
5 contig(s) over 10kb went through pruning module and 5 were shortened by pruning.

Cenote-Taker finishing now 10-25-24---19:27:55
output: my_meta_ct3
This Cenote-Taker run finished in 0:03:10.120000
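If the terminal output is saved to a file, for example by appending `2>&1 | tee run_output.txt` to the command, the headline numbers can be pulled out afterwards. This is a rough sketch that assumes the summary line is worded exactly as in the example output above. `summarize_ct3_run` and `run_output.txt` are hypothetical names, and the wording may differ in other versions.

```shell
# Sketch: extract contig and virus counts from saved terminal output.
# summarize_ct3_run is a hypothetical helper; it assumes the summary line
# reads "<N> contigs over ... searched and <M> viruses were detected ...".
summarize_ct3_run() {
    sed -n 's/^\([0-9][0-9]*\) contigs over .* searched and \([0-9][0-9]*\) viruses were detected.*/contigs_searched=\1 viruses_detected=\2/p' "$1"
}

# Usage:
#   cenotetaker3 -c all.fa -r my_meta_ct3 -p T 2>&1 | tee run_output.txt
#   summarize_ct3_run run_output.txt
```

If the function prints nothing, the run did not reach the summary step or the message wording differs from the one assumed here.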

Output of Cenote-Taker 3

{run_title}/
│   {run_title}_virus_summary.tsv                 <- main summary file for each virus
│   {run_title}_virus_sequences.fna               <- all virus genome seqs
│   {run_title}_virus_AA.faa                      <- all virus AA seqs
│   {run_title}_prune_summary.tsv                 <- summary of pruning of each sequence
│   final_genes_to_contigs_annotation_summary.tsv <- annotation info, all genes
│   run_arguments.txt                             <- arguments used in this run
│   {run_title}_cenotetaker.log                   <- main log file
│
└───sequin_and_genome_maps/
│   │   {run_title}*gbf                           <- genome maps
│   │   {run_title}*fsa                           <- genome sequence
│   │   {run_title}*gtf                           <- feature table gtf format
│   │   {run_title}*tbl                           <- feature table sequin format
│   │   {run_title}*sqn                           <- non-human-readable sequin file for GenBank sub
│   │   {run_title}*cmt                           <- sequin comment file
│
└───ct_processing/
    │   --- many intermediate files ---
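The layout above can be checked automatically after a run. The following sketch tests for a few of the files listed in the tree. `check_ct3_outputs` is a hypothetical helper, and the set of files it checks is an illustrative choice rather than a definition of a successful run.

```shell
# Sketch: verify that key Cenote-Taker 3 outputs exist in a run directory.
# check_ct3_outputs is a hypothetical helper; file names follow the tree above.
check_ct3_outputs() {
    run_dir="$1"
    run_title="$2"
    missing=0
    for f in \
        "${run_title}_virus_summary.tsv" \
        "${run_title}_virus_sequences.fna" \
        "${run_title}_virus_AA.faa" \
        "final_genes_to_contigs_annotation_summary.tsv" \
        "sequin_and_genome_maps"
    do
        if [ ! -e "$run_dir/$f" ]; then
            echo "missing: $run_dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

# Usage:
#   check_ct3_outputs my_meta_ct3 my_meta_ct3 && echo "outputs present"
```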

Review the file content and understand its meaning.

less -S my_meta_ct3_virus_summary.tsv

For better readability, align the columns with the following command:

column -t -s $'\t' my_meta_ct3_virus_summary.tsv | less -S
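Besides paging through the table, it is often useful to list the column names with their positions and to count the data rows. The sketch below works on any tab-separated file with a single header line. `tsv_headers` and `tsv_row_count` are hypothetical helper names, and nothing is assumed about the actual column names in the summary file.

```shell
# Sketch: inspect the structure of a tab-separated summary table.
# tsv_headers and tsv_row_count are hypothetical helpers.

# Print each header name with its column number, one per line.
tsv_headers() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Count non-empty lines after the header line.
tsv_row_count() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   tsv_headers my_meta_ct3_virus_summary.tsv
#   tsv_row_count my_meta_ct3_virus_summary.tsv
```

The column numbers printed by `tsv_headers` can be passed to `cut -f` to pull out individual columns.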

Visualize the genome map output file *.gbf

  1. Transfer your *.gbf file from the BGI server to your local laptop.

(1) Log in to the bastion host ( https://uomc.genomics.cn/shterm/#/business/resourceaccess ) in your web browser.

(2) Use a file transfer tool (FileZilla for Mac, WinSCP for Windows) to transfer files between your local computer and the server.

  2. Visualize the *.gbf file in Geneious. The information displayed varies across versions: the paid version shows more detail, while the free version shows limited information. Other tools, such as UCSC Genome Browser, IGV, and SnapGene Viewer, can also do this, but some of them require converting the file format first.
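A quick text-level look at a *.gbf file is also possible on the server before opening it in a viewer. The sketch below assumes the file is in GenBank flat file format, where each record starts with a `LOCUS` line and gene products appear as `/product="..."` qualifiers. `summarize_gbf` is a hypothetical helper. It counts matching lines only, so qualifiers wrapped across lines are counted once.

```shell
# Sketch: count records and product annotations in a GenBank flat file.
# summarize_gbf is a hypothetical helper; it assumes GenBank flat file layout.
summarize_gbf() {
    gbf="$1"
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    records=$(grep -c '^LOCUS' "$gbf" || true)
    products=$(grep -c '/product=' "$gbf" || true)
    hypothetical=$(grep -c '/product="hypothetical protein"' "$gbf" || true)
    printf 'records=%s products=%s hypothetical=%s\n' \
        "$records" "$products" "$hypothetical"
}

# Usage:
#   summarize_gbf my_meta_ct3/sequin_and_genome_maps/some_map.gbf
```

Comparing the `hypothetical` count with the `products` count gives a rough sense of how many genes received a functional annotation.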

Reference

[1] Michael J Tisza, Anna K Belford, Guillermo Domínguez-Huerta, Benjamin Bolduc, Christopher B Buck, Cenote-Taker 2 democratizes virus discovery and sequence annotation, Virus Evolution, Volume 7, Issue 1, January 2021, veaa100, https://doi.org/10.1093/ve/veaa100