Genome Annotation Tools - Green-Biome-Institute/AWS GitHub Wiki
Introduction
Genome annotation is another nontrivial problem within the realm of bioinformatics. Approaches vary, but most pipelines for annotating a genome assembly share the same general steps:
- Repeat Identification and Masking
- Structural Annotation
  - Ab initio or evidence-driven gene predictors
  - Post-processing of gene prediction (further analysis of exons, introns, consensus sequences, etc.)
- Functional Annotation (attaching biological information to the gene or protein sequences previously predicted)
  - Homology Search
  - Protein sequence analysis
  - Post-processing of homology search (gene products and their interactions, statistics, etc.)
The Tools
Several tools are available for each of these steps. Here we introduce some of the more commonly used ones, as seen in the literature referenced in the resources below.
RepeatMasker
From the README:
"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or by WU-Blast developed by Warren Gish."
RepeatMasker and its dependencies are installed/set up on the GBI AMI. To install RepeatMasker yourself, follow the instructions here: https://www.repeatmasker.org/RepeatMasker/.
It accepts files in the .fasta format: enter the command followed by the FASTA file you want masked. Two commonly used options are --species <query species>, which masks the file using a pre-existing repeat database for the given species, and --lib [filename], which masks the file using a custom library/database.
ex.
RepeatMasker --species <query species> your-assembly.fasta
or
RepeatMasker --lib [filename] your-assembly.fasta
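After a masking run it can be useful to check how much of the assembly was actually masked. The following is a minimal sketch, not part of RepeatMasker itself: the function name masked_fraction and the toy FASTA content are hypothetical, and in practice you would point the function at the masked file that RepeatMasker produced (typically the input name with .masked appended). It counts N bases as hard-masked and any remaining lowercase bases as soft-masked.

```shell
# Report total, hard-masked (N) and soft-masked (lowercase) base counts for a FASTA file.
# Hypothetical helper for illustration; not a RepeatMasker command.
masked_fraction() {
  LC_ALL=C awk '
    /^>/ { next }                      # skip FASTA header lines
    {
      total += length($0)              # count bases before anything is removed
      hard  += gsub(/[Nn]/, "")        # gsub returns the number of N/n it removed
      soft  += gsub(/[a-z]/, "")       # remaining lowercase bases are soft-masked
    }
    END {
      if (total == 0) { print "no sequence found"; exit 1 }
      printf "total=%d hard=%d soft=%d masked=%.2f\n", total, hard, soft, (hard + soft) / total
    }
  ' "$1"
}

# Toy input (made up): one hard-masked contig and one soft-masked contig.
toy_fasta=$(mktemp)
cat > "$toy_fasta" <<'EOF'
>contig1
ACGTNNNNACGT
>contig2
acgtACGTACGT
EOF

masked_summary=$(masked_fraction "$toy_fasta")
echo "$masked_summary"
rm -f "$toy_fasta"
```

A masked fraction close to zero may mean that the chosen species database or custom library does not match your organism well.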
Ab Initio Gene Prediction with AUGUSTUS
One note before introducing AUGUSTUS: "While Augustus and SNAP are the most popular tools for ab initio prediction, they still necessitate the information of the closely related gene and genome model for screening against the newly sequenced genome." (2). So while ab initio prediction is a commonly used practice, it may not work for many of our de novo assemblies. Evidence-based gene prediction is covered after AUGUSTUS.
AUGUSTUS
From the website:
"AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences... It can be used as an ab initio program, which means it bases its prediction purely on the sequence. AUGUSTUS may also incorporate hints on the gene structure coming from extrinsic sources such as EST, MS/MS, protein alignments and syntenic genomic alignments. Since version 3.0 AUGUSTUS can also predict the genes simultaneously in several aligned genomes (see README-cgp.md)."
AUGUSTUS is already installed on the GBI AMI. To install it yourself, follow the instructions on their GitHub here: https://github.com/Gaius-Augustus/Augustus.
The basic command is:
augustus [parameters] --species=SPECIES queryfilename
To see the current list of species that AUGUSTUS has parameters for, use
augustus [parameters] --species=help
As another example, to write the AUGUSTUS results to an output file in GFF3 format (from (1)):
augustus --species=species.name --gff3=on genome.fasta > output.file
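Once the output file exists, a quick sanity check is to count how many features of each type were predicted. The sketch below assumes a tab-separated GFF3 file with the feature type in the third column; the function name count_features and the toy records are hypothetical stand-ins for real AUGUSTUS output, whose exact feature names may differ between versions.

```shell
# Count GFF3 features by type (column 3), ignoring comment and malformed lines.
# Hypothetical helper for illustration.
count_features() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }            # skip comments and lines without 9 columns
    { n[$3]++ }                        # tally each feature type
    END { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' "$1" | sort
}

# Toy GFF3 (made up): two genes with three CDS segments in total.
toy_gff=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'contig1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1\n'
  printf 'contig1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tParent=g1\n'
  printf 'contig1\tAUGUSTUS\tCDS\t600\t900\t0.95\t+\t0\tParent=g1\n'
  printf 'contig2\tAUGUSTUS\tgene\t50\t500\t0.80\t-\t.\tID=g2\n'
  printf 'contig2\tAUGUSTUS\tCDS\t50\t500\t0.80\t-\t0\tParent=g2\n'
} > "$toy_gff"

feature_counts=$(count_features "$toy_gff")
gene_count=$(printf '%s\n' "$feature_counts" | awk -F'\t' '$1 == "gene" { print $2 }')
echo "$feature_counts"
rm -f "$toy_gff"
```

On a real run you would call count_features output.file instead of building the toy file.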
Evidence-Driven Gene Prediction with BRAKER
From the GitHub repository:
"BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET and AUGUSTUS in novel eukaryotic genomes."
BRAKER is already operational on the GBI AMI. To install it yourself, follow the instructions either at https://github.com/Gaius-Augustus/BRAKER#installation or in the GBI AMI documentation at https://github.com/Green-Biome-Institute/AWS/wiki/AWS-GBI-AMI-Documentation.
The command looks like:
braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}
where genome.fa is your assembly file, and the last two options (use one or the other) supply the RNA alignment data or protein sequence data that you may have.
Some other relevant options that are said to be frequently used:
- --species=sname for creating output files relevant to a specific species
- --softmasking for when your input file is soft-masked (a possible result of the masking software where, instead of replacing the masked nucleotides with Ns, it replaces them with the lowercase letter for the given nucleotide [A->a, G->g, etc.])
- --cores for setting the maximum number of cores on your computer that you would like to allocate for this analysis
For more, use the help option:
braker.pl --help
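To show how these options combine, here is a small dry-run sketch that only prints a BRAKER command rather than running it. The function name build_braker_cmd, the file names, and the rule of choosing --bam or --prot_seq from the file extension are all assumptions made for illustration; option names can also differ between BRAKER versions, so check braker.pl --help on your installation before running the printed command.

```shell
# Print (do not run) a braker.pl command for the given inputs.
# Hypothetical helper; arguments: genome file, evidence file, species name, cores.
build_braker_cmd() {
  genome=$1
  evidence=$2
  species=$3
  cores=$4

  # Assumption: a .bam extension means RNA alignments, anything else is protein FASTA.
  case "$evidence" in
    *.bam) evidence_opt="--bam=$evidence" ;;
    *)     evidence_opt="--prot_seq=$evidence" ;;
  esac

  printf 'braker.pl --genome=%s %s --species=%s --softmasking --cores=%s\n' \
    "$genome" "$evidence_opt" "$species" "$cores"
}

# Example with placeholder names: soft-masked assembly plus RNA alignments.
braker_cmd=$(build_braker_cmd genome.fa rnaseq.bam my_species 8)
echo "$braker_cmd"
```

Printing the command first makes it easy to review the chosen options before committing a long-running job to them.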
Homology Search
"To investigate gene function or predict evolutionary associations between related sequences, newly assembled sequences are compared with gene sequences with known functions to find sequences with high homology." (1).
Several tools can be used here. The most common is BLAST, which we use to query the predicted genes from our newly assembled genomes.
To do this for nucleotide sequences, we will use the blastn command, which is already set up on the GBI AMI. To install it yourself, go to the NCBI website and follow their instructions: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
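As a rough sketch of how a blastn search and its post-processing might look: the commented command below requests tabular output (-outfmt 6), and the runnable part keeps only the highest-scoring hit for each query. The file names, the database name, the e-value cutoff, the function name best_hits, and the toy hit table are all hypothetical. The default tabular format has 12 columns, with the query ID in column 1, the subject ID in column 2, and the bit score in column 12.

```shell
# A search producing tabular output could look like this (placeholder names):
#   blastn -query predicted_genes.fasta -db reference_db -evalue 1e-5 -outfmt 6 -out hits.tsv

# Keep the best hit per query: sort by query, then by bit score (column 12)
# in descending order, and print only the first row seen for each query.
best_hits() {
  sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1" | awk -F'\t' '!seen[$1]++'
}

# Toy hit table (made up): query g1 has two hits, query g2 has one.
toy_hits=$(mktemp)
{
  printf 'g1\trefB\t85.0\t180\t27\t0\t5\t184\t1\t180\t1e-40\t150\n'
  printf 'g1\trefA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t360\n'
  printf 'g2\trefC\t91.2\t150\t13\t0\t1\t150\t30\t179\t2e-60\t220\n'
} > "$toy_hits"

# Collect the subject IDs of the best hits as a comma-separated list.
best_subjects=$(best_hits "$toy_hits" | cut -f2 | paste -sd, -)
echo "$best_subjects"
rm -f "$toy_hits"
```

Keeping one hit per predicted gene gives a compact table to carry into the functional annotation steps, though you may prefer to keep several hits per query when the top scores are close.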
Resources:
(1) Kong, J., Huh, S., Won, J. I., Yoon, J., Kim, B., & Kim, K. (2019). GAAP: A Genome Assembly + Annotation Pipeline. BioMed Research International, 2019. https://doi.org/10.1155/2019/4767354
(2) Jung, H., Ventura, T., Sook Chung, J., Kim, W. J., Nam, B. H., Kong, H. J., Kim, Y. O., Jeon, M. S., & Eyun, S. Il. (2020). Twelve quick steps for genome assembly and annotation in the classroom. PLoS Computational Biology, 16(11), 1–25. https://doi.org/10.1371/journal.pcbi.1008325